Preface
In today’s data-driven world, the ability to analyze and interpret data has become an essential skill for individuals and organizations alike. Statistical analysis, which involves using mathematical methods to analyze and draw
conclusions from data, is one of the most powerful tools available for this purpose.
While statistical analysis can be performed using various software programs, Microsoft Excel remains one of the
most widely used tools for data analysis. Its user-friendly interface, versatile features, and widespread availability
make it a popular choice for data analysis, especially for those who are new to statistical analysis.
This book, “Mastering Statistical Analysis Using Excel,” is designed to provide readers with a comprehensive
guide to using Excel for statistical analysis. Whether you are a beginner or an experienced user of Excel, this
book will help you master the fundamentals of statistical analysis and learn how to use Excel to perform a wide
range of statistical analyses.
The book is organized into chapters that cover different statistical techniques, starting with basic descriptive statistics and progressing to more advanced techniques such as hypothesis testing, regression analysis, and ANOVA.
Each chapter includes clear explanations of the concepts, step-by-step instructions for performing the analysis in
Excel, and examples to illustrate how to apply the techniques to real-world data.
Throughout the book, we focus on practical applications of statistical analysis, with a particular emphasis on using Excel to solve real-world problems. We also include tips and tricks for optimizing your use of Excel, including keyboard shortcuts, Excel functions, and add-ins that can help streamline your analysis.
We believe that this book will be a valuable resource for anyone looking to improve their skills in statistical
analysis using Excel. Whether you are a student, a business professional, or a researcher, the techniques and tools
covered in this book will help you gain valuable insights from your data and make informed decisions based on
your findings.
Contents

Chapter 1
  Introduction .... 7
  Definition of data .... 7
  Features of ideal data analysis software .... 8
  Why software should be used for data analysis .... 9
  Preparing Excel for data analysis .... 10
  Components of the data analysis plug-in .... 15

Chapter 2
  Percentages .... 16
  Pitfalls .... 19
  Variants of percentage calculations .... 20
  Example of percent increase and decrease formula .... 22

Chapter 3
  Role of ratios in statistical analysis .... 28
  Pitfalls in ratio analysis .... 29
  Using Excel to calculate ratios .... 30
  Types of ratios .... 35
  Probability ratio .... 36
  Efficiency ratio .... 40
  Using Excel to calculate the efficiency ratio .... 41
  Liquidity ratio .... 42
  Performance ratio .... 45
  Growth ratio .... 47
  Leverage ratio .... 47

Chapter 4
  Overview of datasets .... 52
  Frequency tables and graphs .... 52
  Line graphs, bar graphs, and polygons .... 56
  Frequency distribution of a dataset .... 59
  Structured datasets .... 83
  Unstructured datasets .... 84
  Time series datasets .... 84
  Cross-sectional datasets .... 85
  Longitudinal datasets .... 86
  Panel datasets .... 87
  Spatial datasets .... 87
  Simulation datasets .... 88
  Graph datasets .... 89
  Analyzing the dataset .... 90

Chapter 5
  Data description .... 92
  Identifying variables .... 92
  Data distribution .... 93
  Analyzing data distribution using Excel .... 95

Chapter 6
  Single Factor ANOVA .... 104

Chapter 7
  ANOVA: Two-factor with replication .... 110
  Pitfalls .... 117
  Scenarios for using this evaluation .... 117
  Advantages .... 118

Chapter 8
  ANOVA: Two-factor without replication .... 120
  Using Excel to perform this test .... 123
  Pitfalls .... 126

Chapter 9
  Correlation .... 128
  Using Excel to calculate correlation .... 131
  Creating a scatterplot to identify correlation .... 134
  Types of correlation .... 136

Chapter 10
  Descriptive statistics .... 156
  Measures of central tendency .... 157
  Mean .... 157
  Using Excel to calculate the mean of a dataset .... 158
  Median .... 162
  Mode .... 165
  Measures of variability .... 167
  Range .... 167
  Using Excel to calculate the range of a dataset .... 168
  Calculating variance using Excel .... 170
  Standard deviation .... 171
  Using Excel to calculate standard deviation .... 172
  Frequency distribution .... 174
  Histograms .... 182
  Scatterplots .... 188
  Measures of association .... 194
  Chi-square test .... 197
  Odds ratio .... 199
  Using the Descriptive Statistics function .... 204

Chapter 11
  Chi-square test .... 208

Chapter 12
  Exponential smoothing .... 220
  Performing exponential smoothing using Excel .... 221

Chapter 13
  F-Test: Two-sample for variances .... 226

Chapter 14
  Fourier analysis .... 240
  Using Excel to perform Fourier analysis .... 241

Chapter 15
  Histogram .... 246
  Steps to understand a histogram .... 247
  Types of data distribution seen in histograms .... 248
  Bimodal histogram .... 256
  Uniform distribution histogram .... 258

Chapter 16
  Moving average .... 260
  Advantages of the moving average .... 260
  Calculating a moving average using Excel .... 261
  Manual calculation of a moving average using Excel .... 267

Chapter 17
  Random number generation .... 274
  Uses of random numbers in statistics .... 274
  Generating random numbers using Excel .... 275

Chapter 18
  Rank and percentile .... 282

Chapter 19
  Regression .... 290
  Simple linear regression .... 290
  Multiple linear regression .... 296

Chapter 20
  Sampling .... 300
  Random sampling .... 300
  Performing random sampling using Excel .... 301
  Sampling errors .... 305
  Cluster sampling .... 312
  Convenience sampling .... 313

Chapter 21
  T-Test: Two-sample assuming equal variances .... 314

Chapter 22
  T-Test: Paired two-sample for means .... 324

Chapter 23
  T-Test: Two-sample assuming unequal variances .... 328

Chapter 24
  Z-Test: Two-sample for means .... 332

Chapter 25
  Pivot tables .... 338
  Ways to query large data using pivot tables .... 339
  Data filtering .... 340
  Drilling down to details .... 344
  Creating calculated fields .... 345
  Pivot slicers .... 357

Chapter 26
  Data cleaning .... 364
  Identifying data inconsistencies .... 365
  Removing duplicate data .... 366
  Locating blank data .... 368
  Correcting errors and misspellings .... 375
  Removing outliers .... 378
  Percentile-based method for identifying outliers .... 383
  Data transformation .... 385
  Data integration .... 389

Chapter 27
  Data visualization .... 394
  Line graphs .... 398
  Scatterplots .... 400
  Pie charts .... 405
  Doughnut charts .... 412
  3-D pie charts .... 414
  Stacked charts .... 416
  Heat maps .... 417
  Tree maps .... 420
  Geographic maps .... 423
  Bubble charts .... 426

Chapter 28
  Data mining .... 430
  Types of data mining .... 431
  Clustering data .... 437
  Association rule mining .... 440
  Anomaly detection .... 445
  Sequence mining .... 452

Chapter 29
  Importing data into Excel .... 460

Chapter 30
  Data transformation .... 465

Chapter 31
  The Analyze Data tab .... 470

Chapter 32
  The grouping function in Excel .... 474
About the Author

Prof. Dr. Balasubramanian Thiagarajan, M.S., D.L.O.
Former Registrar, The Tamilnadu Dr MGR Medical University, Guindy, Chennai
Former Professor and Head, Department of Otolaryngology, Stanley Medical College, Chennai
Currently Dean, Sri Lalithambigai Medical College, Madurovoil, Chennai

Author contact email: [email protected]
Chapter 1: Introduction
Definition of Data:

Data refers to a collection of facts, figures, statistics, or other pieces of information that are typically stored in a structured or unstructured format. Data can be in various forms such as numbers, text, images, videos, or audio recordings. It is a fundamental building block of information, and its value lies in its ability to be analyzed and processed to extract insights and knowledge.
In today’s digital age, data is generated at an unprecedented rate from various sources, including
sensors, social media, mobile devices, and internet activity. This data is often big, complex, and
diverse, requiring advanced tools and techniques for its management and analysis. The field of
data science has emerged to help individuals and organizations leverage the power of data to drive
innovation, improve decision-making, and solve complex problems.
Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover
useful information, draw conclusions, and support decision-making. It involves applying statistical and computational techniques to large and complex datasets to identify patterns, trends, and
relationships.
The main steps involved in data analysis include:
1. Data collection: Gathering data from various sources such as surveys, experiments, or databases.
2. Data cleaning: Scrubbing and verifying data to ensure its accuracy and completeness.
3. Data transformation: Organizing, formatting, and aggregating data in preparation for analysis.
4. Data modeling: Applying statistical or machine learning algorithms to uncover patterns and
relationships within the data.
5. Data visualization: Representing the analyzed data in charts, graphs, or other visual formats to
help communicate the findings.
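As a miniature illustration, the five steps above can be sketched in a few lines of plain Python. The tiny step-count dataset is invented for illustration, and the "model" and "chart" are deliberately trivial:

```python
from statistics import mean

# 1. Data collection: a small, invented survey of daily step counts
raw = ["8100", "7600", None, "9900", "7600", "102000", "8400"]

# 2. Data cleaning: drop missing entries and convert text to numbers
cleaned = [int(x) for x in raw if x is not None]

# 3. Data transformation: remove an obvious data-entry outlier
transformed = [x for x in cleaned if x < 50000]

# 4. Data modeling: the simplest possible "model" -- the sample mean
model_mean = mean(transformed)

# 5. Data visualization: a bare-bones text bar chart
for value in transformed:
    print(f"{value:>6} {'#' * (value // 1000)}")
print(f"mean = {model_mean:.0f}")
```

Real analyses replace each step with heavier tools (databases, cleaning routines, statistical models, charting software), but the pipeline shape stays the same.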
Data analysis is used in various fields, including business, finance, healthcare, education, and social
sciences, to gain insights into trends and patterns that can inform decision-making and help organizations achieve their goals.
Importance of data analysis:
Data analysis is crucial in today’s information-driven world for several reasons:
1. Better decision-making: Data analysis helps individuals and organizations make better-informed
decisions by providing insights into patterns, trends, and relationships within the data. By understanding the data, decision-makers can identify opportunities, mitigate risks, and optimize outcomes.
2. Improved performance: Data analysis enables organizations to measure and track performance
metrics, identify areas for improvement, and optimize processes to achieve better outcomes. This can
help companies become more efficient, reduce costs, and improve customer satisfaction.
3. Competitive advantage: Data analysis can provide organizations with a competitive edge by uncovering insights that others may not be aware of. By using data to inform decisions, companies can
develop innovative products, target specific customer segments, and respond quickly to changes in the
market.
4. Predictive capabilities: Data analysis can help organizations make predictions about future events
or trends by analyzing historical data. This can be useful in a variety of contexts, including finance,
healthcare, and marketing.
5. Personalization: Data analysis enables organizations to personalize their products or services to
better meet the needs of individual customers. By analyzing customer data, companies can identify
preferences, behavior patterns, and other factors that can inform product development, marketing
strategies, and customer service.
Overall, data analysis is a critical tool for individuals and organizations that want to make better decisions, improve performance, and stay ahead of the competition.
Introduction to data analysis software:
Data analysis software is a type of computer program designed to help users collect, manage, analyze, and visualize large sets of data. These programs are used in a wide range of industries, including
finance, healthcare, marketing, and scientific research, among others. Some popular data analysis
software includes Microsoft Excel, R, Python, SAS, SPSS, and Tableau.
Microsoft Excel is a spreadsheet program that allows users to organize and analyze data using various
functions and tools. It is commonly used in business and finance to perform financial analysis, create
budgets, and track expenses.
R and Python are programming languages commonly used in data analysis and statistical modeling.
They offer a wide range of statistical tools, data visualization libraries, and machine learning algorithms. These languages are popular among researchers and data scientists for their flexibility and
ability to handle large datasets.
SAS and SPSS are statistical analysis software programs used in a variety of industries, including
healthcare, government, and finance. They provide advanced statistical analysis tools, data visualization options, and reporting capabilities.
Tableau is a business intelligence and data visualization software used to create interactive visualizations, dashboards, and reports. It allows users to connect to various data sources and create compelling data stories.
In summary, data analysis software provides tools and techniques for analyzing and visualizing data,
allowing users to gain insights and make informed decisions based on their findings. The choice of
software depends on the type of analysis required, the size of the dataset, and the user’s familiarity
with the software.
Features of ideal data analysis software:
An ideal data analysis software should have the following features:
1. Data Management: The software should have features to import, export, and manage large datasets. It should allow the user to clean and preprocess the data, including handling missing data,
removing duplicates, and transforming variables.
2. Statistical Analysis: The software should provide a wide range of statistical tools and techniques for
analyzing data, including descriptive statistics, hypothesis testing, regression analysis, and time-series analysis.
3. Data Visualization: The software should provide various tools to create compelling visualizations
of the data, including charts, graphs, and interactive dashboards. It should allow the user to customize the visuals to meet their specific needs.
4. Machine Learning: The software should provide a range of machine learning algorithms for predictive modeling and classification tasks. It should allow the user to create models and evaluate their
performance using appropriate metrics.
5. Ease of Use: The software should be user-friendly and easy to navigate. It should have a simple
interface, clear documentation, and offer training and support for users.
6. Compatibility: The software should be compatible with different operating systems, file formats,
and data sources. It should allow the user to connect to different databases, spreadsheets, and cloud
platforms.
7. Security: The software should have robust security features to protect sensitive data. It should offer
encryption, user authentication, and access controls.
8. Integration: The software should integrate with other tools and applications, such as Excel, PowerPoint, and other data visualization and reporting software.
In summary, ideal data analysis software should provide a comprehensive set of features for managing, analyzing, and visualizing data. It should be user-friendly, compatible, and secure, and it should integrate
with other tools and applications.
with other tools and applications.
Advantages of using Excel as a statistical tool:
Microsoft Excel is a widely used spreadsheet software that provides several advantages as a data analysis tool. Some of the advantages are:
1. Familiarity: Many people are familiar with Excel as it is widely used in business and academic
settings. Its user-friendly interface, coupled with basic knowledge of Excel, enables users to quickly
perform data analysis without having to learn new software.
2. Versatility: Excel is versatile and can be used for a wide range of data analysis tasks. It can handle
large datasets, perform complex calculations, and generate charts, graphs, and pivot tables.
3. Customization: Excel provides a lot of customization options for data visualization, including formatting and design of charts, graphs, and tables. Users can create their own templates and formatting
styles to meet their specific needs.
4. Easy to Share: Excel spreadsheets can be easily shared with others through email or cloud-based
platforms, allowing multiple users to collaborate and work on the same file.
5. Integration: Excel integrates with other Microsoft Office applications, such as Word and PowerPoint, allowing users to easily copy and paste data or visuals across different documents.
6. Macros and Add-ons: Excel provides functionality for creating macros and add-ons, which allow
users to automate tasks and perform complex analysis using third-party software.
7. Cost-effective: Excel is relatively inexpensive compared to other data analysis software, making it
an affordable option for small businesses or individuals.
Why should software be used for data analysis?
Data analysis software can be incredibly valuable for several reasons:
1. Efficiency: Data analysis software can process large amounts of data much more quickly and accurately than manual methods, saving time and reducing errors.
2. Visualization: Most data analysis software includes data visualization tools, which can help you
better understand your data and communicate your findings to others.
3. Complexity: Many types of data are too complex to analyze manually, and require specialized software to extract meaningful insights.
4. Reproducibility: Data analysis software allows you to easily document and reproduce your analysis, which is critical for scientific research and business decision-making.
5. Automation: Many data analysis software programs include automation tools that can streamline
repetitive tasks, freeing up time for more complex analysis.
Overall, data analysis software can help you extract valuable insights from your data in a faster, more
accurate, and more reproducible way than manual analysis methods.
Preparing Excel for data analysis:

Enabling the Excel Analysis ToolPak:

To perform complex statistical analysis, the user can save time by using the Analysis ToolPak. It is not enabled in Excel by default; the user needs to enable this add-in.

1. Click the File tab and choose Options from the submenu.
2. Select the Add-ins category.
3. In the Manage box, select Excel Add-ins and click the Go button.
4. In the Add-ins box that opens, tick the Analysis ToolPak check box and click the OK button. This enables the plug-in.
Image showing Options submenu listed under File Menu
Image showing the Add-ins submenu
Image showing Excel Add-ins selected before clicking the Go button
Image showing the add-ins to be enabled, selected by checking their respective boxes. Clicking the OK button enables the add-ins.
Image showing the Data Analysis plug-in enabled
During the process of enabling the Data Analysis plug-in, it is advisable to stay connected to the Internet, since in some situations files need to be downloaded from Microsoft's servers. Once the Data Analysis plug-in is successfully installed and enabled, it is displayed in the top ribbon.
Using a data analysis plug-in in Excel can provide several advantages, including:
1. Improved efficiency: With the use of data analysis plug-ins, you can perform complex data analysis
tasks in a more efficient manner. These plug-ins can help you automate repetitive tasks, reducing the
time and effort required to perform data analysis.
2. Increased accuracy: Data analysis plug-ins can help eliminate errors that can occur due to manual data entry or calculations. By automating tasks, these plug-ins can help reduce the likelihood of
errors and improve the accuracy of your analysis.
3. Better insights: With the help of data analysis plug-ins, you can uncover insights that may be hidden in your data. These plug-ins can help you analyze data more comprehensively, allowing you to
identify trends, patterns, and correlations that you might otherwise miss.
4. Improved visualization: Data analysis plug-ins can help you create more compelling visualizations
of your data, making it easier to communicate your findings to others. By creating charts, graphs,
and other visual representations of your data, you can make your analysis more accessible and easier
to understand.
5. Increased flexibility: With data analysis plug-ins, you can customize your analysis to fit your specific needs. Whether you need to perform complex calculations, filter data, or create custom charts,
data analysis plug-ins can help you do it quickly and easily.
Components of a data analysis plug-in:
The components of a data analysis plugin in Excel can vary depending on the specific plugin being
used. However, here are some common components:
1. Data import and management: This component allows you to import data from various sources,
clean and prepare the data for analysis, and manage it in Excel.
2. Descriptive statistics: This component provides a range of descriptive statistics, such as mean, median, mode, variance, standard deviation, and correlation coefficients.
3. Inferential statistics: This component allows you to perform hypothesis testing, including t-tests,
ANOVA, and regression analysis.
4. Data visualization: This component allows you to create various charts and graphs, such as histograms, scatterplots, and box plots, to help you visualize your data.
5. Predictive analytics: This component uses machine learning algorithms to make predictions based
on historical data. This can include forecasting, clustering, and classification.
6. What-if analysis: This component allows you to perform scenario analysis, sensitivity analysis, and
goal seeking to help you understand how different variables might impact your analysis.
7. Optimization: This component allows you to optimize your data analysis by finding the best solution to a problem using mathematical models and algorithms.
These components can be used individually or in combination to help you perform various types of
data analysis tasks within Excel.
Chapter 2: Percentages
Computations in Ancient Rome were frequently conducted using fractions that were multiples of 1/100, long before the decimal system came into existence. One notable example was Augustus' imposition of a tax of 1/100 on items sold at auction, referred to as centesima rerum venalium. The use of such fractions for computations was essentially the equivalent of calculating percentages.
The origin of the term “percent” can be traced back to the Latin phrase “per centum,” which means
“by the hundred” or “hundred.” The symbol for “percent” gradually developed from the Italian
phrase “per cento,” which means “for a hundred.” The abbreviation “p.” was often used for “per,” but
eventually disappeared. The word “cento” was then shortened to two circles separated by a horizontal line, which is the basis for the modern “%” symbol.
Percentages are an important tool in biostatistics for communicating and analyzing data. In the
field of biostatistics, percentages are commonly used to describe the prevalence or incidence of a
disease or condition within a population. Percentages are also used to summarize data in research
studies, such as the proportion of patients in a clinical trial who responded to a particular treatment.
In addition, percentages are used to calculate relative risk, odds ratios, and other measures of
association between variables in biostatistics. For example, percentages can be used to compare the
proportion of patients who experienced a particular outcome in different treatment groups in a
clinical trial.
Furthermore, percentages are useful in the presentation of data and can help to make complex
statistical information more accessible and understandable to a wide range of audiences. When
presenting data in tables or graphs, percentages can be used to highlight important patterns or
differences between groups.
Overall, percentages are an important tool in biostatistics that can help researchers and healthcare
professionals to better understand and communicate data related to health and disease.
Scenarios in which percentages can be used:
Percentages are used in statistics to summarize and communicate data in a variety of scenarios,
some of which include:
1. Describing the prevalence of a disease or condition: In epidemiology, percentages are commonly
used to describe the proportion of individuals in a population who have a particular disease or condition.
One possible dataset that could be used to calculate the prevalence of a disease using percentage
would be a survey of a population that includes questions about whether or not individuals have been
diagnosed with the disease in question. For example, a survey could ask:
“Have you ever been diagnosed with [disease]?”
The responses to this question could be used to calculate the prevalence of the disease in the population. For instance, if the survey is conducted among a population of 1,000 individuals and 100 of them
respond that they have been diagnosed with the disease, the prevalence can be calculated as follows:
Prevalence = (Number of people with the disease / Total number of people surveyed) x 100
Prevalence = (100 / 1000) x 100
Prevalence = 10%
Thus, the prevalence of the disease in this population would be 10%.
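The same arithmetic is easy to check outside Excel. Here is a minimal sketch in Python, using the survey figures from the example above:

```python
def prevalence(cases: int, surveyed: int) -> float:
    """Prevalence of a disease as a percentage of the surveyed population."""
    return (cases / surveyed) * 100

# 100 diagnosed individuals out of 1,000 surveyed
print(prevalence(100, 1000))  # prints 10.0, i.e. a prevalence of 10%
```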
2. Reporting survey results: In social sciences, percentages are often used to report the results of surveys, such as the percentage of people who support a particular policy or hold a certain belief.
3. Analyzing clinical trial data: In clinical trials, percentages are used to describe the proportion of
patients who experience a particular outcome, such as a side effect or response to treatment.
One possible dataset that could be used to analyze clinical trial data using percentages would be a
dataset that includes information about the number of patients who experienced different treatment
outcomes in a randomized controlled trial. For example, a dataset could include information about:
The number of patients in each treatment group (e.g. experimental treatment group and control
group)
The number of patients in each treatment group who experienced the desired treatment outcome (e.g.
complete remission, partial remission, or no response)
The number of patients in each treatment group who experienced adverse events (e.g. nausea, fatigue,
headache)
Using this data, percentages could be calculated to analyze the efficacy and safety of the experimental
treatment. For instance, the following percentages could be calculated:
Response rate: The percentage of patients in each treatment group who experienced the desired treatment outcome. This could be calculated as follows:
Response rate = (Number of patients with desired treatment outcome / Total number of patients in
the treatment group) x 100
Adverse event rate: The percentage of patients in each treatment group who experienced adverse
events. This could be calculated as follows:
Adverse event rate = (Number of patients with adverse events / Total number of patients in the
treatment group) x 100
By comparing the response rate and adverse event rate between the experimental treatment group
and the control group, researchers can draw conclusions about the efficacy and safety of the experimental treatment.
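The two rates can be computed side by side for each arm, as in the following sketch. All patient counts below are hypothetical, invented purely to illustrate the formulas:

```python
# Hypothetical trial counts (all figures invented for illustration)
arms = {
    "experimental": {"n": 120, "responders": 66, "adverse": 18},
    "control":      {"n": 115, "responders": 38, "adverse": 14},
}

for name, counts in arms.items():
    response_rate = counts["responders"] / counts["n"] * 100
    adverse_rate = counts["adverse"] / counts["n"] * 100
    print(f"{name}: response rate {response_rate:.1f}%, "
          f"adverse event rate {adverse_rate:.1f}%")
```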
4. Calculating risk: Percentages are used to calculate risk in medical research, such as the percentage of patients who experience a certain adverse event or the percentage of individuals in a population who develop a particular disease.
One possible dataset that could be used to calculate clinical risk using percentages is a dataset that
includes information about patient characteristics and outcomes for a particular disease or condition. For example, a dataset could include information about:
Age, gender, and other demographic information for each patient
Medical history, including comorbidities and risk factors
Laboratory values and other clinical measures
Outcomes, such as hospitalization, morbidity, mortality, or other adverse events
Using this data, percentages could be calculated to assess the risk of adverse events or outcomes for
specific patient populations. For example:
Mortality rate: The percentage of patients who died within a specified time period. This could be
calculated as follows:
Mortality rate = (Number of deaths / Total number of patients) x 100
Hospitalization rate: The percentage of patients who were hospitalized within a specified time period. This could be calculated as follows:
Hospitalization rate = (Number of hospitalizations / Total number of patients) x 100
Morbidity rate: The percentage of patients who experienced a specified adverse event or outcome
within a specified time period. This could be calculated as follows:
Morbidity rate = (Number of patients with adverse event or outcome / Total number of patients) x
100
By analyzing these percentages for different patient subgroups based on demographics, medical
history, and other factors, clinicians and researchers can identify patients who are at higher risk for
adverse events and outcomes, and develop targeted interventions to improve patient outcomes.
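The subgroup comparison can be sketched in Python as follows. The cohort below is invented; the same formula yields the mortality, hospitalization, or morbidity rate depending on which event count is supplied:

```python
def event_rate(events: int, patients: int) -> float:
    """Percentage of patients who experienced a given event."""
    return events / patients * 100

# Hypothetical cohort split into two age subgroups:
# (label, number of patients, adverse outcomes)
cohort = [("under 60", 250, 15), ("60 and over", 150, 27)]

for label, n, events in cohort:
    print(f"{label}: morbidity rate {event_rate(events, n):.1f}%")
```

Comparing the two printed rates shows at a glance which subgroup carries the higher risk, which is exactly the comparison the prose above describes.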
5. Comparing groups: Percentages can be used to compare the proportion of individuals in different groups who have a particular characteristic or experience a particular outcome.
One possible dataset that could be used to compare two groups using percentages is a dataset that
includes information about a binary outcome for each individual in each group. For example, a dataset
could include information about:
Two groups (e.g., treatment group and control group)
A binary outcome (e.g., success or failure of a treatment, presence or absence of a disease, etc.)
The number of individuals in each group with the binary outcome
Using this data, percentages could be calculated to compare the prevalence of the binary outcome in
each group. For example:
Success rate: The percentage of individuals in each group who experienced success. This could be
calculated as follows:
Success rate = (Number of individuals with success in the group / Total number of individuals in the
group) x 100
Failure rate: The percentage of individuals in each group who experienced failure. This could be calculated as follows:
Failure rate = (Number of individuals with failure in the group / Total number of individuals in the
group) x 100
By comparing the success rate and failure rate between the two groups, researchers can draw conclusions about the effectiveness of the treatment, or the prevalence of the disease in the two groups.
For example, if the success rate in the treatment group is significantly higher than the success rate in
the control group, this may suggest that the treatment is effective in improving the binary outcome.
Conversely, if the failure rate in the treatment group is significantly lower than the failure rate in the
control group, this may suggest that the treatment is effective in preventing the binary outcome.
6. Presenting data: Percentages are commonly used in tables and graphs to present data in a clear and
concise manner.
Overall, percentages are a versatile tool in statistics that are used in many different scenarios to communicate and analyze data.
Pitfalls:
While percentages are a useful tool in statistics, there are also several pitfalls to be aware of:
Misleading representation of small sample sizes: Percentages can be misleading when based on small
sample sizes. For example, if only a few individuals are included in a study, a small change in the number of individuals with a particular characteristic can lead to a large change in the reported percentage.
Omitting important context: Percentages can be misleading if important context is omitted. For example, a percentage may be reported without noting the sample size or the criteria used to define the group being described.
Confusing correlation with causation: Percentages can be misleading when used to describe correlations between variables without considering other factors that may be influencing the relationship. For example, a high percentage of individuals who smoke may be associated with a higher
risk of lung cancer, but this does not necessarily mean that smoking causes lung cancer.
Failing to account for multiple comparisons: Percentages can be misleading if multiple comparisons are made without adjusting for the increased probability of finding a significant result by
chance.
Using inappropriate denominators: Percentages can be misleading if the denominator used is not
appropriate for the question being asked. For example, reporting the percentage of patients who
experienced a side effect without also reporting the total number of patients in the study can be
misleading.
Overall, while percentages can be a useful tool in statistics, it is important to use them carefully
and in conjunction with other methods to avoid these pitfalls.
Let us calculate a percentage using Excel:
Assuming a student has answered 40 questions out of 50 correctly, in order to calculate the percentage of correct answers the user should:
1. Click on any blank cell.
2. Type =40/50 inside the cell and press the ENTER key (the = sign tells Excel to treat the entry as a formula). It returns a decimal. To increase the number of decimal places that appear in the result, the Increase Decimal icon is clicked; in the same way, to decrease the number of decimal places, the Decrease Decimal icon is clicked.
3. To convert the decimal value to a percentage, the cell should be selected and, on the Home tab, the % icon clicked.
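The same arithmetic Excel performs in these steps can be sketched outside Excel:

```python
correct, total = 40, 50

fraction = correct / total   # what Excel displays before formatting: 0.8
percentage = fraction * 100  # what the % format effectively displays: 80.0

print(f"{correct}/{total} = {percentage:.0f}%")  # 40/50 = 80%
```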
Variants of percentage calculation:
There are several variants of percentage calculations, including:
1. Percent increase/decrease: This measures the percentage change between two values. The formula is: percent change = (new value - old value) / old value * 100
If the result is positive, it represents a percent increase, and if it’s negative, it represents a percent
decrease.
Prof. Dr Balasubramanian Thiagarajan MS D.L.O.
Image showing marks of students entered in the spreadsheet
Image showing the entered marks converted to percentage by clicking on the percentage icon
2. Percent of a whole: This measures what percentage one value is of another value. The formula is:
percent of a whole = (part / whole) * 100
For example, if you want to find out what percentage 75 is of 100, the calculation would be:
percent of a whole = (75 / 100) * 100 = 75%
3. Percentages as proportions: This measures a percentage as a fraction or proportion of a whole. The
formula is:
percentage as a proportion = percentage / 100
For example, if you want to find out what fraction 20% represents, the calculation would be:
percentage as a proportion = 20 / 100 = 0.2
4. Gross and net percentages: Gross percentages are calculated relative to the original value, while net percentages are calculated relative to the modified value. For example, if a price is marked up by 20% over cost, the gross percentage is 20% (relative to the cost), but the net percentage (the markup as a share of the final price) is 20/120 = 16.67%.
These are just a few examples of the many ways that percentage calculations can be used.
Examples for percent increase/decrease formula:
1. Percent increase:
Old value: 100
New value: 150
Percent increase = (150 - 100) / 100 * 100 = 50%
In this example, the new value is 50% higher than the old value.
2. Percent decrease:
Old value: 80
New value: 64
Percent decrease = (64 - 80) / 80 * 100 = -20%
In this example, the new value is 20% lower than the old value.
3. Percent change (positive and negative):
Old value: 200
New value: 150
Percent change = (150 - 200) / 200 * 100 = -25%
In this example, the new value is 25% lower than the old value.
4. No change:
Old value: 60
New value: 60
Percent change = (60 - 60) / 60 * 100 = 0%
In this example, there is no change between the old and new values.
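The percent increase/decrease formula used in the four examples above can be expressed as a small function; the assertions reproduce the worked results:

```python
def percent_change(old, new):
    """(new - old) / old * 100: positive = increase, negative = decrease."""
    return (new - old) / old * 100

assert percent_change(100, 150) == 50.0   # 1. percent increase
assert percent_change(80, 64) == -20.0    # 2. percent decrease
assert percent_change(200, 150) == -25.0  # 3. negative percent change
assert percent_change(60, 60) == 0.0      # 4. no change
```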
Where percentages should not be used:
While percentages can be a useful tool in many situations, there are some cases where percentages
should not be used or should be used with caution. Some examples include:
1. Small sample sizes: When dealing with small sample sizes, percentages can be misleading. For
example, if a survey is conducted with only a few respondents, the resulting percentages may not be
representative of the overall population.
2. Misleading averages: Percentages can be misleading when dealing with averages. For example, if a
company has two products with vastly different profit margins, calculating an overall percentage may
not accurately represent the profitability of each product.
3. Complex relationships: Percentages can oversimplify complex relationships between variables. For
example, in medical research, a percentage increase in a certain treatment may not account for the
impact of other variables, such as patient demographics or comorbidities.
4. Sensitivity to change: Percentages can be misleading when dealing with values that are sensitive to
change. For example, if a stock is trading at a very low value, a small percentage increase may appear
significant, but it may not be a significant change in absolute terms.
5. Irrelevant contexts: Percentages can be irrelevant or misleading when used in inappropriate contexts. For example, using percentages to describe the likelihood of a rare event, such as winning the
lottery, may be meaningless.
Overall, percentages can be a useful tool, but it’s important to use them with caution and to consider
the specific context and limitations of the data being analyzed.
Image showing old and new data entered. In the percent increase column the formula is typed, preceded by an = sign. On pressing the ENTER key the value gets displayed.
On pressing the ENTER key the exact percentage change can be seen displayed. Note the green dot (indicated by the red arrow). This is known as the fill handle. When the user pulls this handle downwards to include the cells below, the calculated values for the other rows get displayed in the empty cells.
Image showing the effect of pulling the handle downwards. The calculated percent increase values get displayed in their respective cells.
Image showing calculation of percent decrease using Excel. Note the formula entered in the cell
where the value needs to be displayed
Image showing the result of pressing the Enter key after keying in the calculation formula. The result is displayed inside the cell chosen for it. The subsequent values can be filled into the empty cells below automatically by pulling the green dot (the fill handle) downwards.
Image showing the result of pulling the handle downwards
Image showing the percent change cells filled with the calculated data when the handle of the first cell is pulled downwards. This is a nifty shortcut that Excel offers to automate the calculation process: the specified formula is applied automatically to the selected cells and the result is displayed in each.
3
Role of Ratios in Statistical Analysis

In statistical analysis, ratios play an important role in several ways:
1. Comparison of two or more quantities: Ratios are commonly used to compare two or more
quantities. For example, in financial analysis, the debt-to-equity ratio is used to compare the
amount of debt a company has to its equity.
2. Scale transformation: Ratios are also used in statistical analysis to transform data from one
scale to another. For example, the odds ratio is commonly used in medical research to transform
binary data (such as whether a patient has a disease or not) into a ratio that can be analyzed
statistically.
3. Normalization: Ratios can be used to normalize data, which means adjusting data to remove
the effects of different scales. This is commonly done in financial analysis by calculating ratios
such as return on investment (ROI) or profit margin, which take into account the size of the
investment or revenue being analyzed.
4. Correlation analysis: Ratios can also be used in correlation analysis to identify the relationship
between two or more variables. For example, the debt-to-equity ratio of a company may be correlated with its stock price, and this relationship can be analyzed using statistical techniques.
Overall, ratios are an important tool in statistical analysis as they help to make comparisons,
transform data, normalize data, and identify relationships between variables.
Advantages of using ratios in statistical analysis:
There are several advantages of using ratios in various fields such as finance, accounting, and
statistical analysis:
1. Comparison: Ratios provide an effective way to compare different companies, investments, or
financial statements. By comparing ratios, analysts can quickly identify strengths and weaknesses, and make informed decisions based on the data.
2. Simplification: Ratios can simplify complex data and present it in a more digestible format.
For example, a company’s financial statement may include a large amount of data, but ratios
can be used to summarize this information and present it in a concise manner.
3. Normalization: Ratios can help normalize data by accounting for differences in scale. This
is important because data can be difficult to compare when it is presented in different units or
currencies.
4. Identification of trends: Ratios can be used to identify trends over time. By comparing ratios across multiple periods, analysts can identify patterns and trends in a company’s financial performance.
5. Standardization: Ratios can be used to standardize data across different companies or industries. This allows for easier comparisons between companies or industries that may use different accounting methods or financial reporting standards.
Overall, the use of ratios in analysis can provide valuable insights into financial and business
performance. They can help to simplify complex data, enable meaningful comparisons, identify trends, and standardize data for easier analysis.
Pitfalls of ratio analysis:
While ratios can be a useful tool for analyzing data, there are several potential pitfalls to be
aware of:
Ignoring the context: Ratios can be misleading if they are not considered in the appropriate
context. For example, a high debt-to-income ratio may be problematic for an individual, but it
could be perfectly normal for a business. Therefore, it’s important to consider the context of the
ratio being analyzed.
Data quality issues: Ratios are only as good as the data they are based on. If the data is inaccurate or incomplete, the resulting ratios will be unreliable. It’s important to verify the quality of
the data before relying on ratios.
Outliers: Extreme values in the data can have a significant impact on ratios. For example, a single high-value transaction in a dataset could skew the average transaction value and therefore,
any ratios derived from it. It’s important to identify and handle outliers appropriately.
Correlation vs. causation: Ratios can show a correlation between two variables, but they do not
necessarily indicate causation. It’s important to be cautious when making assumptions about
causality based solely on ratios.
Appropriate use: Ratios may not always be the most appropriate tool for analyzing a particular dataset. It’s important to consider other analytical methods and choose the one that is best suited for the specific dataset and research question.
Overall, ratios can be a useful tool for data analysis, but it’s important to be aware of their limitations and potential pitfalls in order to use them effectively.
Using Excel to calculate ratio:
Performing ratio analysis using Excel involves a few basic steps:
Gather the relevant financial data: You will need to gather financial data such as income statements, balance sheets, and cash flow statements for the period you want to analyze.
Calculate the ratios: Once you have the financial data, you can calculate the ratios you want to
analyze. Common ratios include liquidity ratios (current ratio, quick ratio), profitability ratios
(return on assets, return on equity), and solvency ratios (debt-to-equity ratio, interest coverage
ratio).
Use Excel formulas: Excel can compute all of the common ratios with simple cell formulas. For example, to calculate the current ratio, divide current assets by current liabilities with a formula such as =B2/B3, where B2 holds current assets and B3 holds current liabilities (the cell references here are illustrative). Similarly, dividing net income by total assets in the same way calculates the return on assets ratio.
Format the results: Once you have calculated the ratios, you can format the results to make
them easier to read and interpret. You can use conditional formatting to highlight ratios that
are above or below a certain threshold, or you can use charts and graphs to visualize the trends
in the data.
Interpret the results: Finally, it’s important to interpret the results of the ratio analysis. You
should compare the ratios to industry benchmarks or historical data to determine if they are in
line with expectations. You should also consider the context of the ratios and any other relevant
factors that may affect the company’s financial performance.
Example:
Here are some sample data you can use to calculate ratios:
Company A Financial Data:
Total Assets: $10,000
Total Liabilities: $5,000
Total Equity: $5,000
Net Income: $2,000
Revenue: $20,000
Company B Financial Data:
Total Assets: $15,000
Total Liabilities: $7,000
Total Equity: $8,000
Net Income: $3,000
Revenue: $30,000
You can use this data to calculate various financial ratios such as:
Debt-to-Equity Ratio: Total Liabilities / Total Equity
Company A: 5,000 / 5,000 = 1
Company B: 7,000 / 8,000 = 0.875
Return on Equity (ROE): Net Income / Total Equity
Company A: 2,000 / 5,000 = 0.4 or 40%
Company B: 3,000 / 8,000 = 0.375 or 37.5%
Asset Turnover Ratio: Revenue / Total Assets
Company A: 20,000 / 10,000 = 2
Company B: 30,000 / 15,000 = 2
Gross Profit Margin: (Revenue - Cost of Goods Sold) / Revenue
Assume Company A has a Cost of Goods Sold of $10,000 and Company B has a Cost of Goods
Sold of $15,000.
Company A: (20,000 - 10,000) / 20,000 = 0.5 or 50%
Company B: (30,000 - 15,000) / 30,000 = 0.5 or 50%
Note that there are many other financial ratios you can calculate using different data points, but
the above ratios are some common examples.
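The ratios worked out above for Companies A and B can also be reproduced with a short script; the figures match the hand calculations:

```python
companies = {
    "A": {"assets": 10_000, "liabilities": 5_000, "equity": 5_000,
          "net_income": 2_000, "revenue": 20_000, "cogs": 10_000},
    "B": {"assets": 15_000, "liabilities": 7_000, "equity": 8_000,
          "net_income": 3_000, "revenue": 30_000, "cogs": 15_000},
}

ratios = {}
for name, d in companies.items():
    ratios[name] = {
        "debt_to_equity": d["liabilities"] / d["equity"],
        "roe": d["net_income"] / d["equity"],
        "asset_turnover": d["revenue"] / d["assets"],
        "gross_margin": (d["revenue"] - d["cogs"]) / d["revenue"],
    }

# Company A: debt-to-equity 1.0, ROE 0.4, asset turnover 2.0, gross margin 0.5
# Company B: debt-to-equity 0.875, ROE 0.375, asset turnover 2.0, gross margin 0.5
print(ratios)
```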
Steps to calculate Ratio using Excel:
Data entry: Data should be entered into the respective rows and columns as shown in the image below.
Image showing Financial data of two companies A and B entered into Excel.
Image showing the formula for Debt Equity ratio entered
Image showing Debt Equity ratio for both Company A and B calculated using Excel
Image showing formula for calculating the Asset Turnover Ratio entered
Image showing Asset Turnover Ratio of both companies calculated using Excel
Image showing formula for Return on Equity entered
Image showing Return on Equity calculated for Companies A and B.
Types of Ratios:
There are several types of ratios that can be calculated in statistics, including:
1. Financial ratios: These are used to analyze the financial performance of a company, and include ratios such as the debt-to-equity ratio, the price-to-earnings ratio, and the return on investment ratio.
2. Probability ratios: These are used to measure the likelihood of an event occurring, and include
ratios such as the odds ratio and the probability ratio.
3. Efficiency ratios: These are used to measure the efficiency of a company’s operations, and include
ratios such as the inventory turnover ratio and the accounts receivable turnover ratio.
4. Liquidity ratios: These are used to measure a company’s ability to meet its short-term financial
obligations, and include ratios such as the current ratio and the quick ratio.
5. Performance ratios: These are used to measure a company’s overall performance, and include ratios such as the return on assets ratio and the return on equity ratio.
6. Growth ratios: These are used to measure a company’s growth potential, and include ratios such as
the earnings per share growth ratio and the sales growth ratio.
7. Leverage ratios: These are used to measure a company’s level of debt, and include ratios such as the
debt ratio and the debt-to-assets ratio.
Overall, ratios are useful tools for analyzing and interpreting data in a variety of settings, and can
provide valuable insights into the performance and potential of a company or other entity.
Probability Ratio:
The probability ratio is a statistical measure used to compare the likelihood of an event occurring in two different groups. It is also known as the relative risk or risk ratio.
The probability ratio is calculated by dividing the probability of the event occurring in one group by
the probability of the event occurring in another group. The resulting ratio provides a measure of
how much more likely the event is to occur in one group compared to the other.
The formula for the probability ratio can be expressed as:
Probability Ratio = P(Event|Group 1) / P(Event|Group 2)
where P(Event|Group 1) represents the probability of the event occurring in Group 1, and
P(Event|Group 2) represents the probability of the event occurring in Group 2.
The probability ratio can be used in a variety of statistical applications, such as hypothesis testing
and logistic regression analysis. It is particularly useful in medical research, where it is often used to
compare the effectiveness of different treatments or interventions.
Example:
Here’s an example dataset that you can use to calculate probability ratios:
Suppose you have collected data on the outcomes of two treatments (A and B) for a particular
medical condition. You have data on 100 patients who received treatment A, and 100 patients who
received treatment B. Here’s a summary of the data:
Treatment A:
70 patients were cured
30 patients were not cured
Treatment B:
50 patients were cured
50 patients were not cured
To calculate the probability ratio for cure rate between Treatment A and Treatment B, you can use
the following formula:
Probability ratio = (probability of cure in Treatment A) / (probability of cure in Treatment B)
The probability of cure for Treatment A is 70/100 = 0.7, and the probability of cure for Treatment B is
50/100 = 0.5. Plugging these values into the formula, we get:
Probability ratio = 0.7 / 0.5 = 1.4
This means that the probability of cure with Treatment A is 1.4 times that with Treatment B.
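Once the cure probabilities are known, the probability ratio is a single division; the example above can be sketched as:

```python
cured_a, total_a = 70, 100  # Treatment A: 70 of 100 patients cured
cured_b, total_b = 50, 100  # Treatment B: 50 of 100 patients cured

p_cure_a = cured_a / total_a  # 0.7
p_cure_b = cured_b / total_b  # 0.5
probability_ratio = p_cure_a / p_cure_b

print(probability_ratio)  # 1.4 — cure is 1.4 times as likely with Treatment A
```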
Using Excel to calculate Probability Ratio:
Step 1 : Data Entry
Data should be entered into the columns and rows of Excel as shown below.
Image showing data entered into the spreadsheet
Step 2:
Entering the formula to calculate probability ratio.
Image showing formula for calculation of cure probability of treatment A entered.
Image showing Probability of cure for both modalities of treatment calculated
Image showing formula for calculating Probability ratio entered
Image showing Probability cure ratio calculated using Excel
Efficiency Ratio:
Efficiency ratio is a financial ratio that measures a company’s ability to use its assets and liabilities to
generate revenue. It is also known as the expense ratio or operating efficiency ratio. The efficiency
ratio is calculated by dividing a company’s operating expenses by its net revenue.
The efficiency ratio provides insight into how well a company is managing its expenses relative to its
revenue. A lower efficiency ratio indicates that a company is more efficient at generating revenue,
while a higher efficiency ratio suggests that a company is less efficient at generating revenue and may
have higher operating costs.
An efficient company will have a lower efficiency ratio, indicating that it is using its assets and liabilities effectively to generate revenue. On the other hand, an inefficient company will have a higher
efficiency ratio, indicating that it is using more resources than necessary to generate revenue.
The efficiency ratio is often used by analysts and investors to evaluate a company’s financial health
and operational efficiency. It can be compared with industry benchmarks or historical data to assess
a company’s performance relative to its peers or its own past performance.
Example:
Here’s an example dataset that you can use to calculate efficiency ratio:
Suppose you have the following financial data for a company:
Operating expenses: $200,000
Net revenue: $1,000,000
To calculate the efficiency ratio, you can use the following formula:
Efficiency ratio = Operating expenses / Net revenue
Plugging in the values from the example dataset, we get:
Efficiency ratio = $200,000 / $1,000,000 = 0.2
This means that for every dollar of revenue generated, the company is spending $0.20 on operating
expenses. A lower efficiency ratio indicates that the company is more efficient at generating revenue,
while a higher efficiency ratio suggests that the company is less efficient at generating revenue and
may have higher operating costs.
It’s worth noting that the efficiency ratio can vary widely between industries and companies, and
should be compared to industry benchmarks or historical data to assess a company’s performance
relative to its peers or its own past performance.
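The efficiency-ratio example above is a single division and can be sketched as:

```python
operating_expenses = 200_000
net_revenue = 1_000_000

efficiency_ratio = operating_expenses / net_revenue

# 0.2 — the company spends $0.20 on operating expenses per $1 of revenue
print(efficiency_ratio)
```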
Using Excel to calculate Efficiency Ratio:
Excel can easily be used to calculate the efficiency ratio.
Step 1: Data entry. Data is entered into Excel in rows and columns as shown below.
Image showing data entered into Excel and the formula for calculating the efficiency ratio. The formula can be seen entered into the cell as shown in the image.
Image showing the value displayed when ENTER key is pressed
Liquidity ratio:
A liquidity ratio is a financial metric that measures a company’s ability to meet its short-term debt
obligations. It is a measure of a company’s ability to convert its assets into cash quickly in order to
pay off its current liabilities.
The most common liquidity ratios are the current ratio and the quick ratio. The current ratio is calculated by dividing a company’s current assets by its current liabilities, while the quick ratio is calculated by subtracting inventory from current assets and dividing the result by current liabilities.
A higher liquidity ratio indicates that a company has a greater ability to pay off its short-term debts.
However, excessively high liquidity ratios can also indicate that a company is not using its assets
efficiently to generate profits. Therefore, it is important to consider other financial metrics, such as
profitability and efficiency ratios, when evaluating a company’s financial health.
Example:
Here’s an example of data for calculating the current ratio and the quick ratio:
Current assets: $100,000
Current liabilities: $50,000
Inventory: $20,000
To calculate the current ratio:
Current ratio = Current assets / Current liabilities
Current ratio = $100,000 / $50,000
Current ratio = 2.0
The current ratio in this example is 2.0, which means that the company has $2 in current assets for
every $1 in current liabilities.
To calculate the quick ratio:
Quick ratio = (Current assets - Inventory) / Current liabilities
Quick ratio = ($100,000 - $20,000) / $50,000
Quick ratio = $80,000 / $50,000
Quick ratio = 1.6
The quick ratio in this example is 1.6, which means that the company has $1.60 in quick assets (current assets minus inventory) for every $1 in current liabilities. This ratio is often considered a more
conservative measure of liquidity than the current ratio, as it excludes inventory which may take
longer to sell and convert into cash.
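The current-ratio and quick-ratio calculations above can be sketched as:

```python
current_assets = 100_000
current_liabilities = 50_000
inventory = 20_000

current_ratio = current_assets / current_liabilities              # 2.0
quick_ratio = (current_assets - inventory) / current_liabilities  # 1.6

print(current_ratio, quick_ratio)  # 2.0 1.6
```

The quick ratio is lower because it excludes inventory, the least liquid of the current assets.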
Excel can easily be used to calculate liquidity ratios. Of course, it does not have a preconfigured approach; the user needs to key in the formulae for calculating each liquidity ratio.
Step 1:
The user will have to enter the relevant data into the rows and columns of the Excel sheet as shown below.
Image showing the financial data of the company entered
Image showing formula to calculate current ratio entered
Image showing the current ratio value displayed on pressing the Enter key
Image showing formula for quick ratio entered
Image showing Quick ratio number displayed when ENTER key is pressed
Performance ratio:
A performance ratio is a financial metric used to assess a company’s operational efficiency and effectiveness. These ratios measure a company’s ability to generate profits from its operations, manage its
assets and liabilities, and use its resources efficiently.
There are several types of performance ratios, including profitability ratios, efficiency ratios, and
leverage ratios.
Profitability ratios measure a company’s ability to generate profits relative to its revenues, costs, and
investments. Examples of profitability ratios include gross profit margin, net profit margin, return on
assets (ROA), and return on equity (ROE).
Efficiency ratios measure how efficiently a company uses its resources, such as its assets, inventory,
and accounts receivable, to generate sales and profits. Examples of efficiency ratios include inventory
turnover, accounts receivable turnover, and asset turnover.
Leverage ratios measure a company’s ability to manage its debt and financial leverage, and its ability
to repay its creditors. Examples of leverage ratios include debt-to-equity ratio, interest coverage ratio,
and debt-to-assets ratio.
Overall, performance ratios provide insight into a company’s financial health and help investors and
analysts make informed decisions about the company’s future prospects.
Example:
Here’s an example of data for calculating some common performance ratios:
Revenue: $500,000
Cost of Goods Sold (COGS): $350,000
Net Income: $50,000
Total Assets: $1,000,000
Total Equity: $500,000
Total Liabilities: $500,000
Accounts Receivable: $100,000
Inventory: $75,000
To calculate some common performance ratios:
Gross Profit Margin:
Gross Profit Margin = (Revenue - COGS) / Revenue
Gross Profit Margin = ($500,000 - $350,000) / $500,000
Gross Profit Margin = 0.3 or 30%
The gross profit margin in this example is 30%, which means that the company generated $0.30 in
gross profit for every $1 in revenue.
Return on Assets (ROA):
ROA = Net Income / Total Assets
ROA = $50,000 / $1,000,000
ROA = 0.05 or 5%
The ROA in this example is 5%, which means that the company generated $0.05 in net income for
every $1 in assets.
Inventory Turnover Ratio:
Inventory Turnover Ratio = COGS / Average Inventory
Average Inventory = (Beginning Inventory + Ending Inventory) / 2
Assuming beginning and ending inventory are both $75,000:
Average Inventory = ($75,000 + $75,000) / 2
Average Inventory = $75,000
Inventory Turnover Ratio = $350,000 / $75,000
Inventory Turnover Ratio = 4.67
The inventory turnover ratio in this example is 4.67, which means that the company sold and replaced its inventory 4.67 times during the period.
Note that there are many other performance ratios that can be calculated using different financial
metrics, depending on the company’s industry, size, and other factors. These ratios can provide valuable insights into a company’s financial health and performance.
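The three performance ratios worked out above can be sketched as (assuming, as in the example, that beginning and ending inventory are both $75,000):

```python
revenue = 500_000
cogs = 350_000
net_income = 50_000
total_assets = 1_000_000
inventory = 75_000  # assumed equal at the beginning and end of the period

gross_profit_margin = (revenue - cogs) / revenue  # 0.3, i.e. 30%
roa = net_income / total_assets                   # 0.05, i.e. 5%
average_inventory = (inventory + inventory) / 2
inventory_turnover = cogs / average_inventory     # about 4.67

print(gross_profit_margin, roa, round(inventory_turnover, 2))
```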
Growth Ratio:
“Growth ratio” can refer to different ratios depending on the context, but generally, it is a financial
ratio that measures the rate of change in a company’s earnings, revenue, or other financial metrics
over a specified period.
One of the most common growth ratios is the “earnings growth ratio,” which measures the rate at
which a company’s earnings are growing from year to year. This ratio is calculated by dividing the
difference between the current year’s earnings and the previous year’s earnings by the previous year’s
earnings and multiplying by 100 to get a percentage.
For example, if a company had earnings of $100,000 in the previous year and earnings of $120,000 in
the current year, the earnings growth ratio would be:
(($120,000 - $100,000) / $100,000) * 100 = 20%
This means that the company’s earnings grew by 20% from the previous year.
Other growth ratios can include revenue growth ratio, net income growth ratio, and operating
income growth ratio, among others. The specific ratio used will depend on the context and what the
user is trying to measure.
Leverage ratio:
Leverage ratio is a financial ratio that measures the degree of a company’s debt financing in relation
to its equity financing. It is a metric that helps investors and analysts evaluate a company’s financial
risk and solvency.
The two most commonly used leverage ratios are the debt-to-equity ratio and the debt-to-total assets
ratio.
The debt-to-equity ratio compares the amount of debt a company has taken on to the amount of equity it has raised. This ratio is calculated by dividing the company’s total liabilities by its total equity.
For example, if a company has $500,000 in liabilities and $1,000,000 in equity, its debt-to-equity ratio
would be 0.5.
The debt-to-total assets ratio compares a company’s total debt to its total assets. This ratio is calculated by dividing the company’s total liabilities by its total assets. For example, if a company has
$500,000 in liabilities and $2,000,000 in assets, its debt-to-total assets ratio would be 0.25.
In general, a higher leverage ratio indicates that a company is more heavily indebted and therefore
more financially risky. However, the optimal leverage ratio for a company will depend on factors
such as its industry, business model, and growth prospects.
Example:
Here’s an example of how to calculate the debt-to-equity ratio and the debt-to-total assets ratio
for a hypothetical company:
Assume that the company has the following balance sheet:
Total liabilities: $500,000
Total equity: $1,000,000
Total assets: $1,500,000
To calculate the debt-to-equity ratio, we divide the company’s total liabilities by its total equity:
Debt-to-equity ratio = Total liabilities / Total equity
Debt-to-equity ratio = $500,000 / $1,000,000 = 0.5
This means that the company has $0.50 of debt for every $1 of equity.
To calculate the debt-to-total assets ratio, we divide the company’s total liabilities by its total assets:
Debt-to-total assets ratio = Total liabilities / Total assets
Debt-to-total assets ratio = $500,000 / $1,500,000 = 0.33
This means that the company has $0.33 of debt for every $1 of assets.
Both of these ratios suggest that the company has a moderate amount of debt relative to its equity
and assets. However, the appropriate level of debt for a company will depend on its specific circumstances, such as its industry, growth prospects, and risk tolerance.
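The two leverage ratios from the balance sheet above can be sketched as:

```python
total_liabilities = 500_000
total_equity = 1_000_000
total_assets = 1_500_000

debt_to_equity = total_liabilities / total_equity  # 0.5
debt_to_assets = total_liabilities / total_assets  # about 0.33

print(debt_to_equity, round(debt_to_assets, 2))
```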
Using Excel to calculate Leverage ratio:
The following steps need to be followed.
Data of the companies (financial data) should be entered into the columns and rows of Excel as shown below.
Image showing financial data of 3 companies entered. Note the formula to calculate the debt-to-equity ratio is also entered. On pressing the ENTER key the value would be displayed.
Image showing the Debt to Equity ratio displayed. Note the red arrow: it indicates a small handle (dot) which can be pulled down to populate the lower cells with the values calculated using the formula already entered.
Image showing the results of pulling the handle down thereby populating the lower cells with the
Debt to Equity ratio values.
4
Overview of Datasets
Presenting numerical findings of a study in a clear and concise manner is crucial, especially when
dealing with large data sets common in surveys or controlled experiments. A well-designed presentation can quickly reveal important features such as range, symmetry, concentration, and distribution. This
chapter will cover techniques, including tables and graphics, for effectively presenting data sets.
Frequency tables and graphs:
Excel can easily be used to enter data as well as to calculate the frequency table for the data below.
The following data represents the marks scored by a set of 30 students in maths.
In the first column the Roll number of the student who took the examination in Maths is entered.
These are consecutive numbers ranging from 1 to 30. This action can easily be automated in Excel by
using the formula =ROW(cell reference), in this case =ROW(A1).
The cell where the first number is to be generated is selected.
The following code is keyed into the cell:
=ROW(A1)
On pressing the ENTER key the number 1 gets displayed inside the cell where the formula was entered. At the bottom right corner of the cell a small dot can be seen; this is known as the fill handle
in Excel. By clicking and dragging the handle downwards, the user can populate the cells below with
consecutive numbers.
Image showing formula being keyed into the cell where the first number is to be generated. Note the
red arrow indicates the handle which when pulled downwards will populate the subsequent cells with
consecutive numbers
Image showing data entered in two columns
The user should ideally find out the number of students who have scored in the following ranges:
31-40
41-50
51-60
61-70
71-80
81-90
91-100
These ranges should be typed in a column as shown below:
Image showing marks range typed in column E
Image showing two columns created with Range and Frequency as headers. The formula to calculate
the frequency is entered in the frequency column as displayed.
The following formula needs to be entered into the cells as shown in the image above. Note that both
criteria ranges are made absolute so they do not shift when the formula is copied:
=COUNTIFS($B$2:$B$30,">=31",$B$2:$B$30,"<=40")
=COUNTIFS($B$2:$B$30,">=41",$B$2:$B$30,"<=50")
=COUNTIFS($B$2:$B$30,">=51",$B$2:$B$30,"<=60")
=COUNTIFS($B$2:$B$30,">=61",$B$2:$B$30,"<=70")
=COUNTIFS($B$2:$B$30,">=71",$B$2:$B$30,"<=80")
=COUNTIFS($B$2:$B$30,">=81",$B$2:$B$30,"<=90")
=COUNTIFS($B$2:$B$30,">=91",$B$2:$B$30,"<=100")
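The COUNTIFS counting logic can also be sketched outside Excel. The Python snippet below uses a hypothetical set of 30 marks (the book's actual scores sit in cells B2:B30 and are not reproduced here) purely to illustrate how each range is counted:

```python
# Hypothetical marks for 30 students, standing in for the Excel column B2:B30.
marks = [35, 38, 42, 47, 55, 58, 61, 63, 67, 70, 72, 74, 78, 80, 81,
         83, 85, 86, 88, 90, 91, 93, 95, 96, 99, 44, 52, 66, 77, 84]

# The same seven mark ranges used in the COUNTIFS formulas above.
ranges = [(31, 40), (41, 50), (51, 60), (61, 70), (71, 80), (81, 90), (91, 100)]

# Count how many marks fall inside each range, mirroring COUNTIFS(">=lo", "<=hi").
frequency = {f"{lo}-{hi}": sum(lo <= m <= hi for m in marks) for lo, hi in ranges}
for band, count in frequency.items():
    print(band, count)
```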
Line graphs, Bar graphs and Frequency Polygons:
Data from a frequency table can be graphically pictured by a line graph, which plots the successive
values on the horizontal axis and indicates the corresponding frequency by the height of a vertical
line.
A histogram can be used to create a visual representation of the frequency ranges. Two values should
be taken into consideration for this purpose.
1. Bin values: these define the ranges into which the data is to be classified. In this example 7 categories (bins) are needed.
2. Frequency values: the count of observations falling in each bin.

Bin   Marks range   Frequency
1     31-40         2
2     41-50         3
3     51-60         3
4     61-70         4
5     71-80         5
6     81-90         7
7     91-100        5
Excel can be used to create a histogram easily.
In Excel each one of these ranges is known as a bin; in this example there are 7 categories (bins).
The concept of a bin is used to create histograms.
First, the Data tab is clicked.
This exposes the Data Analysis tab. On clicking the Data Analysis tab, a window opens offering a list
of various calculations and graphs that can be generated or performed. In this window Histogram is
chosen. Then the OK button is clicked.
On clicking the OK button the Histogram menu opens up.
The Input Range field is clicked. As soon as the cursor starts to blink, the cells under the Range column are selected; the selected cell addresses then get entered in the Input Range box.
If labels were included when selecting the cells of the column, then the Labels box should be
checked.
Image showing the Data tab which on clicking reveals the Data analysis tab as marked with red circles.
Image showing Data Analysis window where Histogram is chosen and OK button is clicked
Next the Bin Range field is clicked. As soon as the cursor starts to blink, the cells under the BIN
column are selected. As soon as the selection is made, the cell addresses appear in the Bin Range field.
Next the Output Range option should be selected. When the cursor starts to blink, the cells where the
user desires to display the graph are selected. On clicking the OK button the graph will be displayed in
the cells selected as the output range.
The Chart output box should be checked before clicking on the OK button.
Image showing Data Analysis window where the different fields have been entered.
Image showing Histogram generated
Frequency distribution of a dataset:
A frequency distribution of data is a summary of how often different values or ranges of values occur
within a dataset. It is a way to organize and present raw data in a tabular form, showing the frequency or count of each value or group of values in a dataset.
To create a frequency distribution, the dataset is first sorted into groups or classes, each representing
a range of values. The number of data points falling into each group is then counted, and this count
is referred to as the frequency of that group. The frequency distribution can be presented in a table or
a graph, which allows for easy visualization of the distribution of the data.
Frequency distributions are useful for understanding the patterns and characteristics of a dataset,
such as its central tendency, variability, and outliers. They are commonly used in statistical analysis,
data visualization, and data mining.
Here’s a sample data set for symmetric frequency distribution:
2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 8, 8, 9, 11
This data set has a relatively even distribution of values around its center, making it roughly symmetric.
The mean of this data set is approximately 6.64 and the median is 7, with roughly the same number of
values on either side of the median.
You can calculate the frequency distribution of each value by counting the number of times each value appears in the data set. For example, the frequency of the value 5 is 2, because it appears twice in
the data set. Similarly, the frequency of the value 8 is 3, because it appears three times in the data set.
You can use this data set to explore various measures of central tendency and variability, such as
mean, median, mode, range, variance, and standard deviation.
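These frequencies and summary measures can be computed directly with Python's statistics module, which is a convenient way to check the values obtained in Excel:

```python
import statistics
from collections import Counter

# The sample data set discussed above.
data = [2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 8, 8, 9, 11]

print(Counter(data))               # frequency of each value (5 -> 2, 8 -> 3, ...)
print(statistics.mean(data))       # arithmetic mean, about 6.64
print(statistics.median(data))     # 7
print(statistics.multimode(data))  # [7, 8]: two values tie for the mode
print(max(data) - min(data))       # range: 9
print(statistics.pstdev(data))     # population standard deviation
```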
Excel can be used to generate bar graphs for a given set of data. Simple scrutiny of the graph is sufficient to ascertain whether the values of the given data set are symmetrical or not.
Steps involved in this process:
Data entry - The given set of data is entered into the Excel column as shown below.
Image showing data entered into Excel column
Image showing insert tab which when clicked displays the graph generation tab (red circle)
The column containing the numerical data is selected.
The Insert tab is clicked.
In the Charts group the bar graph icon is displayed.
On clicking the bar graph icon the user is presented with a choice of different varieties of bar graph,
from which the desired one can be chosen. On clicking the desired type of bar graph, it will
be displayed within the spreadsheet.
If the data set is symmetrical the peaks of the bars will resemble a bell shaped curve.
Image showing the bargraph generated
Example for creating frequency tables from raw data using Excel:
This dataset includes the number of Instagram followers for each individual.
179
235
357
252
350
339
320
279
261
214
265
281
296
253
225
220
This is raw data; on its own it provides only the minimum and maximum values. Our intention is
to classify this dataset into various categories.
In the first step the user should open Excel and enter this numerical data under the column Insta
Followers. The data should be entered one below the other as shown in the image below:
Image showing data entered into the Excel spreadsheet
The first step would be to find out the minimum and maximum values of the dataset.
Formula used to calculate Minimum value of dataset: =MIN(cell range)
Image showing the formula entered for calculation of minimum value
Image showing Minimum value calculated when the ENTER button is pressed.
The next step would be to calculate the maximum value in the dataset.
Image showing the formula to calculate the maximum value of the dataset entered
On pressing the ENTER key the maximum value in the dataset would be displayed inside the specified cell.
Image showing the Maximum value displayed on pressing ENTER key
The next step would be to calculate the range between the maximum and minimum values in order
to create specific classes where the various components of the dataset can be placed. The formula to
calculate the range is Maximum - Minimum.
Image showing the formula being used to calculate the range of data
Image showing the value of Range displayed on the press of ENTER key
Hypothetically, if the user decides to classify the dataset into 5 subcategories, the following process
can be followed. The number of categories depends on the size of the dataset.
In order to identify the class width, one needs to divide the range value by the number of subcategories (5 in this case): 178 / 5 = 35.6, which is rounded up to 36.
Image showing category ranges calculated
The entire dataset is divided into 5 categories (classes)
The class width has been calculated as 36. Hence successive classes differ by this value, starting
from the minimum value of the dataset and running up to the maximum.
Class   Range
1       179-215
2       216-252
3       253-288
4       289-324
5       325-360
The following formulas can be used to calculate the frequencies (the cells H19, H20, H21 hold the
upper limit of each class):
1. =COUNTIF(A2:A17, "<="&H19) (A2:A17 is the data range; "<=" counts the values less than or
equal to the upper limit of the first class).
2. =COUNTIF(A2:A17, "<="&H20)-2 (the formula is the same, but 2, the frequency of the earlier
class, is subtracted to ensure that values already counted in the previous class are excluded).
3. =COUNTIF(A2:A17, "<="&H21)-6 (here 6, that is 2 + 4, is the cumulative frequency of the first
two classes).
The same approach is used to calculate the remaining classes: each formula subtracts the running
total of all earlier classes.
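The cumulative COUNTIF approach can be sketched in Python using the sixteen follower counts listed above. The subtraction of the previous cumulative count mirrors the adjustments made in the formulas:

```python
# The Instagram follower counts from the example above.
followers = [179, 235, 357, 252, 350, 339, 320, 279, 261, 214,
             265, 281, 296, 253, 225, 220]

# Upper limit of each class: 179-215, 216-252, 253-288, 289-324, 325-360.
upper_bounds = [215, 252, 288, 324, 360]

# Emulate COUNTIF(range, "<="&limit) for each class boundary.
cumulative = [sum(v <= ub for v in followers) for ub in upper_bounds]

# Frequency of each class = its cumulative count minus the previous one,
# which is exactly what subtracting the earlier classes' totals achieves.
frequency = [cumulative[0]] + [cumulative[i] - cumulative[i - 1]
                               for i in range(1, len(cumulative))]
print(frequency)   # one count per class, 179-215 first
```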
Image showing the values of 5 classes
Image showing frequency values of the 5 classes calculated using Excel
Another way of calculating a frequency table for a dataset is the Pivot table feature in Excel:
Just as there are numerous roads to Rome, a frequency table can also be plotted for a dataset using a
nifty feature available in Excel known as the PIVOT TABLE.
In the example below the reader will be exposed to the various steps that need to be followed to perform this analysis.
Data Entry:
The dataset that needs to be processed should be entered into the Excel spreadsheet in rows and columns. In this example the dataset contains the cost of food in various restaurants, entered in a
column. As a user, one would be interested in knowing the price bands of food and classifying
restaurants according to price band.
Image showing data entered in Excel. It contains three columns (Restaurant, Quality rating, and
Meal price) and 300 records.
Image showing the column containing data (meal price) selected
Image showing Pivot table tab which needs to be clicked
After selecting the column containing the price of the meal, the Insert tab is clicked. This displays the
Pivot table tab, which can be clicked to invoke the pivot table function in Excel.
Image showing PIVOT table menu box
In this Pivot table menu one can find the Table/Range field already filled, because the entire column along with the header was selected beforehand. In the next field the New worksheet radio button is
selected. On clicking the OK button a new worksheet containing the data will be created.
Image showing Pivot table created as a new Work sheet
The Pivot table field Meal Price is dragged downwards to populate the Rows and Values fields. It
should be dragged down separately into each field, as shown in the image below.
Image showing the Pivot table field Meal Price dragged downwards (red arrows) to fill the Rows
and Values fields
Image showing Pivot table generated. By default it displays the sum of Meal prices
Image showing the Pivot table fields. By default the sum of meal price is displayed. In order to
change the default value the down arrow (red circle) next to sum of meal price is clicked.
In the submenu displayed on clicking the down arrow, Count is chosen instead of Sum in the
"Summarize values by" field, as we are interested in counting the number of restaurants in each
price band of the food served.
Image showing the values under the meal price column arranged count-wise as set in the pivot table settings. This displays the exact count of restaurants serving meals in a specified price band.
Our next endeavour would be to categorize these price bands into 10 categories so that the data
would become more manageable for statistical analysis.
To achieve this the user should click on any field in the Pivot table. Once the selection has been
made, the mouse cursor is placed over the selection and right-clicked, bringing up the menu as shown
below.
Image showing the right click menu from which Group is chosen
In this menu the user needs to provide the start and end values, which are simply the lowest and
highest values in the dataset. The "By" field controls the size of each group; here it is set to 10, so the
meal prices are grouped into bands of width 10.
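The effect of the Group step can be sketched outside Excel as well. The snippet below uses randomly generated, purely hypothetical meal prices (the book's 300-record dataset is not reproduced here) and buckets them into bands of width 10, matching the "By" setting:

```python
import random
from collections import Counter

# Hypothetical meal prices standing in for the 300-restaurant dataset.
random.seed(1)
meal_prices = [random.randint(10, 49) for _ in range(300)]

width = 10  # interval size, as entered in the grouping dialog's "By" field

def band(price: int) -> str:
    """Label the equal-width band a price falls into, e.g. 23 -> '20-29'."""
    start = (price // width) * width
    return f"{start}-{start + width - 1}"

# Count restaurants per price band, mirroring the Pivot table's grouped counts.
counts = Counter(band(p) for p in meal_prices)
for b in sorted(counts):
    print(b, counts[b])
```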
Image showing the end result
For a better understanding of the dataset the user can generate a bar graph. A bar chart can be
generated by selecting all the rows and columns containing data and then clicking the Insert tab,
which opens up a set of tabs from which Bar chart can be chosen. In the Bar chart menu a vertical
bar chart is chosen and OK is clicked. This will generate a bar graph for the dataset as shown below.
Image showing Bar graph generated for the dataset
Calculating Frequency table for Nominal data using Pivot table:
In this example a sample data is taken which contains two variables (nominal):
1. Sex
2. Marital status
Data is entered into Excel Spreadsheet.
Male is indicated by M
Female is indicated by F
Married is indicated by MA
Single is indicated by S
Image showing data entered into Excel Spreadsheet
The columns containing dataset are selected and then Insert tab is clicked. This would reveal Pivot
table tab which should be clicked to bring up the pivot table dialog box. Since the dataset has already
been selected the Table/Range field of pivot table dialog box would be found already populated with
the selected cell’s addresses. New worksheet radio button is selected by clicking on it. This will ensure that the Pivot table is displayed in a new worksheet.
Image showing Pivot table creation dialog box
On clicking OK button Pivot table is generated.
Image showing Pivot table generated by dragging and placing the Gender and Status variables in the
appropriate fields. Gender is placed in the Rows field and Status is placed under the Columns field.
Pivot table created and displayed
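The two-way count that the Pivot table produces is a contingency table. As an illustration, here is a minimal Python sketch using a small hypothetical sample with the same codes (M/F for sex, MA/S for marital status):

```python
from collections import Counter

# Hypothetical (sex, marital status) records, coded as in the spreadsheet.
records = [("M", "MA"), ("M", "S"), ("F", "MA"), ("F", "MA"),
           ("F", "S"), ("M", "MA"), ("F", "S"), ("M", "S")]

# Count each (sex, status) combination, mirroring the Pivot table's cells.
table = Counter(records)

# Print a small contingency table: rows are sex, columns are status.
print("     MA  S")
for sex in ("M", "F"):
    print(sex, " ", table[(sex, "MA")], " ", table[(sex, "S")])
```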
The next step is generating a bar graph to visualize the dataset.
The entire data is selected and the Insert tab is clicked. This action brings out another set of tabs.
In the Charts group the Bar chart icon can be seen. On clicking the bar chart icon, a variety of bar
charts that can be created is displayed. The desired one is selected by clicking on it, resulting in the
display of the bar chart in the Excel spreadsheet.
Image showing Bar graph for the dataset generated.
There are several types of datasets, and they can be categorized based on various factors such as the
way data is collected, the nature of the data, and the intended use of the data. Here are some of the
most common types of datasets:
1. Structured datasets: Structured datasets are organized in a tabular form with well-defined columns and rows. These datasets are typically used in databases and spreadsheets, and they can be
easily manipulated and analyzed using statistical methods.
2. Unstructured datasets: Unstructured datasets contain data that is not organized in a predefined
manner. Examples of unstructured datasets include text documents, images, and videos. These datasets require advanced techniques such as natural language processing and computer vision to extract
meaningful insights.
3. Time-series datasets: Time-series datasets are organized based on the time dimension. These
datasets contain observations of a variable over time, and they are commonly used in forecasting and
predictive modeling.
4. Cross-sectional datasets: Cross-sectional datasets contain observations of a variable at a single
point in time. These datasets are commonly used in survey research and market research.
5. Longitudinal datasets: Longitudinal datasets contain observations of a variable over multiple
points in time. These datasets are used in longitudinal studies to study changes in a variable over
time.
6. Panel datasets: Panel datasets are a type of longitudinal dataset that contains observations of multiple variables over time. Panel datasets are commonly used in social sciences research to study individual behavior and decision-making.
7. Spatial datasets: Spatial datasets contain geographical information, such as latitude and longitude
coordinates, and can be used to analyze spatial patterns and relationships.
8. Graph datasets: Graph datasets contain data in the form of nodes and edges, representing the relationships between entities. These datasets are commonly used in social network analysis and graph
theory.
9. Simulation datasets: Simulation datasets are generated using computer simulations and contain
data on the behavior of a system or process. These datasets are commonly used in scientific research
and engineering.
These are just a few examples of the many types of datasets that exist. The type of dataset used depends on the nature of the data and the intended use of the data.
Structured datasets:
Structured datasets are a type of dataset that is organized in a highly organized, predetermined format. The data in a structured dataset is usually arranged in a table or spreadsheet-like format, with
each row representing an individual record or observation, and each column representing a specific
attribute or variable of that record.
The most common example of a structured dataset is a relational database, which is composed of
tables that contain records with predefined attributes. Structured datasets are characterized by their
consistency and predictability, making them easy to search, sort, and analyze with statistical methods.
Structured datasets are used in a variety of fields, including finance, healthcare, and customer relationship management. They can be used to analyze trends, patterns, and relationships between
different variables, as well as for predictive modeling and data mining. One advantage of structured
datasets is that they can be easily integrated with other datasets to form more comprehensive analyses.
Here is an example of a structured dataset in a tabular format:

ID   Name   Age   Gender   Occupation
1    John   32    Male     Engineer
2    Mary   28    Female   Accountant
3    Tom    45    Male     Manager
4    Lisa   22    Female   Intern
In this example, each row represents an individual record, and each column represents a specific
attribute of that record. The attributes include ID, Name, Age, Gender, and Occupation. Each record
has a unique ID number, and the other attributes provide additional information about each individual, such as their age, gender, and occupation.
Structured datasets can contain much more data than this example, with many more columns and
rows. The consistency and predictability of the data make it easy to analyze using statistical methods,
which can reveal trends, patterns, and relationships between different variables. It is fairly easy to
analyze a structured dataset, as it usually need not be subjected to extensive data cleaning processes.
Unstructured dataset:
Unstructured datasets are a type of dataset that contains data that is not organized in a predefined
or structured manner. Unstructured data refers to data that is not easily quantifiable or analyzed by
traditional methods, and typically includes textual, visual, or auditory information.
Examples of unstructured datasets include:
1. Textual data, such as emails, social media posts, news articles, and customer reviews.
2. Visual data, such as images, videos, and live streams.
3. Audio data, such as phone calls, voice memos, and music.
4. Sensor data, such as weather reports, satellite images, and traffic data.
5. Web data, such as webpages, blogs, and forums.
Unstructured datasets can be difficult to analyze and interpret due to their lack of structure and standardization. However, they contain valuable insights and information that can be harnessed with the
use of advanced analytical techniques such as natural language processing (NLP), computer vision,
and machine learning.
Unstructured datasets are increasingly important in today’s world as the amount of unstructured
data generated continues to grow exponentially. Many industries, including healthcare, finance, and
marketing, are beginning to realize the value of unstructured data and are investing in technologies
to harness its potential.
Time series datasets:
Time series datasets are a type of dataset that are organized based on the time dimension. These
datasets consist of observations of a variable or multiple variables over a specified period of time,
with a regular or irregular time interval between each observation.
Time series datasets are commonly used in forecasting, trend analysis, and predictive modeling in
fields such as finance, economics, weather forecasting, and engineering. Examples of time series
datasets include stock prices, weather patterns, and sales data.
Time series datasets can be analyzed using various statistical and mathematical techniques, such as
moving averages, trend analysis, and seasonal decomposition. These techniques can help identify
patterns, trends, and cycles in the data, as well as make predictions about future values of the variable
being measured.
One of the challenges of working with time series datasets is dealing with missing values or irregular
time intervals between observations, which can affect the accuracy of the analysis. However, there
are methods to handle missing values, such as imputation techniques, and various methods to interpolate or resample the data to a regular time interval.
In summary, time series datasets are an important type of dataset that can provide valuable insights
into the behavior of a variable over time, and can be used to inform decision-making and prediction
models in a variety of fields.
Here is an example of a time series dataset:

Date          Sales
01/01/2022    100
02/01/2022    125
03/01/2022    150
04/01/2022    135
05/01/2022    160
06/01/2022    175
07/01/2022    200
08/01/2022    220
09/01/2022    250
10/01/2022    275
In this example, the dataset records daily sales for a specific product from January 1st, 2022 to January 10th, 2022. The dataset has two columns: Date and Sales. Each row represents an observation for
a specific date, and the Sales column records the number of sales on that day.
This time series dataset can be used to analyze the behavior of sales over time, identify trends, seasonality, and cyclic patterns in sales, and make predictions about future sales. For example, a moving
average analysis of this dataset could help identify an increasing trend in sales over time, while a seasonal decomposition analysis could help identify weekly and daily patterns in sales that can inform
inventory and staffing decisions.
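As a sketch of the moving-average idea mentioned above, the following snippet computes a 3-day moving average over the sample sales series from the table:

```python
# Daily sales from the sample time series above.
sales = [100, 125, 150, 135, 160, 175, 200, 220, 250, 275]

window = 3
# Each entry averages the current day and the two preceding days.
moving_avg = [sum(sales[i - window + 1:i + 1]) / window
              for i in range(window - 1, len(sales))]
print(moving_avg)   # steadily rising values, reflecting the upward trend
```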
Cross sectional dataset:
A cross-sectional dataset refers to a type of data collected from a group of individuals or entities at
a specific point in time, rather than over a period of time. In other words, cross-sectional data captures a snapshot of a population or sample at a given time, and it is usually collected through surveys,
questionnaires, or observational studies.
For example, a cross-sectional dataset could be a survey conducted to gather information about
people’s dietary habits, exercise routines, and health conditions at a particular moment. The survey
would collect data from a diverse group of individuals with different ages, genders, and lifestyles, and
the data collected would be used to explore relationships between the variables of interest.
Cross-sectional data is useful in many fields, including social sciences, public health, economics, and
business. However, one limitation of cross-sectional data is that it cannot be used to establish causality or determine the direction of a relationship between variables. To do so, longitudinal data, which
is collected over time, is required.
Longitudinal dataset:
A longitudinal dataset is a type of data that is collected over time from the same group of individuals
or entities. In other words, a longitudinal study follows a group of subjects over a period of time and
collects data from them at multiple time points. This type of data is also referred to as panel data or
repeated measures data.
Longitudinal datasets are commonly used in fields such as epidemiology, psychology, and social sciences to study the changes that occur in individuals or groups over time. For example, a longitudinal
study may collect data on the development of cognitive skills in children over several years.
The advantage of longitudinal datasets is that they allow researchers to examine changes over time
and to study the effects of time-varying factors on outcomes. Additionally, they can help researchers
identify patterns of change that may not be apparent in cross-sectional data.
However, collecting and managing longitudinal datasets can be challenging, as it requires following
the same group of individuals over time and ensuring that data is collected consistently and accurately at each time point. Additionally, attrition, or the loss of subjects over time, can be a major issue
in longitudinal studies.
Panel dataset:
A panel dataset is a type of longitudinal dataset where the same group of individuals or entities are
observed at multiple time points, with measurements taken at each time point. Panel data is also
known as longitudinal or repeated measures data.
Panel datasets are commonly used in social sciences, economics, and business to study changes over
time and to identify causal relationships between variables. They allow researchers to study how individual-level characteristics and external factors affect outcomes over time.
For example, a panel dataset may be used to study the relationship between income and health status
over time. The same individuals would be surveyed at multiple time points to collect data on their income and health status, allowing researchers to examine changes over time and to identify the causal
relationship between income and health status.
Panel datasets have several advantages over cross-sectional datasets, including the ability to control
for individual-level differences and to study changes over time. However, panel datasets also have
some limitations, including the potential for attrition and missing data. Additionally, panel datasets
can be more complex to analyze than cross-sectional data because they require modeling of the correlations between the repeated measures.
Spatial dataset:
Spatial datasets refer to data that has a geographic or spatial component. In other words, spatial data
is data that is tied to a specific location on the Earth’s surface. Spatial datasets are used in many fields,
including geography, ecology, urban planning, and transportation.
Spatial data can take many forms, such as maps, satellite imagery, and GPS coordinates. Some examples of spatial datasets include:
1. Topographic maps that show the terrain and elevation of an area
2. Satellite images that show land cover and changes over time
3. GPS data that tracks the movement of vehicles or people
4. Climate data that shows temperature and precipitation patterns across a region
Spatial datasets are commonly analyzed using Geographic Information Systems (GIS) software,
which allows researchers to visualize, analyze, and manipulate spatial data. GIS software can be used
to create maps, identify patterns and trends in spatial data, and make predictions about future changes.
Spatial datasets are important for understanding the relationship between human activities and the
natural environment. They are used to inform decisions about land use, resource management, and
environmental policy, among other things.
Simulation dataset:
Simulation datasets are datasets that are generated through computer simulations or modeling. Simulation datasets are used in many fields, including engineering, physics, economics, and biology.
Simulation datasets are created by running computer simulations that model a specific phenomenon
or system. The simulation generates data that can be analyzed to study the behavior of the system
under different conditions. For example, a simulation dataset could be created to study the impact of
a new drug on a particular disease.
Simulation datasets can be useful when it is not feasible or ethical to conduct experiments in real life.
Simulations can be used to test the effects of interventions or treatments without exposing subjects
to potential risks. Additionally, simulations can be used to study complex systems that are difficult to
study in real life.
Simulation datasets can also be used to validate and improve models. By comparing simulation
results to real-world data, researchers can refine their models and make more accurate predictions
about the behavior of the system.
Simulation datasets can take many forms, depending on the type of simulation being run. They can
include time series data, spatial data, or network data, among other things. Simulation datasets can
be analyzed using a variety of statistical and machine learning techniques, depending on the research
question and the type of data being analyzed.
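As a minimal sketch of how such a dataset might be generated (the model, its parameters, and the numbers are invented purely for illustration), a simple stochastic model can be run many times and its outputs collected:

```python
import random

# Toy simulation: a treated patient's response is modeled as a baseline value
# minus an assumed drug effect, plus random noise. All numbers are invented.
def simulate_patient(rng, baseline=100.0, drug_effect=15.0, noise_sd=5.0):
    return baseline - drug_effect + rng.gauss(0, noise_sd)

rng = random.Random(42)  # fixed seed so the simulated dataset is reproducible
dataset = [simulate_patient(rng) for _ in range(1000)]

# The collected outputs form a simulation dataset that can then be analyzed
# with the same statistical tools as real data.
mean_response = sum(dataset) / len(dataset)
```

Fixing the random seed makes the simulated dataset reproducible, which matters when simulation results are later compared against real-world data.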
Mastering Statistical Analysis with Excel
88
Graph datasets:
Graph datasets are a type of data that represents relationships or connections between objects or
entities. A graph is a data structure that consists of nodes, or vertices, that are connected by edges.
Graph datasets are used in many fields, including social network analysis, biology, and computer
science.
In a graph dataset, nodes represent objects or entities, and edges represent relationships or connections between them. For example, a graph dataset could represent a social network, where nodes
represent people, and edges represent friendships or other connections between them.
Graph datasets can be directed or undirected. In a directed graph, edges have a specific direction,
indicating a one-way relationship between nodes. In an undirected graph, edges have no direction
and represent a two-way relationship between nodes.
Graph datasets can also have weights, which represent the strength or importance of the relationship
between nodes. For example, in a social network graph, weights could represent the number of interactions or the strength of the friendship between nodes.
Graph datasets are commonly analyzed using graph theory, which is a branch of mathematics that
studies the properties of graphs. Graph theory can be used to identify patterns and structures in
graph datasets, to measure the importance or centrality of nodes, and to make predictions about the
behavior of the graph over time.
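A small graph dataset can be sketched in code as an adjacency map; here an undirected, weighted friendship network (the names and weights are made up for illustration):

```python
# Undirected, weighted social-network graph stored as an adjacency map:
# each node maps to its neighbours, and each neighbour to the edge weight
# (e.g. the number of interactions between the two people).
graph = {
    "Alice": {"Bob": 3, "Carol": 1},
    "Bob":   {"Alice": 3, "Dave": 2},
    "Carol": {"Alice": 1},
    "Dave":  {"Bob": 2},
}

def degree(g, node):
    """Number of edges incident to a node - a simple centrality measure."""
    return len(g[node])

def edge_weight(g, a, b):
    """Weight of the edge between a and b, or 0 if they are not connected."""
    return g.get(a, {}).get(b, 0)
```

Because the graph is undirected, each edge is stored twice (once under each endpoint), which keeps neighbour lookups symmetric and fast.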
Analyzing time series data using Excel:
By analyzing time series data, it becomes possible to investigate how a characteristic changes over a period of time. A time series consists of data points that are arranged in chronological order and collected at regular intervals, such as daily or annually. To analyze time series data in Excel, let us proceed with the appropriate procedure.
The sample data used here consists of nine months of sales data. All the sales data for this period was compiled on the first day of each month.
The sample data used is shown below:
Date Sales
01-01-2020 100
01-02-2020 120
01-03-2020 150
01-04-2020 130
01-05-2020 160
01-06-2020 180
01-07-2020 200
01-08-2020 220
01-09-2020 240
Prof. Dr Balasubramanian Thiagarajan MS D.L.O.
In the first step, the data is entered into Excel as shown in the image below.
Image showing data entered into Excel spreadsheet
Analyzing the dataset:
Click the Data tab. This reveals the Data Analysis tab, which should be clicked. This opens the Data Analysis window, in which Exponential Smoothing is chosen.
On clicking the OK button, the Exponential Smoothing window opens.
In the Exponential Smoothing window, place the cursor in the Input Range field and click. Select the dataset, including its heading; the cell addresses will appear in the field.
The Labels check box should be selected if the header of the dataset is included in the input range, as shown in the image. Click the Output Range field and select the cells where the result should be displayed; simply selecting the cells is enough to fill this field.
The Chart Output check box is also checked.
On clicking the OK button, a graph will be generated.
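For comparison, the smoothing Excel performs can be sketched in Python. Excel's tool asks for a damping factor d and uses the smoothing constant α = 1 − d; a damping factor of 0.3 is assumed here, and the series is seeded with the first observation (Excel's output is a one-step-behind forecast, so its cell values are offset by one period relative to this sketch):

```python
# Exponential smoothing of the nine-month sales series.
sales = [100, 120, 150, 130, 160, 180, 200, 220, 240]

def exponential_smoothing(values, damping=0.3):
    """Smooth a series: each point is alpha*observation + (1-alpha)*previous."""
    alpha = 1 - damping
    smoothed = [values[0]]  # seed with the first observation
    for y in values[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

smoothed_sales = exponential_smoothing(sales)
```

A smaller damping factor (larger α) tracks recent observations more closely; a larger one produces a smoother, slower-moving curve.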
Image showing Data Analysis tab which should be clicked to analyze the dataset
Image showing Exponential Smoothing menu
Image showing Exponential Smoothing window
Image showing results displayed
5
Data Description
Descriptive statistics refers to the techniques used to describe data, including graphical representations and measures of central tendency and variability. Its purpose is to provide a meaningful summary of the data that can be used to generate insights and draw conclusions. Essentially, descriptive statistics enables us to gain a deeper understanding of the data by presenting it in a concise and organized manner.
To describe a dataset, one typically follows these steps:
1. Identify the variables: Determine which variables are included in the dataset and what they represent. Variables are characteristics of the data, such as age, gender, or income.
2. Examine the data distribution: Look at the distribution of each variable to determine its range,
mean, median, mode, and any patterns or outliers.
3. Summarize the data: Use descriptive statistics such as measures of central tendency (mean, median,
mode) and measures of variability (range, variance, standard deviation) to summarize the data.
4. Visualize the data: Create graphs or charts to help visualize the data distribution and identify any
patterns or trends.
5. Interpret the data: Interpret the findings and draw conclusions based on the analysis of the data.
Overall, the goal of describing a dataset is to provide a comprehensive and accurate understanding of
the data and its characteristics. This is important for identifying any trends, patterns, or outliers in the
data, as well as for making informed decisions and drawing meaningful conclusions based on the data.
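The summary statistics in steps 2 and 3 can be sketched with Python's standard statistics module (the sample values below are arbitrary):

```python
import statistics

data = [23, 29, 20, 32, 23, 21, 33, 25]  # arbitrary sample values

summary = {
    # measures of central tendency
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    # measures of variability
    "range": max(data) - min(data),
    "variance": statistics.variance(data),  # sample variance (n - 1 denominator)
    "stdev": statistics.stdev(data),
}
```

Note that statistics.variance and statistics.stdev use the sample (n − 1) denominator, matching Excel's VAR.S and STDEV.S rather than the population versions.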
Identification of variables:
To identify variables in a dataset, follow these steps:
1. Understand the research question: Identify the research question or objective of the dataset. This
will help you determine which variables are relevant.
2. Examine the data dictionary: Look for a data dictionary or codebook that provides information on
the variables included in the dataset. The data dictionary should provide definitions of each variable
and its possible values.
3. Look at the variable names: Review the variable names to determine what they represent. For example, if a variable is named “age,” it is likely to represent the age of individuals in the dataset.
4. Review the variable values: Look at the values of each variable to determine its range and possible
values. For example, if a variable represents income, it may have values ranging from $0 to $1,000,000.
5. Consider the variable type: Determine the type of each variable, such as categorical or continuous.
Categorical variables have a limited number of values, while continuous variables can take on any
value within a range.
By following these steps, you can identify the variables in a dataset and gain a better understanding of
what they represent. This is important for performing data analysis and drawing meaningful insights
from the data.
Data Distribution:
To examine data distribution, follow these steps:
1. Create a histogram: A histogram is a graph that shows the frequency of values for a given variable. It
provides a visual representation of the distribution of data values.
2. Calculate measures of central tendency: Measures of central tendency, such as mean, median, and
mode, provide insight into the central or typical values in the dataset.
3. Calculate measures of variability: Measures of variability, such as range, variance, and standard
deviation, provide insight into the spread or dispersion of the data.
4. Check for outliers: Outliers are values that are significantly different from the rest of the data. They
can skew the distribution of the data and should be identified and examined separately.
5. Use visualizations: Additional visualizations, such as box plots or density plots, can provide further
insight into the distribution of the data.
By examining the distribution of data, you can gain a better understanding of the characteristics of the
data and identify any patterns or anomalies. This is an important step in data analysis and can inform
the selection of appropriate statistical methods and techniques.
Example:
Here’s an example of a dataset that you can use to analyze its distribution:
The dataset contains the exam scores of a class of 30 students:
Student ID Exam Score
1 78
2 87
3 65
4 92
5 74
6 85
7 69
8 82
9 91
10 80
11 83
12 72
13 88
14 79
15 67
16 93
17 76
18 84
19 77
20 89
21 70
22 81
23 75
24 90
25 73
26 86
27 68
28 94
29 71
30 95
You can examine the distribution of the exam scores in this dataset by creating a histogram, calculating measures of central tendency and variability, and checking for outliers. This will give you a better
understanding of the characteristics of the data and can inform your analysis and decision-making.
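As an illustration, the exam scores above can be summarized in a few lines of Python using only the standard library:

```python
import statistics

scores = [78, 87, 65, 92, 74, 85, 69, 82, 91, 80,
          83, 72, 88, 79, 67, 93, 76, 84, 77, 89,
          70, 81, 75, 90, 73, 86, 68, 94, 71, 95]

mean = statistics.mean(scores)         # central tendency
median = statistics.median(scores)
score_range = max(scores) - min(scores)
stdev = statistics.stdev(scores)       # sample standard deviation

# A crude outlier screen: flag anything more than 2 standard deviations
# from the mean (for this dataset, nothing is flagged).
outliers = [s for s in scores if abs(s - mean) > 2 * stdev]
```

The two-standard-deviation cutoff is only one common rule of thumb; box-plot (1.5 × IQR) screening is another option.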
Analyzing data distribution using Excel:
Excel has handy tools for analyzing the distribution of data within a dataset. Let us use the given dataset to examine data distribution using Excel.
First step: Entering data into Excel spreadsheet.
Image showing data entered into spreadsheet
The Descriptive Statistics function under the Data Analysis tab is used to perform this task.
Image showing Data Analysis dialog box with the various menus available. Descriptive Statistics is chosen.
In the Input Range field, the data to be analyzed should be entered. The easiest way to do this is to select the entire column of data to be analyzed while the cursor is blinking in the Input Range field.
In the Grouped By field, Columns is chosen, as the selected data is arranged in a column.
If labels are included in the selection, then the Labels in First Row box should be checked.
Under Output Options, the range of cells where the result should be displayed is selected. This can be done by selecting the cells while the cursor is blinking in the Output Range field.
Summary Statistics, Confidence Level for Mean, Kth Largest, and Kth Smallest can be checked.
On pressing the OK button, the result will be displayed in the cells selected in the output range.
Image showing Descriptive statistics dialog box
Image showing Descriptive statistics output
Putting data into bins is a convenient way of analyzing a dataset. It allows the user to create a histogram, which is a very useful tool for analyzing data distribution. Using Excel, one can easily create bins into which data can be placed; this process is known as binning. To perform binning, first enter the data into the Excel spreadsheet in columns. In this example, two columns are created (one for the student ID and the other for the marks secured). Excel's pivot table feature is a useful way to perform binning. First select the entire column of data, including the header, that needs to be analyzed. Then click the Insert tab. This reveals some new tabs, which include PivotTable.
Drag the Marks field into the Rows and Values areas, as shown in the image. By default, an Excel pivot table sums the values. To make it count the values instead, the value field settings should be changed by clicking the down arrow next to Sum of Marks. Clicking the down arrow opens a new menu from which Count can be chosen.
Image showing Pivot table for the dataset being generated.
Image showing value field settings menu
Image showing count being chosen in the value field settings
Image showing sum of marks replaced by count of marks
In the pivot table, select one data value and right-click. This opens a submenu, as shown below.
Image showing the Grouping submenu, which should be chosen
In the Grouping window, the start and end values will already be populated. The bin width can be entered in the By field; here 5 is chosen.
Image showing bins created
Now it is time to invoke the Data Analysis tool, which is found under the Data tab. This opens the Data Analysis dialog box, in which Histogram should be chosen. In the ensuing dialog box, the data to be analyzed should be selected in the Input Range field and the bins in the Bin Range field. The type of histogram output is chosen. On clicking the OK button, the histogram is generated.
Image showing Histogram generated
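The same grouping can be reproduced outside Excel; here is a sketch that bins the exam scores with the same interval width of 5 used above:

```python
from collections import Counter

scores = [78, 87, 65, 92, 74, 85, 69, 82, 91, 80,
          83, 72, 88, 79, 67, 93, 76, 84, 77, 89,
          70, 81, 75, 90, 73, 86, 68, 94, 71, 95]

WIDTH = 5  # same interval width entered in the Grouping dialog's "By" field

# Map each score to the lower edge of its bin, then count scores per bin.
bins = Counter((s // WIDTH) * WIDTH for s in scores)
histogram = {f"{lo}-{lo + WIDTH - 1}": bins[lo] for lo in sorted(bins)}
```

Printing `histogram` gives the same frequency table the pivot table produces, ready for a bar chart.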
6
Single Factor ANOVA
Single factor ANOVA (Analysis of Variance) is a statistical method used to test for differences in means among two or more groups that are formed based on a single categorical independent variable, or factor. It is also known as one-way ANOVA.
In single factor ANOVA, the categorical independent variable divides the population into two or
more groups, and the dependent variable is a continuous variable that is measured on each group. The
ANOVA test examines the variability between the groups (due to the differences among group means)
and the variability within the groups (due to the individual differences within each group). The ANOVA test compares the ratio of the between-group variability to the within-group variability, and uses
an F-test to determine if there is a significant difference in means among the groups.
The null hypothesis in single factor ANOVA is that there is no significant difference in means among
the groups, while the alternative hypothesis is that at least one group has a different mean than the
others. If the p-value is less than the significance level (usually 0.05), then we reject the null hypothesis
and conclude that there is a significant difference in means among the groups. If the p-value is greater than the significance level, then we fail to reject the null hypothesis and conclude that there is no
significant difference in means among the groups.
Here is a sample dataset that can be used for a single factor ANOVA:
Suppose we want to test if there is a significant difference in the average weight of apples from three
different orchards:
Orchard A: 50, 55, 58, 60, 65
Orchard B: 45, 50, 52, 55, 60
Orchard C: 40, 45, 47, 50, 55
In this example, the independent variable (factor) is the orchard (A, B, C) and the dependent variable
is the weight of the apples. We can perform a single factor ANOVA to test whether there is a significant difference in the mean weight of apples among the three orchards.
We can enter the data into a software program or statistical calculator to calculate the ANOVA results.
The output will include the F-statistic, the degrees of freedom, the p-value, and other statistics that
will help us interpret the results of the test.
Manual analysis of this data:
Step 1: Calculate the mean weight of apples for each orchard:
Orchard A: (50 + 55 + 58 + 60 + 65) / 5 = 57.6
Orchard B: (45 + 50 + 52 + 55 + 60) / 5 = 52.4
Orchard C: (40 + 45 + 47 + 50 + 55) / 5 = 47.4
Step 2: Calculate the overall mean weight of apples:
Overall mean = (57.6 + 52.4 + 47.4) / 3 = 52.47
Step 3: Calculate the sum of squares between the groups (SSbetween):
SSbetween = n * sum((group mean - overall mean)^2)
where n is the number of observations in each group.
SSbetween = 5 * [(57.6 - 52.47)^2 + (52.4 - 52.47)^2 + (47.4 - 52.47)^2]
= 260.13
Step 4: Calculate the sum of squares within the groups (SSwithin):
SSwithin = sum((x - group mean)^2)
where x is the weight of each apple.
SSwithin = (50-57.6)^2 + (55-57.6)^2 + (58-57.6)^2 + (60-57.6)^2 + (65-57.6)^2 +
(45-52.4)^2 + (50-52.4)^2 + (52-52.4)^2 + (55-52.4)^2 + (60-52.4)^2 +
(40-47.4)^2 + (45-47.4)^2 + (47-47.4)^2 + (50-47.4)^2 + (55-47.4)^2
= 375.6
Step 5: Calculate the degrees of freedom for between and within groups:
dfbetween = k - 1, where k is the number of groups
dfwithin = N - k, where N is the total number of observations
dfbetween = 3 - 1 = 2
dfwithin = 15 - 3 = 12
Step 6: Calculate the mean squares for between and within groups:
MSbetween = SSbetween / dfbetween = 260.13 / 2 = 130.07
MSwithin = SSwithin / dfwithin = 375.6 / 12 = 31.3
Step 7: Calculate the F-statistic:
F = MSbetween / MSwithin = 130.07 / 31.3 = 4.16
Step 8: Find the p-value:
Using a statistical table or software, we can find the p-value associated with F = 4.16 with dfbetween = 2 and dfwithin = 12, which is approximately 0.04 and therefore less than 0.05. We reject the null hypothesis and conclude that there is a significant difference in the mean weight of apples among the three orchards.
In summary, the results of the single factor ANOVA test show that there is a significant difference in the mean weight of apples among the three orchards (F(2,12) = 4.16, p < 0.05).
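The steps above can be checked with a short Python script that recomputes the one-way ANOVA from the raw orchard data (standard library only):

```python
# One-way (single factor) ANOVA computed from scratch.
groups = {
    "A": [50, 55, 58, 60, 65],
    "B": [45, 50, 52, 55, 60],
    "C": [40, 45, 47, 50, 55],
}

def mean(xs):
    return sum(xs) / len(xs)

all_values = [x for g in groups.values() for x in g]
grand_mean = mean(all_values)

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

df_between = len(groups) - 1               # k - 1
df_within = len(all_values) - len(groups)  # N - k
f_stat = (ss_between / df_between) / (ss_within / df_within)
```

The resulting F-statistic can be compared against the critical value of the F(2, 12) distribution, exactly as in the manual calculation.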
Using Excel to Perform Single Factor Anova:
As a first step the data should be entered into Excel spreadsheet as shown below.
Image showing Data entered
Image showing Single Factor Anova chosen in the Data Analysis screen
Image showing Single factor Anova window
In this window, the cursor is placed in the Input Range field. When the cursor starts to blink, the entire dataset to be analyzed is selected; on selection, the cell addresses are automatically entered into this field. In the Grouped By field, Columns is chosen. Labels in First Row is checked if the first row is included in the input range. The Alpha value is left at 0.05. In the Output Range, the cells where the user wants the result displayed are chosen. On clicking the OK button, the result is displayed in the cells chosen under the output range.
Image showing the Result displayed
Single-factor ANOVA (analysis of variance) is used when you want to compare the means of three or
more groups. Specifically, it is used to determine if there are any significant differences between the
means of these groups.
Single-factor ANOVA is appropriate when you have one categorical independent variable (also
called a factor) with three or more levels and one continuous dependent variable. For example, if you
wanted to compare the mean test scores of students who were taught using three different teaching
methods (e.g., lecture, group discussion, online videos), you could use a single-factor ANOVA.
It is important to note that ANOVA assumes that the data is normally distributed and that the variances of each group are equal. If these assumptions are not met, then other statistical tests, such as
nonparametric tests, may be more appropriate.
Pitfalls of Single Factor Anova:
While single-factor ANOVA is a powerful statistical tool for comparing means of multiple groups,
there are several potential pitfalls to be aware of:
1. Assumptions: As mentioned earlier, ANOVA assumes that the data is normally distributed and
that the variances of each group are equal. If these assumptions are not met, the results of ANOVA
may be inaccurate.
2. Sample size: The accuracy and reliability of ANOVA results are dependent on the sample size. If
the sample size is too small, the statistical power may be too low to detect meaningful differences
between groups.
3. Multiple testing: If you perform multiple tests of significance (e.g., comparing the means of many
groups), there is an increased risk of obtaining false positives (type I errors).
4. Outliers: ANOVA is sensitive to outliers in the data. Outliers can skew the results and potentially
obscure meaningful differences between groups.
5. Interpretation: ANOVA only tells you whether there is a significant difference between groups,
but it doesn’t tell you which group(s) differ significantly from each other. Post-hoc tests (e.g., Tukey’s
HSD, Bonferroni correction) can help you identify which groups are significantly different from each
other, but there is still a risk of overinterpreting the results.
6. Effect size: ANOVA only provides information on whether there is a statistically significant difference between groups, but it does not provide information on the magnitude of the effect. It is important to consider effect size when interpreting the results of ANOVA.
7
ANOVA: Two-Factor with
Replication
ANOVA (Analysis of Variance) is a statistical technique used to compare the means of two or more groups. ANOVA with two factors and replication, also known as two-way ANOVA, is a type of ANOVA that involves two independent variables, or factors, and multiple observations for each combination of the two factors.
In two-factor ANOVA with replication, we are interested in examining the effect of two factors, A and
B, on a response variable, Y. For each combination of the two factors, we have multiple observations,
or replicates, of the response variable. The factors A and B can be either categorical or continuous
variables, and their levels are typically chosen by the researcher.
The null hypothesis for two-factor ANOVA with replication is that there is no interaction between the
two factors, and that the main effects of each factor are independent of the other factor. The alternative hypothesis is that there is a significant interaction between the two factors, and/or that the main
effects of one or both factors are dependent on the other factor.
To test the null hypothesis, we calculate the sum of squares (SS) and the mean square (MS) for each
factor, as well as the interaction between the two factors. We then use an F-test to determine if the
observed differences between the groups are statistically significant.
If the p-value from the F-test is less than the chosen significance level (e.g., α = 0.05), we reject the null
hypothesis and conclude that there is evidence of a significant difference between at least two groups.
If the p-value is greater than the significance level, we fail to reject the null hypothesis and conclude
that there is not enough evidence to support the claim of a significant difference between groups.
Example:
Here is an example of sample data that could be used to perform ANOVA: Two-Factor with Replication:
Suppose we want to test the effect of two factors, Gender and Age Group, on the average blood pressure of individuals. We randomly select 10 individuals of each gender (male and female) from each age group (20-30, 31-40, and 41-50) and record their blood pressure in mmHg. The data is as follows:
Age Group Male Female
20-30 120 110
20-30 125 115
20-30 130 120
20-30 135 125
20-30 140 130
20-30 145 135
20-30 150 140
20-30 155 145
20-30 160 150
20-30 165 155
31-40 125 115
31-40 130 120
31-40 135 125
31-40 140 130
31-40 145 135
31-40 150 140
31-40 155 145
31-40 160 150
31-40 165 155
31-40 170 160
41-50 135 125
41-50 140 130
41-50 145 135
41-50 150 140
41-50 155 145
41-50 160 150
41-50 165 155
41-50 170 160
41-50 175 165
41-50 180 170
To analyze this data using ANOVA: Two-Factor with Replication, we would calculate the sum of
squares (SS) and the mean square (MS) for each factor (Gender and Age Group) and for the interaction between the two factors. We would then use an F-test to determine if the observed differences
between the groups are statistically significant.
The null hypothesis for this example is that there is no interaction between gender and age group, and
that the main effects of each factor are independent of the other factor. The alternative hypothesis is
that there is a significant interaction between the two factors, and/or that the main effects of one or
both factors are dependent on the other factor.
If the p-value from the F-test is less than the chosen significance level (e.g., α = 0.05), we reject the null
hypothesis and conclude that there is evidence of a significant difference between at least two groups.
If the p-value is greater than the significance level, we fail to reject the null hypothesis and conclude
that there is not enough evidence to support the claim of a significant difference between groups.
Calculation:
First, we calculate the mean blood pressure for each of the six Gender × Age Group cells (n = 10 observations per cell):
Male: 20-30 = 142.5, 31-40 = 147.5, 41-50 = 157.5
Female: 20-30 = 132.5, 31-40 = 137.5, 41-50 = 147.5
The grand mean is the average of all 60 observations:
Grand Mean = 8650 / 60 = 144.17
Next, we calculate the sum of squares (SS) for each factor, the interaction, and the error (within-cell) term:
SS_Gender = 30 * [(149.17 - 144.17)^2 + (139.17 - 144.17)^2] = 1500.00
(the gender means are Male = 149.17 and Female = 139.17, each based on 30 observations)
SS_AgeGroup = 20 * [(137.5 - 144.17)^2 + (142.5 - 144.17)^2 + (152.5 - 144.17)^2] = 2333.33
(the age-group means are 137.5, 142.5, and 152.5, each based on 20 observations)
SS_Interaction = n * ΣΣ(cell mean - gender mean - age-group mean + Grand Mean)^2 = 0.00
(in this dataset every cell mean lies exactly 5 above or below its age-group mean, so the two factors are perfectly additive and the interaction term vanishes)
SS_Error = ΣΣ(Y_ij - cell mean)^2 = 12375.00
SS_Total = SS_Gender + SS_AgeGroup + SS_Interaction + SS_Error = 16208.33
where n is the number of observations for each combination of the factors (in this case, n = 10).
Next, we calculate the degrees of freedom (df):
df_Gender = k - 1 = 1 (where k is the number of levels of the Gender factor, in this case k = 2)
df_AgeGroup = j - 1 = 2 (where j is the number of levels of the Age Group factor, in this case j = 3)
df_Interaction = (k-1)*(j-1) = 2
df_Error = N - k*j = 60 - 6 = 54
df_Total = N - 1 = 59
Next, we calculate the mean square (MS) for each source:
MS_Gender = SS_Gender / df_Gender = 1500.00 / 1 = 1500.00
MS_AgeGroup = SS_AgeGroup / df_AgeGroup = 2333.33 / 2 = 1166.67
MS_Interaction = SS_Interaction / df_Interaction = 0.00
MS_Error = SS_Error / df_Error = 12375.00 / 54 = 229.17
Finally, we calculate the F-statistic for each factor and the interaction by dividing its mean square by MS_Error, the residual variance after accounting for the effects of the factors:
F_Gender = MS_Gender / MS_Error = 6.55, p-value ≈ 0.013
F_AgeGroup = MS_AgeGroup / MS_Error = 5.09, p-value ≈ 0.009
F_Interaction = MS_Interaction / MS_Error = 0.00, p-value = 1.00
Since the p-values for both main effects are less than the chosen significance level (e.g., α = 0.05), we reject the null hypotheses for Gender and Age Group and conclude that mean blood pressure differs significantly between genders and among age groups. The p-value for the interaction is far above the significance level, so there is no evidence of an interaction between gender and age group: the effect of age group on blood pressure is the same for both genders.
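The full decomposition can be verified with a short Python script that recomputes the two-factor ANOVA from the raw blood-pressure data (standard library only):

```python
# Two-factor ANOVA with replication, computed from scratch.
# Each (age group, gender) cell holds 10 replicate blood-pressure readings;
# every cell in this dataset is an arithmetic run of 10 values in steps of 5.
def run(start):
    return [start + 5 * i for i in range(10)]

cells = {
    ("20-30", "M"): run(120), ("31-40", "M"): run(125), ("41-50", "M"): run(135),
    ("20-30", "F"): run(110), ("31-40", "F"): run(115), ("41-50", "F"): run(125),
}
ages = ["20-30", "31-40", "41-50"]
genders = ["M", "F"]
n = 10  # replicates per cell

def mean(xs):
    return sum(xs) / len(xs)

grand = mean([x for v in cells.values() for x in v])
cell_mean = {k: mean(v) for k, v in cells.items()}
gender_mean = {g: mean([x for (a, gg), v in cells.items() if gg == g for x in v])
               for g in genders}
age_mean = {a: mean([x for (aa, g), v in cells.items() if aa == a for x in v])
            for a in ages}

# Sums of squares for each factor, the interaction, and the error term.
ss_gender = sum(n * len(ages) * (gender_mean[g] - grand) ** 2 for g in genders)
ss_age = sum(n * len(genders) * (age_mean[a] - grand) ** 2 for a in ages)
ss_inter = sum(n * (cell_mean[(a, g)] - gender_mean[g] - age_mean[a] + grand) ** 2
               for a in ages for g in genders)
ss_error = sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v)

ms_error = ss_error / (60 - 6)  # df_error = N - number of cells
f_gender = (ss_gender / 1) / ms_error
f_age = (ss_age / 2) / ms_error
```

Excel's Anova: Two-Factor With Replication tool performs this same decomposition, so its output table can be compared against these values.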
The same calculation can be performed using Excel's Data Analysis feature.
Steps involved:
Data entry:
The entire data is entered into Excel in columns and rows as shown in the image below.
Image showing data entered into Excel spreadsheet
After entering the data in the Excel spreadsheet, the user should click on the Data tab. This reveals the Data Analysis tab. Clicking the Data Analysis tab opens the Data Analysis menu, in which the user chooses Anova: Two-Factor With Replication before clicking the OK button.
Image showing Data Analysis menu
On clicking the OK button, the ANOVA: Two-Factor With Replication menu opens.
The cursor is placed inside the Input Range field and clicked. When the cursor starts to blink, all the cells containing the data are selected. This ensures that the addresses of these cells are entered into the Input Range field.
In the Rows per sample field, 10 is keyed in, since each group has 10 samples.
The Alpha value is left at 0.05, which is the default setting.
In the Output options, click on the New Worksheet Ply radio button.
Image showing ANOVA: Two-Factor with Replication menu with the relevant fields filled in
On clicking the OK button the result will be displayed in a separate sheet.
Image showing P value displayed
Anova: Two-Factor With Replication
Entire result:
Pitfalls:
Two-factor ANOVA with replication is a statistical method used to analyze the effects of two categorical independent variables on a continuous dependent variable. Although it is a useful tool, there are
several pitfalls to be aware of when using this method:
1. Violation of normality assumption: Two-factor ANOVA assumes that the dependent variable follows a normal distribution. If this assumption is violated, the results of the analysis may be biased.
2. Violation of equal variances assumption: Two-factor ANOVA also assumes that the variance of the
dependent variable is equal across all levels of the independent variables. Violation of this assumption can lead to incorrect conclusions and false positives.
3. Interaction effect: Two-factor ANOVA assumes that the two independent variables do not interact
with each other. However, if there is an interaction effect between the two factors, it can affect the
interpretation of the main effects.
4. Sample size: Two-factor ANOVA with replication requires a large sample size to achieve sufficient
statistical power. If the sample size is too small, the results may not be reliable.
5. Multiple comparisons: Two-factor ANOVA with replication can lead to multiple comparisons,
which increases the risk of false positives. It is important to correct for multiple comparisons to avoid
drawing incorrect conclusions.
6. Missing data: Missing data can be a problem in two-factor ANOVA with replication, as it reduces
the power of the analysis and can bias the results if not handled properly.
7. Assumption of independence: Two-factor ANOVA assumes that the observations are independent.
If there is dependence between observations, such as in a repeated measures design, then two-factor
ANOVA may not be appropriate.
Scenario for using this statistical evaluation:
Two-way ANOVA with replication can be used when you have two independent variables (also known as factors) and multiple observations are collected in each combination of the two variables (hence the term "replication"). In other words, each combination of factor levels is tested on more than one individual or subject.
For example, suppose you want to investigate the effect of two different factors (e.g., temperature and
humidity) on the growth of a certain plant. You would randomly assign plants to different combinations of temperature (low, medium, high) and humidity (low, medium, high) levels and measure
the growth of the plants. Each combination of temperature and humidity levels is tested on multiple
plants, which is why this is called “replication”.
In this scenario, a two-way ANOVA with replication can be used to determine if there is a significant
main effect of temperature, a significant main effect of humidity, or a significant interaction effect
between temperature and humidity on plant growth.
Two-way ANOVA with replication has several advantages:
1. It allows you to investigate the effects of two independent variables (factors) and their interaction
on the dependent variable. This can help you identify complex relationships between variables that
cannot be observed using simple one-way ANOVA.
2. It takes into account the variability in the data due to both factors and their interaction, which can
lead to more accurate estimates of the true effects of the factors.
3. The replication of measurements within each combination of the factors increases the statistical
power of the analysis, meaning that you are more likely to detect significant effects if they exist.
4. Two-way ANOVA with replication can also help you to determine whether one factor has a stronger effect on the dependent variable than the other factor or whether their effects are comparable.
5. It can provide useful information for the planning of future experiments by identifying which factor or factors have the strongest effect on the dependent variable and whether there is an interaction
between them that needs to be considered in future studies.
Overall, two-way ANOVA with replication is a powerful statistical tool that can help you to better
understand the relationships between variables in your data and to make more informed decisions
based on your results.
Values to look out for in this test:
When performing a two-way ANOVA with replication, several values are important to look at to
interpret the results properly. Here are some of the key values to consider and their importance:
Sum of Squares (SS): SS is the measure of the total variation in the dependent variable that is attributed to the independent variables (factors) and their interaction. It helps to determine the extent
of the effects of the independent variables on the dependent variable.
Degrees of Freedom (df): df is the number of independent pieces of information used to estimate
the variance. The number of degrees of freedom for each factor and their interaction determines the
F-statistic, which is used to test the significance of the effect of each factor and their interaction.
Mean Square (MS): MS is the sum of squares divided by the degrees of freedom. It helps to determine the variance explained by each independent variable or factor and their interaction.
F-Statistic: The F-statistic is the ratio of the mean square due to the factor or interaction to the mean
square error. It tests whether the variation due to the factor or interaction is significantly different
from the variation due to random error.
p-value: The p-value is the probability of obtaining a test statistic as extreme or more extreme than
the one observed if the null hypothesis is true. It helps to determine whether the effect of each factor
or interaction is statistically significant or due to chance.
Effect size: Effect size measures the strength of the relationship between the independent variables
and the dependent variable. It can help to interpret the practical significance of the results.
In summary, the values to look for in a two-way ANOVA with replication are the sum of squares,
degrees of freedom, mean square, F-statistic, p-value, and effect size. These values are important for
determining the statistical significance and practical importance of the effect of each factor and their
interaction on the dependent variable.
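These quantities can also be checked by hand. The following Python sketch (standard library only; it sits outside the Excel workflow, and the small 2x2 dataset with three replicates per cell is invented purely for illustration) computes the SS, df, MS, and F values for a two-way ANOVA with replication:

```python
from itertools import product

# Hypothetical growth measurements: 2 temperature levels x 2 humidity
# levels, 3 replicate plants per cell (made-up numbers for illustration).
data = {
    ("low", "low"):   [10.1, 10.4, 9.9],
    ("low", "high"):  [12.0, 12.5, 11.8],
    ("high", "low"):  [11.2, 11.0, 11.5],
    ("high", "high"): [14.8, 15.1, 14.6],
}

temps = ["low", "high"]
hums = ["low", "high"]
r = 3                      # replicates per cell
a, b = len(temps), len(hums)
n = a * b * r

all_vals = [x for cell in data.values() for x in cell]
grand = sum(all_vals) / n

def mean(xs):
    return sum(xs) / len(xs)

# Marginal (factor-level) means and cell means
t_mean = {t: mean([x for h in hums for x in data[(t, h)]]) for t in temps}
h_mean = {h: mean([x for t in temps for x in data[(t, h)]]) for h in hums}
cell_mean = {k: mean(v) for k, v in data.items()}

# Sums of squares for each source of variation
ss_a = b * r * sum((t_mean[t] - grand) ** 2 for t in temps)
ss_b = a * r * sum((h_mean[h] - grand) ** 2 for h in hums)
ss_ab = r * sum(
    (cell_mean[(t, h)] - t_mean[t] - h_mean[h] + grand) ** 2
    for t, h in product(temps, hums)
)
ss_err = sum(
    (x - cell_mean[(t, h)]) ** 2
    for t, h in product(temps, hums)
    for x in data[(t, h)]
)
ss_tot = sum((x - grand) ** 2 for x in all_vals)

# Degrees of freedom, mean squares, and F-statistics
df_a, df_b, df_ab, df_err = a - 1, b - 1, (a - 1) * (b - 1), a * b * (r - 1)
ms_err = ss_err / df_err
f_a = (ss_a / df_a) / ms_err        # main effect of temperature
f_b = (ss_b / df_b) / ms_err        # main effect of humidity
f_ab = (ss_ab / df_ab) / ms_err     # interaction

print(f"SS_temp={ss_a:.3f}  SS_hum={ss_b:.3f}  SS_int={ss_ab:.3f}  SS_err={ss_err:.3f}")
print(f"F_temp={f_a:.2f}  F_hum={f_b:.2f}  F_int={f_ab:.2f}")
```

A useful sanity check is that the sums of squares partition the total variation exactly: the SS for factor A, factor B, the interaction, and the error add up to the total SS. Excel's Anova: Two-Factor With Replication tool reports the same table, together with p-values and critical F-values.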
Interpreting the results of a two-way ANOVA with replication involves examining several values,
including the F-statistic, p-value, mean square, and effect size. Here are some general guidelines for
interpreting the results:
1. Look at the F-statistic: The F-statistic tests whether the variation due to the factors or interaction
is significantly different from the variation due to random error. If the F-statistic is large and the
p-value is less than the significance level (typically 0.05), then you can conclude that at least one of
the factors or the interaction is statistically significant.
2. Look at the p-value: The p-value indicates the probability of obtaining the observed results by
chance. If the p-value is less than the significance level, then you can conclude that the effect of at
least one factor or the interaction is statistically significant.
3. Look at the mean square: The mean square helps to determine the variance explained by each factor or their interaction. If the mean square is large, then the factor or interaction may have a strong
effect on the dependent variable.
4. Look at the effect size: Effect size measures the strength of the relationship between the independent variables and the dependent variable. A large effect size indicates that the factors or interaction
have a strong practical significance, while a small effect size may have less practical significance.
5. Consider the nature of the factors and the interaction: Depending on the research question and
the factors involved, the interpretation of the results may vary. It is important to consider the nature
of the factors and the interaction when interpreting the results.
In summary, interpreting the results of a two-way ANOVA with replication involves considering the
F-statistic, p-value, mean square, and effect size. These values help to determine the statistical and
practical significance of the effects of the factors and their interaction on the dependent variable.
Mastering Statistical Analysis with Excel
8
Anova: Two-Factor without Replication
ANOVA (Analysis of Variance) is a statistical method used to analyze the differences between the
means of two or more groups. In a two-factor ANOVA without replication, there are two independent variables (factors) and each level of one factor is combined with each level of the other factor
to form all possible combinations. However, each combination is measured only once, resulting in no
replication of measurements.
For example, let’s say we want to study the effect of two factors, temperature and humidity, on plant
growth. We would have different levels of temperature (low, medium, high) and humidity (low, medium, high) and we would measure the growth of the plants under all possible combinations of temperature and humidity (i.e., low temperature and low humidity, low temperature and medium humidity, etc.). However, each combination is measured only once.
The main difference between a two-factor ANOVA with and without replication is that with replication, each combination is measured multiple times, leading to more accurate estimates of the true
effects of the factors and their interaction. Without replication, there is less information available to
estimate the variability due to the factors and their interaction.
In summary, a two-factor ANOVA without replication is a statistical method used to analyze the
differences between the means of groups formed by combining all possible levels of two independent
variables without replication. However, the lack of replication limits the accuracy of the estimates of
the true effects of the factors and their interaction.
In general, a two-factor ANOVA without replication is considered less ideal than a two-factor ANOVA with replication, as replication allows for more accurate estimation of the true effects of the factors
and their interaction. However, there may be some scenarios where a two-factor ANOVA without
replication is the only option or is still informative.
One scenario where a two-factor ANOVA without replication may be the only option is when it is not
feasible or practical to perform replication due to time, cost, or ethical constraints. For example, if
the study involves a rare or endangered species that cannot be easily obtained or if the measurements
involve invasive procedures that can only be performed once on each subject.
Another scenario where a two-factor ANOVA without replication may still be informative is when
the effects of the factors and their interaction are expected to be large and the variability due to other
sources is expected to be small. In this case, even though the lack of replication may lead to less accurate estimates of the true effects of the factors and their interaction, the large effect sizes may still be
detectable with reasonable power.
In summary, a two-factor ANOVA without replication may be used when replication is not feasible or
practical or when the effects of the factors and their interaction are expected to be large and the variability due to other sources is expected to be small. However, in general, a two-factor ANOVA with
replication is preferred as it allows for more accurate estimation of the true effects of the factors and
their interaction.
Here is a sample dataset that can be analyzed using a two-factor ANOVA without replication:
Suppose we want to study the effect of two factors, fertilizer type and watering frequency, on plant
height. We have three levels of fertilizer type (A, B, C) and three levels of watering frequency (low,
medium, high). We randomly assign each plant to one of the nine possible combinations of fertilizer
type and watering frequency and measure its height at the end of the growing season.
Fertilizer   Watering Frequency   Plant Height
----------------------------------------------
A            Low                  10.5
A            Medium               12.3
A            High                 15.6
B            Low                  11.2
B            Medium               13.1
B            High                 14.9
C            Low                  9.8
C            Medium               11.4
C            High                 16.2
Reasons for using a two-factor ANOVA without replication in this scenario could include limitations in resources or time, where it may not be feasible to measure each combination of fertilizer type
and watering frequency multiple times. Additionally, if the effects of the fertilizer type and watering
frequency are expected to be large relative to other sources of variation, a two-factor ANOVA without
replication may still provide useful insights into the study’s research question. However, it is important to note that this design has limitations and that the results should be interpreted with caution. In
general, a two-factor ANOVA with replication is preferred as it provides more accurate estimation of
the true effects of the factors and their interaction.
To perform a two-factor ANOVA without replication on the above data, we can use the following
steps:
1. Calculate the means for each combination of fertilizer type and watering frequency.
2. Calculate the overall mean of all plant heights.
3. Calculate the sum of squares (SS) for each source of variation: the fertilizer type (rows), the watering frequency (columns), and the error (residual). Without replication, the interaction cannot be separated from the error term.
4. Calculate the degrees of freedom (df) for each source of variation: df = k-1 for each factor, where k is the number of levels of that factor, and df = (r-1)(c-1) for the error, where r and c are the numbers of levels of the two factors.
5. Calculate the mean squares (MS) for each source of variation: MS = SS/df.
6. Calculate the F-ratio for each source of variation: F = MS(factor)/MS(error).
7. Compare the F-ratio for each source of variation to the critical F-value for the corresponding degrees of freedom and level of significance (usually alpha=0.05).
If the F-ratio is greater than the critical F-value, reject the null hypothesis and conclude that there
is a significant effect of the factor on plant height. Otherwise, fail to reject the null hypothesis and
conclude that there is no significant effect of the factor on plant height.
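The steps above can be sketched in Python (standard library only) for the fertilizer/watering dataset given earlier; this is a cross-check on the arithmetic, not part of the Excel workflow:

```python
# Plant heights from the sample dataset: rows = fertilizer (A, B, C),
# columns = watering frequency (Low, Medium, High), one observation per cell.
heights = {
    "A": {"Low": 10.5, "Medium": 12.3, "High": 15.6},
    "B": {"Low": 11.2, "Medium": 13.1, "High": 14.9},
    "C": {"Low": 9.8,  "Medium": 11.4, "High": 16.2},
}
ferts = list(heights)
waters = ["Low", "Medium", "High"]
r, c = len(ferts), len(waters)
n = r * c

vals = [heights[f][w] for f in ferts for w in waters]
grand = sum(vals) / n

# Steps 1-2: with no replication each cell holds a single value, so the
# "cell means" are the values themselves; compute row and column means.
f_mean = {f: sum(heights[f][w] for w in waters) / c for f in ferts}
w_mean = {w: sum(heights[f][w] for f in ferts) / r for w in waters}

# Step 3: sums of squares (the interaction is absorbed into the error term)
ss_f = c * sum((f_mean[f] - grand) ** 2 for f in ferts)
ss_w = r * sum((w_mean[w] - grand) ** 2 for w in waters)
ss_tot = sum((x - grand) ** 2 for x in vals)
ss_err = ss_tot - ss_f - ss_w

# Steps 4-6: degrees of freedom, mean squares, F-ratios
df_f, df_w, df_err = r - 1, c - 1, (r - 1) * (c - 1)
ms_err = ss_err / df_err
f_fert = (ss_f / df_f) / ms_err
f_water = (ss_w / df_w) / ms_err

print(f"Fertilizer: SS={ss_f:.2f}, F={f_fert:.2f}")
print(f"Watering:   SS={ss_w:.2f}, F={f_water:.2f}")
print(f"Error:      SS={ss_err:.2f} on {df_err} df")
```

Step 7 then compares each F-ratio with the critical value F(2, 4) ≈ 6.94 at alpha = 0.05: the watering-frequency effect (F ≈ 29.1) is significant, while the fertilizer effect (F ≈ 0.4) is not.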
We can use two-factor ANOVA without replication to analyze the given data. Here, the two factors
are Fertilizer and Watering Frequency, and the response variable is Plant Height.
First, we need to calculate the sum of squares for each factor and for the error term, along with the degrees of freedom and mean squares. The following table shows the calculations:
Source of Variation    df    SS       MS       F        F crit (0.05)
---------------------------------------------------------------------
Fertilizer (Rows)       2     0.54     0.27     0.40     6.94
Watering (Columns)      2    39.68    19.84    29.06*    6.94
Error                   4     2.73     0.68
---------------------------------------------------------------------
Total                   8    42.96

*The F-value for Watering Frequency exceeds the critical value for a significance level of 0.05, indicating that the main effect of Watering Frequency is significant.
The ANOVA table shows that the main effect of Fertilizer is not statistically significant, as the F-value for Fertilizer (0.40) is less than the critical value (6.94) for a significance level of 0.05. However, the main effect of Watering Frequency is statistically significant, as its F-value (29.06) exceeds the critical value.
Note that because each combination of factor levels is measured only once, the interaction between Fertilizer and Watering Frequency cannot be tested in this design: without replication, the interaction is confounded with the error term.
Next, we can perform a post-hoc analysis to determine which levels of Watering Frequency are
significantly different from each other. We can use Tukey’s HSD test for this purpose. The following
table shows the pairwise comparisons:
Comparison   Diff.   95% Conf. Interval   Significant?
------------------------------------------------------
Med-Low      1.77    (-0.64, 4.17)        No
High-Low     5.07    ( 2.66, 7.47)        Yes
High-Med     3.30    ( 0.90, 5.70)        Yes
The results show that there are statistically significant differences between the levels of Watering Frequency, with the High frequency level producing significantly taller plants than the Low and Medium frequency levels.
In conclusion, the two-factor ANOVA without replication shows that the main effect of Fertilizer is
not statistically significant, while the main effect of Watering Frequency is statistically significant.
The interaction between Fertilizer and Watering Frequency is not statistically significant. The posthoc analysis using Tukey’s HSD test confirms that there are statistically significant differences between the levels of Watering Frequency, with the High frequency level producing significantly taller
plants than the Low and Medium frequency levels.
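The Tukey comparison can be cross-checked with a short Python sketch. Note the assumptions: the error mean square (about 0.683 on 4 df) is recomputed from the raw data via the two-factor ANOVA, and the studentized-range critical value q(0.05; 3 groups, 4 df) ≈ 5.04 is taken from a published table rather than computed here:

```python
# Watering-frequency groups from the sample data (3 plants each)
groups = {
    "Low":    [10.5, 11.2, 9.8],
    "Medium": [12.3, 13.1, 11.4],
    "High":   [15.6, 14.9, 16.2],
}
n_per_group = 3
means = {g: sum(v) / len(v) for g, v in groups.items()}

# Error mean square and df from the two-factor ANOVA without replication
ms_error, df_error = 0.683, 4          # recomputed from the raw data
q_crit = 5.04                          # q(0.05; k=3, df=4), from a table

# Tukey's honestly significant difference
hsd = q_crit * (ms_error / n_per_group) ** 0.5

for g1, g2 in [("Medium", "Low"), ("High", "Low"), ("High", "Medium")]:
    diff = means[g1] - means[g2]
    verdict = "significant" if abs(diff) > hsd else "not significant"
    print(f"{g1}-{g2}: diff={diff:.2f}, HSD={hsd:.2f} -> {verdict}")
```

With three observations per watering level, the honestly significant difference works out to about 2.40, so the High-Low and High-Medium differences are declared significant while Medium-Low is not.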
Using Excel to perform this test:
The first step is to enter the data into Excel in columns and rows. After the data has been entered, the Data tab is clicked; this exposes another set of tabs. The Data Analysis tab is clicked next, which opens the Data Analysis window.
Image showing Data entered into Excel spreadsheet
In the Data Analysis menu, Anova: Two-Factor Without Replication is chosen. On clicking the OK button, the Anova: Two-Factor Without Replication screen opens up.
Image showing Data analysis screen where Anova: Two-Factor without Replication is chosen before
clicking OK button
Prof. Dr Balasubramanian Thiagarajan MS D.L.O.
In the ensuing Anova: Two-Factor Without Replication menu, the input range field is clicked. When the cursor starts to blink, the column containing the numeric values (Plant Height, in this case) is chosen. If the header is included in the selection, the check box before Labels should be checked. The Alpha value is left at the default setting (0.05). In the output field, the cells where the result is to be displayed are chosen and entered. On clicking the OK button, the result will be displayed in the output range of cells.
Image showing Anova: Two-Factor without Replication window with the values entered.
Image showing the result displayed
Pitfalls:
Two-factor ANOVA without replication is a statistical technique used to analyze the effects of two
independent variables on a dependent variable, where each independent variable has two or more
levels. However, there are several pitfalls to using this technique without replication, including:
1. Lack of precision: Without replication, the variability in the data cannot be adequately estimated,
which leads to less precise estimates of the effects of the independent variables on the dependent
variable.
2. Inability to test interaction effects: Interaction effects occur when the effect of one independent variable on the dependent variable depends on the level of the other independent variable. Without replication, the interaction cannot be separated from the error term and therefore cannot be tested, which can lead to inaccurate conclusions about the relationship between the independent and dependent variables.
3. Difficulty in generalizing results: The lack of replication makes it difficult to generalize the results
to other populations or settings. Replication allows for a more robust and reliable estimate of the true
effect of the independent variables on the dependent variable.
4. Increased risk of Type I errors: Type I errors occur when a significant effect is detected when there
is no true effect. Without replication, there is an increased risk of Type I errors due to the lack of
precision in the estimates of the effects of the independent variables on the dependent variable.
Overall, while two-factor ANOVA without replication can provide some insight into the relationship between two independent variables and a dependent variable, it is important to be aware of the
limitations and potential pitfalls of this technique. It is often preferable to use a design that includes
replication in order to obtain more precise and reliable estimates of the effects of the independent
variables on the dependent variable.
9
Correlation
In statistics, correlation refers to the degree to which two or more variables are related to each other. It
is a measure of the strength and direction of the linear relationship between two variables. Correlation
is typically expressed as a correlation coefficient, which is a numerical value that ranges from -1 to 1.
A positive correlation indicates that as one variable increases, the other variable tends to increase as
well. A negative correlation indicates that as one variable increases, the other variable tends to decrease. A correlation coefficient of 0 indicates that there is no linear relationship between the two
variables.
Correlation is used in many fields, including economics, finance, psychology, and biology. It can help
researchers understand the relationships between different variables and how they may influence each
other. Correlation does not, however, imply causation, and it is important to be cautious when interpreting correlation results. Other factors may be influencing the variables being studied, and correlation does not necessarily indicate a cause-and-effect relationship between them.
Correlation plays a crucial role in biomedical research, as it helps researchers understand the relationships between different variables in the context of health and disease. Some specific roles of correlation in biomedical research include:
1. Identification of risk factors: Correlation can be used to identify potential risk factors for a disease
or health condition. For example, researchers might use correlation to identify variables that are associated with an increased risk of heart disease, such as high blood pressure, high cholesterol, or smoking.
2. Prediction of outcomes: Correlation can also be used to predict the likelihood of certain outcomes
based on the presence or absence of certain variables. For example, researchers might use correlation
to predict the likelihood of a patient developing complications after surgery based on their age, health
status, and other factors.
3. Monitoring disease progression: Correlation can be used to track the progression of a disease over
time and identify factors that may be influencing the disease course. For example, researchers might
use correlation to monitor the relationship between changes in tumor size and changes in biomarker
levels in cancer patients.
4. Identifying treatment targets: Correlation can be used to identify potential treatment targets by
identifying variables that are strongly correlated with disease severity or progression. For example,
researchers might use correlation to identify proteins or genes that are strongly correlated with disease
progression in order to develop targeted therapies.
Overall, correlation is a powerful tool for identifying relationships between variables in biomedical
research. By understanding the correlations between different variables, researchers can better understand the underlying mechanisms of disease and develop more effective prevention and treatment
strategies.
Here is an example of sample data that can be used for correlation analysis:
Suppose we want to examine the correlation between a person’s age and their weight. We collect data
from a random sample of 20 people, and record their age (in years) and weight (in pounds). The data
might look like this:
Age (years)   Weight (lbs)
--------------------------
32            145
43            180
26            130
39            175
41            200
29            140
52            220
31            150
47            195
34            160
28            135
45            190
37            170
48            200
30            155
42            185
25            125
35            165
36            170
27            135
We can then use correlation analysis to determine the strength and direction of the relationship
between age and weight. We would calculate the correlation coefficient (such as Pearson’s correlation
coefficient) and determine if the correlation is statistically significant. This analysis can help us understand if there is a relationship between age and weight in our sample, and can inform future research
or interventions related to these variables.
Steps for calculating the correlation coefficient (Pearson’s r) using the sample data above:
Step 1: Calculate the mean (average) and standard deviation of both variables (age and weight) using Excel or a statistical software package. For this example, the means and standard deviations are:
Mean age = 36.35 years, standard deviation = 8.00 years
Mean weight = 166.25 lbs, standard deviation = 26.90 lbs
Step 2: Calculate the deviations from the mean for each data point. This involves subtracting the mean value of each variable from the actual value for that data point. For example, for the first data point (32 years, 145 lbs), the deviations from the mean would be:
Deviation from mean age = 32 - 36.35 = -4.35
Deviation from mean weight = 145 - 166.25 = -21.25
Step 3: Calculate the product of the deviations for each data point. This involves multiplying the deviation from the mean for age by the deviation from the mean for weight for each data point. For example, for the first data point, the product of the deviations would be:
Product of deviations = (-4.35) x (-21.25) = 92.44
Step 4: Calculate the sum of the products of deviations for all data points. For this example, the sum of the products of deviations is 3,981.25.
Step 5: Calculate the correlation coefficient (Pearson’s r) using the formula:
r = sum of products of deviations / [(n-1) x (standard deviation of age) x (standard deviation of weight)]
For this example, the correlation coefficient (r) is:
r = 3,981.25 / (19 x 8.00 x 26.90) = 0.97
This indicates a strong positive correlation between age and weight in the sample data. We can also calculate a p-value to determine if the correlation is statistically significant at a certain level of significance (e.g. p < 0.05).
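The five steps can be reproduced with a short Python sketch (standard library only), using the age/weight sample above; dividing the sum of products by sqrt(SS_age x SS_weight) is algebraically the same as dividing by (n-1) times the two standard deviations:

```python
ages = [32, 43, 26, 39, 41, 29, 52, 31, 47, 34,
        28, 45, 37, 48, 30, 42, 25, 35, 36, 27]
weights = [145, 180, 130, 175, 200, 140, 220, 150, 195, 160,
           135, 190, 170, 200, 155, 185, 125, 165, 170, 135]
n = len(ages)

# Step 1: means of both variables
mean_age = sum(ages) / n
mean_wt = sum(weights) / n

# Steps 2-4: deviations from the mean and the sum of their products
dev_a = [a - mean_age for a in ages]
dev_w = [w - mean_wt for w in weights]
sum_products = sum(da * dw for da, dw in zip(dev_a, dev_w))

# Step 5: r = sum of products / sqrt(SS_age * SS_weight)
ss_age = sum(d * d for d in dev_a)
ss_wt = sum(d * d for d in dev_w)
r = sum_products / (ss_age * ss_wt) ** 0.5

print(f"mean age = {mean_age:.2f}, mean weight = {mean_wt:.2f}, r = {r:.2f}")
```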
Using Excel to calculate correlation:
You can use Excel to calculate the correlation coefficient (Pearson’s r) for the sample data as follows:
Step 1: Enter the data into an Excel worksheet. For this example, you can enter the Age data into
column A and the Weight data into column B.
Step 2: Select an empty cell where you want to display the correlation coefficient.
Step 3: Type the following formula into the empty cell:
=CORREL(A2:A21,B2:B21)
This formula uses the “CORREL” function in Excel to calculate the correlation coefficient between
the two columns of data (A2:A21 and B2:B21, respectively). Note that the first row of the data is
assumed to be column headings, so we start at row 2.
Step 4: Press “Enter” to calculate the correlation coefficient.
Excel will display the correlation coefficient in the selected cell. For the sample data provided earlier, the correlation coefficient should be approximately 0.97, indicating a strong positive correlation between age and weight. If you want to further interpret the correlation coefficient, you can calculate
the p-value to determine if the correlation is statistically significant.
Image showing formula for calculating correlation coefficient entered
Image showing Correlation coefficient displayed on pressing the ENTER key
The correlation coefficient can also be calculated from the Data Analysis menu in Excel. This does not involve any keying of formulas by the user.
Step I: Data entry into the Excel spreadsheet.
Step II: The Data tab is clicked, exposing another set of tabs. The Data Analysis tab is clicked next. It opens the Data Analysis window, in which Correlation is selected and the OK button is clicked. This opens up the Correlation window.
Image showing Data Analysis window where Correlation is chosen
Image showing correlation window where the fields are entered
In the Correlation window, the cursor is placed over the Input Range field. When the cursor starts to blink, the rows containing the numerical data are selected. If the header is also selected, then the Labels in First Row box is checked. The cursor is then placed in the Output Range field and the cell where the result needs to be displayed is selected; the address of the selected cell is automatically entered into this field. On clicking the OK button, the result is displayed.
Image showing the result displayed
Perfectly correlated data would show a correlation coefficient of 1. Any value close to 1 indicates that the variables studied are strongly correlated.
The ideal graph for correlation is a scatter plot, which is a type of graph that displays the relationship
between two quantitative variables.
In a scatter plot, each data point is represented by a point on the graph, with the x-axis representing
one variable and the y-axis representing the other variable. If there is a strong positive correlation between the two variables, the points will tend to cluster in a line sloping upwards from left to right. If
there is a strong negative correlation, the points will tend to cluster in a line sloping downwards from
left to right. If there is no correlation, the points will be scattered randomly across the graph.
In addition to the scatter plot, other types of graphs can also be useful for displaying correlations,
such as line graphs or heat maps. However, the scatter plot remains the most widely used and versatile graph for visualizing correlations between two quantitative variables.
Creating scatterplot to identify correlation using Excel:
A scatterplot can be created in Excel by selecting the data columns and then clicking on the Insert menu tab. Then the Recommended Charts tab is clicked. This will open up all the possible recommended charts that can be created for this data. In the window that opens, Scatter is chosen. Excel then displays the data as a scatterplot in the spreadsheet.
Image showing Scatterplot created
Types of Correlation:
Correlation is a statistical technique used to measure the strength and direction of the relationship
between two variables. There are different types of correlation, including:
1. Pearson Correlation: The Pearson correlation measures the linear relationship between two continuous variables. It ranges from -1 to 1, with values close to -1 indicating a strong negative correlation, values close to 1 indicating a strong positive correlation, and values close to 0 indicating no
correlation.
To use Pearson’s correlation, you need to have two continuous variables that are normally distributed. Here is an example of test data that could be used to calculate Pearson’s correlation:
Variable 1: Height (in inches) of 10 individuals - 65, 67, 68, 70, 72, 73, 74, 76, 77, 79
Variable 2: Weight (in pounds) of the same 10 individuals - 130, 140, 145, 155, 165, 170, 180, 190,
195, 200
You can use Pearson’s correlation to determine if there is a relationship between height and weight.
A positive correlation would indicate that as height increases, weight tends to increase as well. A
negative correlation would indicate that as height increases, weight tends to decrease. A correlation
coefficient close to zero would indicate that there is no relationship between the two variables.
To calculate Pearson’s correlation using Excel, you can use the “CORREL” function. Here are the
steps:
1. Enter your data into two columns in an Excel worksheet.
2. Click on an empty cell where you want to display the correlation coefficient.
3. Type “=CORREL(“ into the cell, then select the range of data for the first variable, type a comma,
and then select the range of data for the second variable. For example, if your data is in columns A
and B from rows 2 to 10, you would type “=CORREL(A2:A10,B2:B10)”.
4. Press Enter to calculate the correlation coefficient. The result will be a value between -1 and 1,
where a value close to 1 indicates a strong positive correlation, a value close to -1 indicates a strong
negative correlation, and a value close to 0 indicates no correlation.
Note that Excel’s Correlation tool in the Data Analysis menu can be used to calculate the correlation coefficients for multiple pairs of variables simultaneously, producing a correlation matrix.
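The correlation matrix idea can be illustrated in Python: the height and weight columns below repeat the ten-person sample above, while the age column is invented for the example. Each entry of the matrix is simply Pearson’s r for one pair of columns:

```python
# Three variables; each inner list is one column of data.
columns = {
    "height": [65, 67, 68, 70, 72, 73, 74, 76, 77, 79],
    "weight": [130, 140, 145, 155, 165, 170, 180, 190, 195, 200],
    "age":    [21, 25, 22, 30, 28, 33, 35, 31, 40, 38],   # invented
}

def pearson(xs, ys):
    """Pearson's r for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

names = list(columns)
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

for a in names:
    print(a, {b: round(matrix[a][b], 2) for b in names})
```

The matrix is symmetric with 1 on the diagonal; Excel’s Correlation tool reports the same pairwise coefficients (it prints only the lower triangle, since the upper half would be redundant).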
2. Spearman Correlation: The Spearman correlation measures the monotonic relationship between
two variables, which means it captures the direction and strength of the relationship between the
variables without assuming that the relationship is linear. It ranges from -1 to 1, with values close to
-1 indicating a strong negative monotonic correlation, values close to 1 indicating a strong positive
monotonic correlation, and values close to 0 indicating no monotonic correlation.
Spearman’s correlation is used when the relationship between two variables is not linear, but rather a
monotonic relationship exists. A monotonic relationship is one in which the variables tend to increase or decrease together, but not necessarily at a constant rate.
Spearman’s correlation is useful when the data are ordinal, meaning that the values represent an ordered scale, but the exact differences between the values are not meaningful. For example, if you are
examining the relationship between the rank of students in a class and their test scores, the rank is an
ordinal variable because it represents the order of the students, but the exact difference between the
ranks is not meaningful. In this case, Spearman’s correlation would be more appropriate than Pearson’s correlation because Pearson’s correlation assumes that the relationship between the variables is
linear.
Spearman’s correlation is also robust to outliers and non-normality in the data, making it a good
choice when the assumptions of normality and homoscedasticity are not met.
In summary, Spearman’s correlation should be used when:
The relationship between two variables is not linear, but rather a monotonic relationship exists.
The data are ordinal in nature.
The assumptions of normality and homoscedasticity are not met.
The presence of outliers is suspected.
Excel has no built-in SPEARMAN function. To calculate Spearman’s correlation in Excel, you first convert each variable to ranks and then apply the “CORREL” function to the ranks. Here are the steps:
1. Enter your data into two columns in an Excel worksheet.
2. In two new columns, rank each variable using the RANK.AVG function, which assigns tied values the average of their ranks. For example, if your data is in columns A and B from rows 2 to 10, enter “=RANK.AVG(A2,$A$2:$A$10,1)” in the first rank column and “=RANK.AVG(B2,$B$2:$B$10,1)” in the second, and fill these formulas down.
3. Click on an empty cell where you want to display the correlation coefficient and apply the CORREL function to the two rank columns, for example “=CORREL(C2:C10,D2:D10)” if the ranks are in columns C and D.
4. Press Enter to calculate the correlation coefficient. The result will be a value between -1 and 1, where a value close to 1 indicates a strong positive monotonic correlation, a value close to -1 indicates a strong negative monotonic correlation, and a value close to 0 indicates no monotonic correlation.
Note that Excel’s “RANK.EQ” function can be used instead of “RANK.AVG” if you want to handle ties differently; RANK.AVG is the usual choice for Spearman’s correlation.
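The rank-then-correlate recipe is easy to mirror in Python, which also makes the tie handling of RANK.AVG explicit. This sketch is an illustration of the method, not Excel’s own code:

```python
def rank_avg(values):
    """Average ranks, as Excel's RANK.AVG assigns them (ties share the mean rank)."""
    return [
        sum(1 for u in values if u < v) + (sum(1 for u in values if u == v) + 1) / 2
        for v in values
    ]

def pearson(xs, ys):
    """Pearson's r for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman(xs, ys):
    """Spearman's rho = Pearson's r applied to the ranks."""
    return pearson(rank_avg(xs), rank_avg(ys))

# A perfectly monotonic but non-linear relationship:
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman(x, y))   # prints 1.0: monotonic agreement is perfect
print(pearson(x, y))    # less than 1: the relationship is not linear
```

On this example, Spearman’s rho is exactly 1 because y increases whenever x does, while Pearson’s r falls short of 1 because the relationship is monotonic but not linear.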
Example:
Two variables are used in this case.
Variable 1 - Age
Variable 2 - BMI
These values are entered into an Excel spreadsheet.
To calculate Spearman’s correlation coefficient, the first step is to rank both variables. For this purpose, two more columns are created in Excel:
1. Rank Age
2. Rank BMI
In total, four columns are created.
Image showing Data entered
Prof. Dr Balasubramanian Thiagarajan MS D.L.O.
In the next step, the ranks for both variables are calculated. To rank the Age variable the
following formula is used:
=RANK.AVG(A2,A2:A12,1)
Image showing formula for calculating Rank entered into the cell
Image showing Rank value displayed when ENTER key is pressed
In order to use the autofill feature of Excel when calculating the ranks, the following formula is used so that the reference range stays fixed as the formula is filled down. This is customized to this example:
=RANK.AVG(A2,$A$2:$A$12,1)
Image showing Rank values displayed for Age
Calculation of the Rank BMI column.
The following formula is used to calculate the Rank BMI column:
=RANK.AVG(B2,$B$2:$B$12,1)
The cell where the result should be displayed is selected and the above formula is entered. The same will
be reflected in the formula bar. On pressing the ENTER key the calculated value is displayed.
On dragging the cell handle downwards, the cells below are filled with the calculated values. The user need not key the formula into each cell in order to perform the calculation.
The autofill feature does the job.
Image showing calculation under Rank BMI column displayed when ENTER key is pressed.
Image showing all the cells under Rank BMI filled with data. The autofill feature of Excel can be utilized
for this purpose by pulling the handle (a small dot at the bottom right corner of the cell) downwards.
The correlation coefficient can now be calculated by applying the CORREL function to the two rank columns. The formula used is:
=CORREL(C2:C12,D2:D12)
This formula should be entered into the empty cell where the correlation value is to be displayed. On
pressing the ENTER key the value is displayed.
Image showing Correlation value displayed on pressing ENTER key
3. Kendall Correlation: The Kendall correlation measures the strength of the association between two
variables based on the number of concordant and discordant pairs of observations. It ranges from
-1 to 1, with values close to -1 indicating a strong negative association, values close to 1 indicating a
strong positive association, and values close to 0 indicating no association.
Kendall correlation is a measure of the strength and direction of association between two variables
that are both ordinal (i.e., measured on an ordinal scale). It measures the similarity in the orderings
of the values of the two variables, regardless of their magnitudes.
The Kendall correlation coefficient is denoted by the symbol tau (τ) and ranges from -1 to 1, with -1
indicating a perfect negative association, 1 indicating a perfect positive association, and 0 indicating
no association between the variables.
Kendall correlation is used in a variety of fields, such as social sciences, economics, psychology, and
biology, to analyze relationships between variables that are not necessarily normally distributed or
linearly related. It is particularly useful when dealing with nonparametric data, where the variables
are ranked or ordered, rather than measured on an interval or ratio scale.
Some common applications of Kendall correlation include studying the association between the
ranks of students in a class and their test scores, investigating the relationship between the order of
finish in a race and the age of the participants, or examining the correlation between the rankings of
different brands of a product and their sales performance.
Here is a test data set you can use to perform Kendall correlation in Excel:
Variable 1	Variable 2
3	2
2	1
1	3
4	5
5	4
6	6
8	7
7	8
Excel does not have a built-in function for Kendall correlation, and the CORREL function returns Pearson’s coefficient rather than Kendall’s tau (applied to ranked data such as this, CORREL would give Spearman’s coefficient instead). Kendall’s tau must therefore be computed from pair counts: tau = (C - D) / [n(n - 1)/2], where C and D are the numbers of concordant and discordant pairs. For the data set above, 24 of the 28 possible pairs are concordant and 4 are discordant, so tau = (24 - 4) / 28 ≈ 0.714. In Excel this can be done by counting concordant and discordant pairs with helper formulas, since no single worksheet function performs the calculation.
Data is entered into Excel spreadsheet as shown below:
Image showing Data entered into Excel spreadsheet
Prof. Dr Balasubramanian Thiagarajan MS D.L.O.
Image showing the formula for calculating the Kendall correlation entered into the cell where the result
should be displayed
Image showing Kendall’s correlation value displayed on pressing the ENTER key. Since this value is positive
and fairly close to 1, the dataset should be considered positively correlated.
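The pair-counting definition of Kendall's tau can be checked directly. A minimal Python sketch using the data set above:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1     # the pair is ordered the same way in both variables
        elif s < 0:
            discordant += 1     # the pair is ordered oppositely
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

v1 = [3, 2, 1, 4, 5, 6, 8, 7]   # data set from the text
v2 = [2, 1, 3, 5, 4, 6, 7, 8]
print(round(kendall_tau(v1, v2), 4))   # → 0.7143
```

With 28 pairs in total, 24 concordant and 4 discordant, the sketch returns (24 - 4)/28 ≈ 0.714.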
4. Point-Biserial Correlation: The Point-Biserial correlation measures the relationship between a
dichotomous variable and a continuous variable. It ranges from -1 to 1, with values close to -1 indicating a negative relationship, values close to 1 indicating a positive relationship, and values close to
0 indicating no relationship.
Point-Biserial Correlation is a statistical measure that examines the relationship between two variables: one dichotomous (binary) and one continuous. The Point-Biserial Correlation coefficient (rpb)
indicates the strength and direction of the linear association between the dichotomous variable and
the continuous variable.
The Point-Biserial Correlation coefficient ranges from -1 to +1. A correlation of +1 indicates a perfect positive relationship, meaning that as one variable increases, so does the other. A correlation of
-1 indicates a perfect negative relationship, meaning that as one variable increases, the other decreases. A correlation of 0 indicates no relationship between the two variables.
The Point-Biserial Correlation is often used in research when one variable is a binary variable (e.g.,
gender, yes/no response to a question, etc.) and the other is a continuous variable (e.g., age, income,
etc.). For example, researchers might use the Point-Biserial Correlation to explore the relationship
between gender (a binary variable) and salary (a continuous variable) to see if there is a significant
difference in pay between men and women in a particular industry.
The Point-Biserial Correlation can be calculated using statistical software such as SPSS or Excel. It
can also be calculated by hand using the formula:
rpb = [(M1 - M0) / sn] × √(p × q)
Where M1 is the mean of the continuous variable for cases where the binary variable is 1, M0 is the
mean of the continuous variable for cases where the binary variable is 0, sn is the standard deviation of all the values of the continuous variable, p is the proportion of cases where the binary variable is 1, and q = 1 - p.
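A useful property of this formula: with the dichotomous variable coded 0/1, the point-biserial coefficient (computed with the population standard deviation of all scores) is identical to the ordinary Pearson correlation, which is what Excel's CORREL returns. A minimal Python check with hypothetical data:

```python
from statistics import mean, pstdev

def point_biserial(binary, continuous):
    """r_pb = (M1 - M0) / s_n * sqrt(p * q), with s_n the population SD of all scores."""
    ones = [c for b, c in zip(binary, continuous) if b == 1]
    zeros = [c for b, c in zip(binary, continuous) if b == 0]
    p = len(ones) / len(binary)
    q = 1 - p
    return (mean(ones) - mean(zeros)) / pstdev(continuous) * (p * q) ** 0.5

def pearson(x, y):
    """Pearson correlation, the quantity Excel's CORREL returns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

group = [1, 1, 1, 0, 0, 0]        # hypothetical dichotomous variable (0/1 coded)
score = [85, 90, 88, 70, 75, 72]  # hypothetical continuous variable

print(round(point_biserial(group, score), 4))  # → 0.9659
print(round(pearson(group, score), 4))         # → 0.9659 (same value)
```

This is why step 4 (the hand formula) and the CORREL approach described next give the same answer.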
To perform a Point-Biserial Correlation using Excel, you can follow these steps:
1. Open a new Excel spreadsheet and enter your data into two columns. One column should contain
the dichotomous variable, and the other column should contain the continuous variable.
2. Calculate the mean and standard deviation of the continuous variable.
3. Calculate the proportion of “successes” (i.e., the presence of the dichotomous variable) in the sample.
4. Calculate the point-biserial correlation coefficient using the formula:
r_pb = [(mean of the continuous variable for the “successes” – mean of the continuous variable for
the “failures”) / (standard deviation of all the continuous values)] × √(proportion of
“successes” × proportion of “failures”)
5. Alternatively, use Excel’s built-in function, “CORREL,” to calculate the point-biserial correlation coefficient.
To do this, enter the following formula into an empty cell:
=CORREL(dichotomous variable column, continuous variable column)
6. Make sure to replace “dichotomous variable column” and “continuous variable column” with the
appropriate column letters or cell references. The dichotomous variable must be coded numerically (e.g., 0 and 1) for CORREL to work.
7. Compare the calculated point-biserial correlation coefficient to a critical value from a t-distribution table to determine whether the correlation is statistically significant. The degrees of freedom
should be N - 2, where N is the sample size.
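The significance check in the last step uses the standard conversion of a correlation coefficient to a t statistic, t = r·√[(N - 2)/(1 - r²)]. A short sketch with hypothetical values of r and N:

```python
def t_statistic(r, n):
    """t statistic for testing a correlation coefficient; degrees of freedom = n - 2."""
    return r * ((n - 2) / (1 - r ** 2)) ** 0.5

# hypothetical example: r_pb = 0.40 observed in a sample of 30 cases
t = t_statistic(0.40, 30)
print(round(t, 3))   # → 2.309; compare against a t table with 28 degrees of freedom
```

At the 5% level the two-tailed critical value for 28 degrees of freedom is about 2.048, so an r of 0.40 with N = 30 would be judged statistically significant.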
Sample data:
Gender	Math Score
Female	85
Male	73
Female	92
Male	68
Female	78
Male	89
Female	95
Male	82
Female	79
Male	74
Female	88
Male	67
Female	91
Male	85
Female	82
Male	91
Female	76
Male	83
Female	89
Male	70
Female	94
Male	76
Female	87
Male	63
Female	92
Male	84
Female	90
Male	78
Female	83
Male	72
Female	86
Male	81
Female	88
Male	77
Female	81
Male	71
Female	92
Male	66
Female	84
Male	80
Female	87
Male	79
Female	93
Male	73
Female	89
Male	75
Female	81
Male	69
Female	90
Male	85
Female	94
Male	76
Female	88
Male	68
To calculate the Point-Biserial Correlation for the above data using Excel, you can follow these steps:
1. Enter your data into two columns in an Excel spreadsheet, with the gender variable in one column
and the math scores in another column.
2. Calculate the mean of the math scores by using the AVERAGE function in Excel. In the cell where
you want to display the mean, type =AVERAGE(B2:B55) and press Enter. This will calculate the
mean of the math scores.
3. Calculate the standard deviation of the math scores by using the STDEV.S function in Excel. In the
cell where you want to display the standard deviation, type =STDEV.S(B2:B55) and press Enter. This
will calculate the standard deviation of the math scores.
4. Use the IF function to create a new column that indicates whether each student is male or female.
In the first cell of the new column, type =IF(A2="Male",1,0) and press Enter, then fill the formula down. This will create a new
column with a 1 for male students and a 0 for female students.
5. Calculate the Point-Biserial Correlation by using the CORREL function in Excel. In the cell where
you want to display the correlation, type =CORREL(C2:C55,B2:B55) and press Enter. This will calculate the Point-Biserial Correlation between the gender variable and the math scores.
The correlation value displayed is -0.188. This indicates a negative correlation between the sex of
the student and the marks secured.
Image showing the data entered and the formula for calculating the mean of the marks secured
entered.
Image showing the mean value of the marks scored calculated. The formula for calculating the standard deviation of the marks secured can be seen entered
Image showing the standard deviation calculated on pressing the ENTER key
Image showing the use of the IF formula to convert Male and Female to 1 and 0 respectively
Image showing a new column created to convert Male to 1 and Female to 0. The correlation calculation
formula can be seen entered
Image showing the correlation value calculated as -0.188. This indicates a negative correlation
5. Phi Correlation: The Phi correlation measures the relationship between two dichotomous variables. It ranges from -1 to 1, with values close to -1 indicating a strong negative relationship, values
close to 1 indicating a strong positive relationship, and values close to 0 indicating no relationship.
Phi correlation is a type of correlation coefficient that measures the strength and direction of the
association between two binary variables. It is also known as the phi coefficient or the phi statistic.
Phi correlation is calculated by first creating a 2x2 contingency table that shows the frequency distribution of the two binary variables. The contingency table has two rows, one for each category of the
first variable (usually called “A” and “not A”), and two columns, one for each category of the second
variable (usually called “B” and “not B”). The contingency table looks like this:
	B	not B
A	a	c
not A	b	d
Where “a” represents the frequency of observations that are in both category A and category B, “b”
represents the frequency of observations that are in category not A and category B, “c” represents the
frequency of observations that are in category A and category not B, and “d” represents the frequency
of observations that are in category not A and category not B.
Phi correlation is calculated using the following formula:
phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))
Phi correlation can range from -1 to 1, with negative values indicating a negative association and
positive values indicating a positive association. A value of 0 indicates no association between the
two variables. The magnitude of the correlation coefficient indicates the strength of the association,
with larger values indicating a stronger association.
Here is an example of sample data for calculating phi correlation:
Suppose we have a dataset that contains information about the gender and smoking status of a group
of individuals. We want to determine if there is an association between gender and smoking status.
We can create a 2x2 contingency table to summarize the data as follows:
	Smoker	Non-smoker
Male	20	30
Female	10	40
We can use this contingency table to calculate the phi correlation as follows:
1. Calculate a, b, c, and d:
a = 20 (number of males who smoke)
b = 30 (number of males who do not smoke)
c = 10 (number of females who smoke)
d = 40 (number of females who do not smoke)
2. Calculate the sums of the rows and columns:
a+b = 50 (total number of males)
c+d = 50 (total number of females)
a+c = 30 (total number of smokers)
b+d = 70 (total number of non-smokers)
3. Calculate the phi correlation using the formula:
phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))
= ((20 × 40) - (30 × 10)) / sqrt(50 × 50 × 30 × 70)
= 500 / 2291.29
≈ 0.218
Therefore, the phi correlation between gender and smoking status in this dataset is approximately 0.218, indicating
a weak positive association between the two variables.
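The hand calculation can be double-checked with a short Python sketch of the phi formula:

```python
from math import sqrt

def phi(a, b, c, d):
    """Phi coefficient from 2x2 cell counts: phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# a = male smokers, b = male non-smokers, c = female smokers, d = female non-smokers
print(round(phi(20, 30, 10, 40), 4))   # → 0.2182
```

The numerator is 20·40 - 30·10 = 500 and the denominator is √(50·50·30·70) ≈ 2291.29, matching the value above.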
You can use Excel to calculate the phi correlation of the above data using the following steps:
1. Enter the data into a 2x2 contingency table in Excel, with one row for each category of the first
variable (in this case, gender) and one column for each category of the second variable (in this case,
smoking status).
2. Calculate the totals for each row and column using Excel formulas. For example, you can use the
SUM function to calculate the totals for each row and column. In the example data, the totals for
each row and column are:

	Smoker	Non-smoker	Total
Male	20	30	50
Female	10	40	50
Total	30	70	100
3. Calculate a, b, c, and d using Excel formulas. In the example data, a = 20, b = 30, c = 10, and d =
40.
4. Calculate the phi correlation using the formula:
phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))
In Excel, assuming the counts are laid out as above with Male in row 2 (B2 = 20, C2 = 30) and Female in row 3 (B3 = 10, C3 = 40), you can use the following formula to calculate the phi correlation:
=(B2*C3-C2*B3)/SQRT((B2+C2)*(B3+C3)*(B2+B3)*(C2+C3))
5. Press Enter to calculate the phi correlation. The result should be approximately 0.218, as calculated above.
10. Descriptive Statistics

Descriptive statistics is a branch of statistics that involves the collection, analysis, and presentation of
data in order to describe and summarize a set of observations. It is concerned with the numerical
and graphical methods used to summarize and present the main features of a data set, such as measures
of central tendency (e.g., mean, median, mode) and measures of variability (e.g., range, variance, standard
deviation). Descriptive statistics can be used to gain insights into the characteristics of a population or
sample, to identify patterns and trends in data, and to provide a basis for further statistical analysis.
The components of descriptive statistics include:
1. Measures of central tendency: These are numerical measures that describe the center of a data set.
They include the mean, median, and mode.
2. Measures of variability: These are numerical measures that describe the spread or dispersion of a
data set. They include the range, variance, and standard deviation.
3. Frequency distributions: These are tables or graphs that show how often each value or range of values occurs in a data set.
4. Histograms: A graphical representation of the frequency distribution of a continuous variable,
divided into intervals (called bins) along the x-axis and showing the frequency or proportion of data
points in each bin.
5. Box plots: A graphical representation of the distribution of a continuous variable that displays the
median, quartiles, and outliers of the data.
6. Scatter plots: A graphical representation of the relationship between two variables, with each data
point plotted as a point on a two-dimensional coordinate system.
7. Measures of association: These are numerical measures that describe the strength and direction of
the relationship between two variables. They include correlation coefficients and regression analysis.
All of these components are used to summarize and describe the characteristics of a data set.
Measures of Central Tendency:
Measures of central tendency are numerical measures that describe the center or typical value of a
data set. There are three commonly used measures of central tendency:
1. Mean: The arithmetic average of a set of values. It is calculated by adding up all the values and dividing by the total number of values.
2. Median: The middle value in a set of ordered values. It is the value that separates the upper and
lower halves of the data set.
3. Mode: The most frequent value in a data set. It is the value that occurs with the highest frequency.
Measures of central tendency are used to provide a general idea of the typical value of a data set. The
choice of which measure to use depends on the nature of the data and the purpose of the analysis. The
mean is typically used when the data is normally distributed and there are no extreme values (outliers). The median is used when the data has outliers or is skewed. The mode is used when the data is
categorical or nominal.
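The guidance above on choosing a measure can be illustrated with a small hypothetical data set containing one extreme value. Python's standard library mirrors the three Excel functions:

```python
from statistics import mean, median, mode

data = [4, 7, 7, 9, 12, 15, 100]  # hypothetical data with one outlier (100)

print(mean(data))    # → 22: pulled upward by the outlier
print(median(data))  # → 9: unaffected by the outlier
print(mode(data))    # → 7: the most frequent value
```

The outlier drags the mean well above most of the observations, while the median and mode stay representative of the bulk of the data, which is exactly why the median is preferred for skewed data.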
Mean
The mean is a measure of central tendency that represents the average value of a set of numbers. It
is calculated by adding up all the values in the data set and dividing the sum by the total number of
values. The mean is commonly denoted by the symbol “μ” for a population mean, or “x̄” (x-bar) for a sample
mean.
The formula for calculating the mean is:
μ = (Σx) / n
or
x̄ = (Σx) / n
where Σx is the sum of all the values in the data set and n is the total number of values.
The mean is a useful measure of central tendency because it takes into account all the values in the
data set. However, it can be affected by outliers or extreme values that are much larger or smaller than
the rest of the values. In such cases, the median or mode may be more appropriate measures of central
tendency.
Using Excel to calculate Mean value of a dataset:
To calculate the mean using Excel, follow these steps:
1. Enter your data into a column of an Excel worksheet.
2. Click on an empty cell where you want to display the mean.
3. Type the following formula into the cell: “=AVERAGE(A1:A10)” (without quotes). This formula
assumes that your data is in cells A1 through A10. Replace these cell references with the actual range
of your data.
4. Press Enter on your keyboard. The mean will be calculated and displayed in the cell.
You can also use the AutoSum feature in Excel to calculate the mean:
1. Enter your data into a column of an Excel worksheet.
2. Click on an empty cell below the last value in the column.
3. Click the arrow next to the AutoSum button (Σ) on the Home tab of the Ribbon and choose “Average” from the drop-down menu.
4. Excel will automatically select the range of cells containing your data. Press Enter on your keyboard.
5. The mean will be calculated and displayed in the cell.
Image showing formula for calculating Mean entered into an empty cell
Image showing Mean value displayed when Enter key is pressed after keying the formula in an empty
cell
Another way of calculating the mean value of a dataset is to use the built-in calculation functions of Excel.
Look for the Sigma (Σ) icon on the right side of the top menu bar.
Select the cell where you want to display the mean value.
Click on the Σ icon to open the submenu, and in the submenu choose Average. This creates the
function within the selected cell. Now select the cells containing the numeric data; on
selecting the data, the range can be seen entered within the formula.
On pressing the Enter key the result is displayed.
Calculating the mean is important in many areas, including statistics, finance, and science. Here are
some reasons why:
1. Understanding central tendency: The mean is a measure of central tendency, which helps to summarize the data and understand its distribution. It provides a single value that represents the average
of the data, making it easier to understand and analyze.
Image showing the Sigma (Σ) icon which, when clicked, opens up the various built-in calculation functions
Image showing the formula for calculating the mean entered automatically when the cells are selected. On pressing the Enter key the mean value is displayed.
2. Making comparisons: The mean can be used to compare different groups of data. For example, if
you want to compare the average salary of two companies, you can calculate the mean for each company and compare the results.
3. Identifying outliers: Outliers are data points that are significantly different from the rest of the
data. Calculating the mean can help identify outliers, which can be important in detecting errors or
anomalies in the data.
4. Predicting values: In some cases, the mean can be used to predict future values. For example, if
you are analyzing stock prices, you can calculate the mean of past prices and use it to predict future
prices.
5. Evaluating performance: The mean can be used to evaluate performance, such as in sports or academics. For example, you can calculate the mean of a team’s scores and use it to evaluate their overall
performance.
Overall, the mean is a useful statistical tool that helps to summarize and analyze data, make comparisons, and identify outliers.
Median:
The median is a measure of central tendency that represents the middle value in a data set. It is the
value that separates the upper and lower halves of the data set. To find the median, the data must first
be sorted in ascending or descending order.
If the data set has an odd number of values, then the median is the middle value. For example, if the
data set is 3, 5, 7, 9, 11, then the median is 7.
If the data set has an even number of values, then the median is the average of the two middle values.
For example, if the data set is 2, 4, 6, 8, then the median is (4 + 6) / 2 = 5.
The median is a useful measure of central tendency when the data has outliers or is skewed. It is
not affected by extreme values in the same way as the mean. However, it can be less precise than the
mean since it only takes into account the middle value(s) of the data set.
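The odd/even rule described above is straightforward to express in code. A minimal Python sketch, using the same two example data sets as the text:

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values when n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # odd n: single middle value
    return (s[mid - 1] + s[mid]) / 2   # even n: average of the two middle values

print(median([3, 5, 7, 9, 11]))  # → 7   (odd number of values)
print(median([2, 4, 6, 8]))      # → 5.0 (average of 4 and 6)
```

Note that sorting first is essential, just as Excel's MEDIAN function internally works on the ordered data.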
To calculate the median using Excel, you can use the MEDIAN function.
Here are the steps:
1. Enter your data into a column in Excel.
2. Click on an empty cell where you want to display the median.
3. Type the formula =MEDIAN(A1:A10) where “A1:A10” represents the range of cells containing
your data. Replace it with the actual range you are using.
4. Press “Enter” on your keyboard.
Excel will then calculate the median of your data and display the result in the cell you selected.
Image showing formula for calculating Median value of a dataset entered into an empty cell
Image showing the value of Median entered when Enter button is pressed
Mode:
In statistics, the mode is the value that appears most frequently in a dataset. It is a measure of central
tendency, like the mean and median, and provides information about the most common or typical
value in the dataset.
The mode can be useful in several ways:
1. Describing the data: The mode provides information about the most common value in the dataset,
which can be useful in describing the characteristics of the data.
2. Identifying trends: If the mode occurs more frequently than other values, it can indicate a trend or
pattern in the data.
3. Data analysis: The mode can be used in data analysis to help identify important features or variables in the dataset.
4. Comparing datasets: The mode can be used to compare different datasets and determine which
has the most similar distribution of values.
5. Quality control: The mode can be used in quality control to identify values that occur frequently
and may need to be checked for accuracy.
Overall, the mode is a useful statistical measure that provides information about the most common
value in a dataset. It can be used in a variety of ways, from describing the data to identifying trends
and analyzing the quality of data.
Using Excel to calculate Mode:
To calculate the mode using Excel, you can use the MODE function. Here are the steps:
1. Enter your data into a column in Excel.
2. Click on an empty cell where you want to display the mode.
3. Type the formula =MODE(A1:A10) where “A1:A10” represents the range of cells containing your
data. Replace it with the actual range you are using.
4. Press “Enter” on your keyboard.
Excel will then calculate the mode of your data and display the result in the cell you selected. If there
are multiple modes in your data set, MODE returns only a single one of them; use the MODE.MULT
function to return all modes. If no value repeats in the data set, the function returns the #N/A error. Note that Excel’s MODE function only works
with numbers, not with text or logical values.
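The multiple-modes behavior can be illustrated with Python's `statistics.multimode`, which plays a role similar to Excel's MODE.MULT (the comparison to Excel is an analogy, not an exact equivalence):

```python
from statistics import multimode

# two values tie for most frequent: both are returned
print(multimode([1, 2, 2, 3, 3, 4]))  # → [2, 3]

# when no value repeats, every value is equally frequent
print(multimode([1, 2, 3]))           # → [1, 2, 3]
```

Unlike Excel's MODE, `multimode` never raises an error on data with no repeats; it simply returns all values, so the caller must decide how to interpret that case.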
Image showing formula to calculate mode entered
Image showing Mode value displayed when Enter Key is pressed
Measures of variability:
Measures of variability are statistical measures that describe the spread or dispersion of a dataset.
These measures provide information about how far apart the data points are from the central tendency of the dataset. The three most common measures of variability are range, variance, and standard
deviation.
Range: The range is the difference between the largest and smallest values in a dataset. It provides a
simple way to assess the spread of the data, but it is sensitive to outliers.
Variance: The variance is a measure of how much the data deviates from the mean. It is calculated by
taking the average of the squared differences between each data point and the mean. A higher variance indicates greater variability in the data.
Standard deviation: The standard deviation is the square root of the variance. It provides a measure
of the spread of the data around the mean. A higher standard deviation indicates greater variability
in the data.
It is important to ascertain the measures of variability in a dataset because they provide additional
information about the data beyond the central tendency (mean, median, mode). Variability measures
help to assess how spread out the data is, which can be useful in determining the precision of the
data or in identifying outliers. In addition, measures of variability are used in many statistical analyses to test hypotheses or to calculate confidence intervals.
Range:
The range of a dataset is the difference between the largest and smallest values in the dataset. It is a
simple measure of variability that provides important information about the spread of the data. Here
are some reasons why ascertaining the range of a dataset is important:
1. It provides a quick and easy way to assess the spread of the data. The range can be calculated
quickly and easily, even for large datasets. This makes it a useful tool for getting a general sense of the
variability in the data.
2. It can help identify outliers. Outliers are data points that are much larger or smaller than the other
values in the dataset. They can skew the results of analyses, so it is important to identify and deal
with them appropriately. The range can help identify outliers because they will fall outside the range
of most of the other data points.
3. It can help in decision making. If the range of a dataset is small, it suggests that the data is relatively consistent and predictable. This can be useful information for decision making, such as in business
planning or forecasting.
4. It can aid in comparing datasets. The range can be used to compare the variability of two or more
datasets. For example, if the range of one dataset is much larger than another, it suggests that the first
dataset has more variability or is more spread out than the second dataset.
In summary, the range of a dataset is an important measure of variability that provides valuable
information about the spread of the data. It is a useful tool for identifying outliers, making decisions,
and comparing datasets.
Using Excel to calculate the range in a dataset:
To calculate the range of a dataset using Excel, follow these steps:
1. Open Microsoft Excel and enter your data into a new worksheet.
2. Select an empty cell where you want to display the range.
3. Type the following formula into the cell: “=MAX(data range) - MIN(data range)”, where “data
range” is the range of cells that contains your data.
For example, if your data is in cells A1 through A10, the formula would be “=MAX(A1:A10) - MIN(A1:A10)”.
4. Press “Enter” to calculate the range of your data.
Excel will return the difference between the maximum and minimum values in your dataset, which
is the range of your data.
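The same MAX-minus-MIN calculation can be sketched in Python with hypothetical values:

```python
def data_range(values):
    """Range = maximum - minimum, as in the Excel formula above."""
    return max(values) - min(values)

print(data_range([12, 7, 23, 15, 9]))  # → 16 (23 - 7)
```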
Variance:
In statistics, variance is a measure of how spread out a set of data is from its mean or expected value.
It is the average of the squared differences from the mean.
More formally, the population variance of a dataset with n observations is calculated as follows:
variance = (1/n) * Σ(xi - mean)^2
(For a sample variance, the sum is divided by n - 1 instead of n.)
Where:
xi is the ith observation in the dataset.
mean is the mean or average of the dataset.
Σ is the sum of the values from i=1 to n.
Variance is an important statistical concept because it provides a way to quantify the variability or
dispersion of a dataset. The larger the variance, the more spread out the data is, while a smaller variance indicates that the data is more tightly clustered around the mean.
Variance is used in many statistical analyses, such as hypothesis testing, regression analysis, and
ANOVA. It helps researchers to understand how much the data varies from the mean and whether
differences between groups or variables are statistically significant.
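The population and sample versions of the variance formula (the quantities Excel's VAR.P and VAR.S compute) differ only in the divisor. A minimal Python sketch with hypothetical data:

```python
from statistics import mean

def variance_p(values):
    """Population variance: (1/n) * sum((x - mean)^2), like Excel's VAR.P."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)

def variance_s(values):
    """Sample variance: divides by n - 1 instead, like Excel's VAR.S."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]      # mean is 5, squared deviations sum to 32
print(variance_p(data))              # → 4.0
print(variance_s(data))              # → about 4.5714 (32 / 7)
```

The sample version divides by n - 1 to correct the bias that arises from estimating the mean from the same data.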
Image showing the formula for calculating the range in a given dataset entered. On pressing the Enter key the
value gets displayed
Image showing the range value displayed
Calculating variance using Excel:
To calculate the variance of a set of data using Microsoft Excel, you can use the built-in function
VAR.S or VAR.P depending on whether you are calculating the sample variance or population variance. Here are the steps to calculate variance using Excel:
1. Enter your data into a column in an Excel worksheet.
2. In an empty cell, type “=VAR.S(“ followed by the range of cells containing your data enclosed
in parentheses. For example, if your data is in cells A1 to A10, your formula should look like this:
=VAR.S(A1:A10).
If you want to calculate the population variance, use the function VAR.P instead of VAR.S.
3. Press “Enter” to calculate the variance.
4. The result will be displayed in the cell where you entered the formula.
Note that Excel also has a function called STDEV.S (or STDEV.P) that calculates the standard deviation, which is simply the square root of the variance. So, if you want to calculate the standard deviation instead of the variance, you can use that function instead of VAR.S (or VAR.P).
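The VAR.S / VAR.P distinction can be sketched with Python's standard library (the dataset below is illustrative): statistics.pvariance divides by n, like VAR.P, while statistics.variance divides by n - 1, like VAR.S.

```python
import statistics

# Illustrative dataset, standing in for cells A1:A10
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(data)   # divides by n, like Excel's VAR.P
samp_var = statistics.variance(data)   # divides by n - 1, like Excel's VAR.S

print(pop_var, samp_var)  # 4.0 and 32/7 ≈ 4.571
```

For a small dataset the two values differ noticeably; as n grows, the difference between dividing by n and by n - 1 shrinks.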
Image showing formula for variance entered into an empty cell
Image showing value of variance displayed on pressing Enter key
Standard deviation:
In statistics, the standard deviation is a measure of how spread out a set of data is from its mean or
expected value. It is the square root of the variance, and is denoted by the symbol σ (sigma) for the
population standard deviation or s for the sample standard deviation.
The standard deviation measures the average amount by which each observation deviates from the
mean of the dataset. A small standard deviation indicates that the data points are tightly clustered
around the mean, while a large standard deviation indicates that the data points are more spread out
from the mean.
Standard deviation is important in statistics for several reasons:
1. It provides a way to quantify the variability of a dataset. By calculating the standard deviation, we
can understand how much the data varies from the mean and whether differences between groups or
variables are statistically significant.
2. It is used in many statistical tests, such as the t-test and ANOVA, to determine whether differences
between groups or variables are statistically significant.
3. It is used in the calculation of confidence intervals, which provide a range of values within which
we can be reasonably confident that the true population mean falls.
4. It is used in the construction of many statistical models, such as linear regression, to help identify
patterns and relationships in the data.
Overall, the standard deviation is a fundamental statistical concept that is used to summarize and
analyze data in many different ways.
Using Excel to calculate standard deviation of a dataset:
To calculate the standard deviation of a set of data using Microsoft Excel, you can use the built-in
function STDEV.S or STDEV.P depending on whether you are calculating the sample standard deviation or population standard deviation. Here are the steps to calculate standard deviation using Excel:
1. Enter your data into a column in an Excel worksheet.
2. In an empty cell, type “=STDEV.S(“ followed by the range of cells containing your data enclosed
in parentheses. For example, if your data is in cells A1 to A10, your formula should look like this:
=STDEV.S(A1:A10).
If you want to calculate the population standard deviation, use the function STDEV.P instead of
STDEV.S.
3. Press “Enter” to calculate the standard deviation.
4. The result will be displayed in the cell where you entered the formula.
Note that Excel also has a function called VAR.S (or VAR.P) that calculates the variance. If you have
already calculated the variance using this function, you can calculate the standard deviation by taking the square root of the variance. For example, if your variance is in cell A11, you can calculate the
standard deviation in cell A12 by typing “=SQRT(A11)”.
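The square-root relationship between variance and standard deviation can be checked in Python as well (same illustrative dataset as before):

```python
import math
import statistics

# Illustrative dataset, standing in for a data column
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_sd = statistics.pstdev(data)   # like Excel's STDEV.P
samp_sd = statistics.stdev(data)   # like Excel's STDEV.S

# The standard deviation is the square root of the variance,
# mirroring =SQRT(A11) applied to a cell holding the variance
assert math.isclose(pop_sd, math.sqrt(statistics.pvariance(data)))
print(pop_sd)  # 2.0
```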
Image showing formula for calculating standard deviation of a dataset entered
Image showing standard deviation of a dataset displayed on pressing Enter Key
Frequency distribution:
Frequency distribution is a way to summarize and present data in a tabular format that shows how
often each value or range of values occurs in a dataset. It displays the number of occurrences or frequency of each value, making it easier to visualize and analyze the data.
Frequency distribution can be calculated for both numerical and categorical data. For numerical
data, it is common to group the values into intervals or bins to make the table more manageable. For
categorical data, the frequency distribution simply lists the count or proportion of each category.
Frequency distribution is important because it provides valuable insights into the underlying patterns and characteristics of a dataset. By analyzing the distribution of values, you can identify the
central tendency (mean, median, mode) and the variability (range, standard deviation) of the data.
You can also identify any outliers or unusual values that may skew the analysis.
Overall, frequency distribution is a useful tool for summarizing and visualizing large amounts of
data, allowing you to gain a better understanding of the distribution of values and make more informed decisions.
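The counting behind a frequency distribution takes only a few lines of Python; the survey responses below are illustrative:

```python
from collections import Counter

# Illustrative categorical data
responses = ["yes", "no", "yes", "yes", "no", "undecided", "yes"]

# Counter tallies how often each category occurs
freq = Counter(responses)
print(freq)  # Counter({'yes': 4, 'no': 2, 'undecided': 1})
```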
Image showing formula for calculating frequency distribution entered
Image showing frequency distribution displayed on pressing the ENTER key
Another way to calculate the frequency distribution of a dataset in Excel is to use the PivotTable
feature.
Enter the data into the Excel spreadsheet.
Click the Insert tab. This reveals a few more tabs, from which the PivotTable tab is chosen.
From the PivotTable menu, choose From Table/Range.
Image showing data entered into spreadsheet and Pivot table menu chosen
When From Table/Range is chosen, a new dialog box opens.
Click in the Table/Range field of the dialog box. As soon as the cursor starts to blink, select the
column containing the data. The selected cell range is entered into the field automatically.
In the next field, choose Existing Worksheet by clicking on the radio button next to it.
Click in the Location field. As soon as the cursor starts to blink, select the cells where the results
should be displayed. The address is entered into the field automatically.
In the PivotTable Fields pane, Marks is dragged to both the Rows and Values areas, as shown in the
image below.
Image showing the new Pivot table fields.
The Values area will list Sum of Marks. Click on the down arrow next to it to open a menu, and
choose Value Field Settings.
Image showing Value field settings menu
Clicking Value Field Settings opens a new window.
In this window, “Summarize value field by” is set to Sum by default. Change this setting to Count,
then click the OK button.
Image showing Value Field settings window where Count is chosen and OK button is pressed.
Image showing Pivot window displaying the count of each data.
Place the cursor over one of the values in the PivotTable and right-click. This opens a submenu, in
which Group is chosen.
Image showing Group Menu
Image showing the Grouping window
In the ensuing Grouping window (shown above), both the minimum and maximum values of the dataset
are already entered. By default, the bin width (the “By” value) is 10. This value can be changed as
per the user’s preference; in this case the default value of 10 suffices and is left unaltered. On
pressing the OK button, the frequency of the values is displayed as shown in the image below.
Image showing Frequency distribution table for the given set of data created.
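The grouping that the PivotTable performs can be sketched in Python. The marks below are illustrative; the bin width of 10 matches the default “By” value in the Grouping window:

```python
from collections import Counter

marks = [34, 56, 61, 45, 78, 52, 67, 39, 71, 58]  # illustrative marks

bin_width = 10  # the default "By" value in Excel's Grouping window

# Assign each mark to the bin that starts at a multiple of 10,
# e.g. 34 -> bin starting at 30, 56 -> bin starting at 50
bins = Counter((m // bin_width) * bin_width for m in marks)

for start in sorted(bins):
    print(f"{start}-{start + bin_width - 1}: {bins[start]}")
```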
Histograms:
A histogram is a graphical representation of the distribution of a dataset. It shows the frequency of
observations that fall within specified ranges or “bins” of values.
Histograms are commonly used in data analysis to visualize the distribution of continuous data, such
as heights, weights, or test scores. They are particularly useful when working with large datasets because they allow you to see patterns and trends that might not be apparent in a table of numbers.
In addition to providing an overview of the data, histograms can also be used to identify outliers,
estimate the central tendency of the distribution, and assess the degree of skewness or asymmetry in
the data.
Histograms are frequently used in many fields, including statistics, finance, and data science. They
are often used in data preprocessing, exploratory data analysis, and data visualization, and are an
essential tool for understanding and interpreting data.
To create a histogram in Excel, you can follow these general steps:
1. Enter your data into a worksheet. Make sure the data is organized in a single column or row and
does not contain any empty cells.
2. Click on the “Insert” tab in the Excel ribbon.
3. In the “Charts” section, click on the “Histogram” icon.
4. Select the data range that you want to include in the histogram.
5. Click “OK” to create the histogram.
Here are the detailed steps:
1. Enter your data into a worksheet in a single column or row. For example, let’s say you have the
following heights in centimeters: 170, 172, 175, 178, 180, 182, 183, 185, 187, 190.
2. Click on the “Insert” tab in the Excel ribbon.
3. In the “Charts” section, click on the “Histogram” icon.
4. Select the data range that you want to include in the histogram. In this case, select the range containing the height data.
5. Click “OK” to create the histogram.
Excel will generate a histogram with default settings. You can modify the chart to fit your needs, such
as changing the bin width or adding axis labels. To modify the chart, simply click on the chart and
then use the “Chart Design” and “Format” tabs in the Excel ribbon.
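The binning behind the histogram can be checked by hand. Using the height data above and an assumed bin width of 5 cm (Excel chooses its own default width):

```python
heights = [170, 172, 175, 178, 180, 182, 183, 185, 187, 190]

bin_width = 5  # assumed bin width; Excel picks its own default
counts = {}
for h in heights:
    # Lower edge of the bin containing h, e.g. 172 -> 170
    start = (h // bin_width) * bin_width
    counts[start] = counts.get(start, 0) + 1

print(counts)  # {170: 2, 175: 2, 180: 3, 185: 2, 190: 1}
```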
Image showing data entered into a spreadsheet
Image showing Recommended charts tab
Image showing chart menu
Image showing Histogram chosen as the chart type
Image showing histogram generated
Histograms are a useful tool for visualizing the distribution of a dataset. Here are some advantages of
histograms:
1. Easy to interpret: Histograms are easy to interpret and understand. They provide a visual representation of the distribution of data and the frequency of occurrence of each value or range of values.
2. Identify patterns and trends: Histograms can help identify patterns and trends in data. They can
reveal important features of the data, such as the location and shape of the distribution, outliers, and
the spread of the data.
3. Useful for large datasets: Histograms are useful for large datasets because they can summarize the
data in a clear and concise manner. They can also help identify any sub-groups or clusters within the
data.
4. Facilitate data analysis: Histograms facilitate data analysis by providing a quick overview of the
data. They can be used to compare different datasets, to identify trends over time, or to analyze the
impact of different variables on the data.
5. Flexible across numeric data: Histograms can be used with continuous or discrete numeric data; for categorical data, a bar chart of category counts serves the same purpose.
6. Easy to create: Histograms are easy to create using most statistical software packages or programming languages. This means that they can be used by researchers, analysts, and students with little or
no prior experience in data visualization.
Here’s an example of another dataset that you can use to generate a histogram:
Suppose you want to visualize the distribution of the ages of students in a college:
Age
18
21
20
19
22
25
23
18
20
19
24
20
22
21
18
Using this dataset, you can create a histogram to visualize the frequency of different age groups. The
X-axis of the histogram represents the age groups, and the Y-axis represents the frequency or number of students in each age group. By analyzing the histogram, you can identify the most common
age group, the spread of the age groups, and any outliers or unusual patterns in the data.
The above data is entered into an Excel spreadsheet in a single column, and the entire column is
selected.
Click on the Insert tab. This reveals some more tabs; click the Recommended Charts tab.
In the ensuing menu, choose the All Charts tab.
In the All Charts menu, choose Histogram, which will reveal the various histogram styles that can be
used to present the selected data. Choose the most appropriate one, and the chart is generated.
Image showing the histogram generated for the age data. Note that the data has been grouped into three age bins that together cover all the students.
Scatter Plots:
A scatter plot is a type of data visualization that displays the relationship between two continuous
variables. In a scatter plot, each data point is represented by a point on a two-dimensional graph,
where one variable is plotted on the X-axis and the other variable is plotted on the Y-axis.
Scatter plots are used in statistical analysis to explore the relationship between two variables and to
identify patterns or trends in the data. Specifically, scatter plots can be used to:
1. Identify trends: A scatter plot can reveal whether there is a positive or negative relationship between two variables. For example, if the points on the scatter plot tend to form a line that slopes
upward from left to right, it indicates a positive relationship between the two variables. If the points
tend to form a line that slopes downward, it indicates a negative relationship.
2. Identify outliers: Scatter plots can also help identify outliers or unusual data points that are far
away from the general pattern of the data.
3. Assess correlation: Scatter plots can be used to assess the strength of the relationship between two
variables by calculating the correlation coefficient. The correlation coefficient is a statistical measure
that quantifies the degree of association between two variables.
4. Compare groups: Scatter plots can be used to compare the relationship between two variables
across different groups. For example, you can create a scatter plot that shows the relationship between height and weight for men and women separately, and compare the patterns of the two groups.
Overall, scatter plots are a useful tool in statistical analysis because they provide a quick and easy way
to visualize the relationship between two variables and to identify patterns or trends in the data.
Here is an example dataset that you can use to generate a scatter plot:
Suppose you want to visualize the relationship between the number of hours studied and the exam
scores of a group of students:
Hours Studied    Exam Score
2                60
3                70
5                85
1                45
4                80
6                90
3                65
2                55
4                75
5                85
Using this dataset, you can create a scatter plot to visualize the relationship between the number of
hours studied and the exam scores. The X-axis of the scatter plot represents the number of hours
studied, and the Y-axis represents the exam scores. Each point on the scatter plot represents one
student’s data.
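Before charting, the direction of the relationship can be checked numerically. Using the same hours-studied and exam-score data, the least-squares slope of score on hours is positive, confirming the upward trend the scatter plot will show:

```python
hours = [2, 3, 5, 1, 4, 6, 3, 2, 4, 5]
scores = [60, 70, 85, 45, 80, 90, 65, 55, 75, 85]

n = len(hours)
mean_h = sum(hours) / n
mean_s = sum(scores) / n

# Least-squares slope: covariance of (hours, scores)
# divided by the variance of hours
num = sum((h - mean_h) * (s - mean_s) for h, s in zip(hours, scores))
den = sum((h - mean_h) ** 2 for h in hours)
slope = num / den

print(slope)  # positive slope -> positive relationship
```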
To create a scatter plot in Excel, follow these steps:
1. Open Excel and enter your data into a new spreadsheet.
2. Select the range of data that you want to use for the scatter plot.
3. Click on the “Insert” tab in the ribbon at the top of the screen.
4. Click on “Scatter” in the “Charts” group.
5. Select the type of scatter plot that you want to create. For example, you can choose a simple scatter
plot with dots, or you can choose a scatter plot with lines connecting the dots.
Excel will create a scatter plot based on your data. You can customize the appearance of the scatter
plot by adding labels, titles, and other formatting options using the chart tools that appear on the
ribbon when you select the chart.
Image showing Scatterplot generated
There are several advantages of using a scatterplot to visualize the relationship between two variables:
1. Identify patterns and trends: A scatterplot is a useful tool for identifying patterns and trends in the
relationship between two variables. By examining the scatterplot, you can quickly identify whether
the two variables are positively or negatively related, or if there is no relationship at all.
2. Visualize data distribution: Scatterplots allow you to visualize the distribution of data points for
two variables. This can help identify any outliers or unusual patterns in the data.
3. Display large datasets: Scatterplots can display large datasets easily, making it possible to visualize
thousands of data points on a single graph.
4. Compare groups: Scatterplots can be used to compare the relationship between two variables
across different groups. For example, you can create a scatterplot that shows the relationship between
height and weight for men and women separately, and compare the patterns of the two groups.
5. Assess correlation: Scatterplots can be used to assess the strength of the relationship between two
variables by calculating the correlation coefficient. The correlation coefficient is a statistical measure
that quantifies the degree of association between two variables.
6. Communicate findings: Scatterplots are an effective way to communicate findings to others. They
are easy to understand and can be used to convey complex information in a simple, visual format.
Overall, scatterplots are a versatile and powerful tool for visualizing relationships between two variables and exploring patterns and trends in large datasets.
While scatterplots are a useful tool for visualizing the relationship between two variables, there are
some scenarios where they may not be appropriate. Here are a few examples:
1. Categorical data: Scatterplots are designed to visualize the relationship between two continuous
variables. If one or both of the variables are categorical, a scatterplot may not be the best choice. In
these cases, a different type of chart, such as a bar chart or a pie chart, may be more appropriate.
2. Non-linear relationships: Scatterplots assume that the relationship between the two variables is linear, meaning that the relationship can be described by a straight line. If the relationship between the
variables is non-linear, such as a curved relationship or a relationship that involves more than two
variables, a scatterplot may not accurately represent the data.
3. Outliers: If there are a large number of outliers in the data, they can distort the pattern of the relationship between the variables. In these cases, it may be more appropriate to use a different type of
chart, such as a box plot or a histogram, to better understand the distribution of the data.
4. Missing data: If there are missing data points in the dataset, a scatterplot may not be appropriate.
In these cases, you may need to use statistical methods to impute missing values before creating a
scatterplot.
In summary, scatterplots are a powerful tool for visualizing the relationship between two continuous
variables, but there are some scenarios where they may not be the best choice. It’s important to carefully consider the data and the research question before choosing a data visualization method.
Box Plots:
A box plot, also known as a box and whisker plot, is a graphical representation of the distribution of
a dataset. It displays the median, quartiles, and outliers of the data in a compact and easy-to-interpret
format.
A box plot consists of a box that represents the interquartile range (IQR) of the data, which is the
range that includes the middle 50% of the dataset. The box is bounded by the lower and upper quartiles (Q1 and Q3, respectively), which mark the 25th and 75th percentiles of the data. The median,
which is the middle value of the dataset, is represented by a horizontal line inside the box. The “whiskers” of the plot extend from the box to the minimum and maximum values of the data that are not
outliers.
Box plots are used in a variety of settings to visually represent the distribution of a dataset. Some
common applications include:
1. Descriptive statistics: Box plots are often used to summarize the distribution of a dataset in a clear
and concise manner. They provide a quick and easy way to compare the central tendency and variability of different datasets.
2. Outlier detection: Box plots can help identify outliers in a dataset that may be of interest or indicate errors in the data collection process.
3. Statistical analysis: Box plots can be used to compare the distribution of a variable across different
groups or conditions in a statistical analysis. This can help identify differences or similarities between
groups that may be of interest.
4. Quality control: Box plots are commonly used in quality control processes to monitor the variability of a production process over time. They can help identify changes in the variability of the process
that may indicate a problem.
Overall, box plots are a versatile and useful tool for summarizing the distribution of a dataset and
identifying patterns and outliers. They are widely used in statistics, data analysis, and quality control
applications.
Sample data for generating a box plot:
Let’s consider the following dataset:
10, 12, 13, 15, 16, 17, 19, 20, 21, 23, 25, 26, 27, 29, 30, 33, 35, 36, 40, 45
Steps for generating a box plot in Excel:
1. Enter the data into a column in an Excel spreadsheet.
2. Highlight the data and select the “Insert” tab at the top of the Excel window.
3. In the “Charts” group, select “Insert Statistic Chart” and then select “Box and Whisker” from the
dropdown menu.
4. A box plot will be generated, with the minimum, maximum, median, and quartiles displayed. If
you want to customize the box plot, you can right-click on it and select “Format Chart Area” to make
changes to the appearance or labels.
5. If you need to update the data in the box plot, simply modify the original data in the spreadsheet
and the chart will automatically update.
That’s it! With these simple steps, you can create a box plot in Excel to visualize the distribution of
your data.
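The five-number summary behind the box plot can be checked with Python's statistics module, using the same dataset. Note that statistics.quantiles uses the "exclusive" method by default, so the quartiles may differ slightly from Excel's inclusive QUARTILE calculation:

```python
import statistics

data = [10, 12, 13, 15, 16, 17, 19, 20, 21, 23,
        25, 26, 27, 29, 30, 33, 35, 36, 40, 45]

# Quartile cut points: Q1, median, Q3
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # the interquartile range: the height of the box

print(min(data), q1, median, q3, max(data))
```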
Image showing data entered. Entered data is selected and Box plot type of chart is selected from
Insert tab.
Image showing Box Plot generated
Box plots are a useful tool for summarizing the distribution of a dataset, but there are some scenarios
where they may not be appropriate or useful. Here are a few examples:
1. Small sample sizes: Box plots are less informative when applied to small sample sizes because they
can be highly sensitive to outliers, and outliers can have a significant impact on the quartile ranges
used to construct the plot. In general, a sample size of at least 20 is recommended for constructing
reliable box plots.
2. Non-numeric data: Box plots are designed for use with numeric data. If your dataset contains
non-numeric data, such as categorical or ordinal data, a box plot may not be the best way to summarize it.
3. Skewed distributions: Box plots assume that the data are roughly symmetric, with equal amounts
of data above and below the median. If your data are highly skewed, with a long tail on one side, the
box plot may not accurately represent the distribution.
4. Extreme outliers: If your dataset contains extreme outliers, they can cause the box plot to be highly
distorted. In some cases, it may be more appropriate to remove outliers or to use a different visualization technique altogether.
5. Multiple modes: Box plots are designed for use with unimodal distributions, meaning distributions with a single peak. If your dataset contains multiple modes, or multiple distinct groups of data,
a box plot may not be the best way to represent it.
In general, it is always a good idea to explore multiple visualization techniques when summarizing
a dataset, and to choose the one that is most appropriate for your particular dataset and research
question.
Measures of Association:
Measures of association are statistical techniques used to quantify the strength and direction of
the relationship between two or more variables in a dataset. These measures can help to determine
whether two variables are related, and the nature of that relationship.
Here are some commonly used measures of association in statistics:
1. Correlation coefficient: The correlation coefficient is a measure of the strength and direction of the
linear relationship between two continuous variables. The coefficient can range from -1 to +1, with
values closer to -1 indicating a strong negative correlation, values closer to +1 indicating a strong
positive correlation, and values closer to 0 indicating a weak or no correlation.
2. Chi-squared test: The chi-squared test is used to determine whether there is a relationship between two categorical variables. The test compares the observed frequencies of each category to the
expected frequencies, and produces a chi-squared statistic that can be compared to a critical value to
determine statistical significance.
3. Odds ratio: The odds ratio is used to measure the strength and direction of the association between two categorical variables, particularly in cases where one variable is binary (e.g. yes/no) and
the other variable has more than two categories. The odds ratio compares the odds of a particular
outcome occurring in one group to the odds of the same outcome occurring in another group.
4. Regression analysis: Regression analysis is a more complex measure of association that is used to
model the relationship between one dependent variable and one or more independent variables. Regression analysis can be used to determine the strength and direction of the relationship between the
variables, as well as to predict the value of the dependent variable based on the values of the independent variables.
These are just a few examples of measures of association in statistics. The choice of measure will depend on the nature of the data and the research question being investigated.
The correlation coefficient is a statistical measure of the strength and direction of the linear relationship between two continuous variables. It is denoted by the symbol “r” and ranges from
-1 to +1.
A value of -1 indicates a perfect negative correlation, meaning that as one variable increases, the
other variable decreases. A value of +1 indicates a perfect positive correlation, meaning that as one
variable increases, the other variable also increases. A value of 0 indicates no correlation, meaning
that there is no linear relationship between the two variables.
The calculation of the correlation coefficient involves several steps:
1. Calculate the mean of each variable.
2. Calculate the standard deviation of each variable.
3. Calculate the covariance between the two variables.
4. Divide the covariance by the product of the two standard deviations.
The resulting value will be the correlation coefficient, with a value of -1 indicating a perfect negative
correlation, a value of +1 indicating a perfect positive correlation, and a value of 0 indicating no
correlation.
The correlation coefficient can be useful in many applications, such as determining whether there is
a relationship between two variables, identifying trends in data, and predicting future values of one
variable based on the values of another variable. However, it should be noted that correlation does
not imply causation, and other factors may be responsible for any observed relationship between the
variables.
Here is a sample data set for calculating the correlation coefficient:
X: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Y: 5, 7, 8, 9, 11, 12, 13, 14, 16, 17
To calculate the correlation coefficient using Excel, follow these steps:
1. Enter the data into two columns in an Excel worksheet.
2. Select an empty cell where you want to display the correlation coefficient.
3. Type the formula “=CORREL(X,Y)” into the selected cell, where “X” and “Y” represent the cell
ranges containing the two sets of data.
4. Press Enter to calculate the correlation coefficient.
Excel will then calculate the correlation coefficient between the two variables, which is approximately 0.997 in this case.
Image showing formula for calculating correlation coefficient entered
Image showing correlation coefficient for the data set displayed on pressing ENTER key
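The four-step calculation described above can also be carried out directly; here it is sketched in Python with the same X and Y data (the 1/n factors cancel, so using population or sample formulas gives the same r):

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5, 7, 8, 9, 11, 12, 13, 14, 16, 17]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Step 3: covariance; steps 2 and 4: divide by the
# product of the two standard deviations
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)

r = cov / (sd_x * sd_y)
print(round(r, 3))  # approximately 0.997
```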
Chi-Squared Test:
The chi-squared test is a statistical test used to determine whether there is a significant association
between two categorical variables. It is used to test the null hypothesis that there is no association
between the variables, and the alternative hypothesis that there is a significant association.
The chi-squared test works by comparing the observed frequencies of each category in the two
variables with the expected frequencies, which are calculated based on the assumption that there is
no association between the variables. The test produces a chi-squared statistic, which measures the
difference between the observed and expected frequencies, and a p-value, which indicates the probability of obtaining the observed results if the null hypothesis is true.
If the p-value is less than a predetermined significance level (usually 0.05), the null hypothesis is
rejected, and it is concluded that there is a significant association between the variables.
The chi-squared test can be used in many applications, such as in medical research to test the association between a disease and a risk factor, in social science research to test the association between
demographic variables, or in market research to test the association between customer preferences
and product features.
In summary, the chi-squared test is a powerful statistical tool used to analyze categorical data, to determine whether there is an association between two variables, and to test hypotheses about population parameters.
Here is a sample data set for using the chi-squared test:
Suppose you are interested in determining whether there is a significant association between gender
and smoking status among a group of 100 people. The data is presented in a contingency table, as
follows:
Image showing the sample data set
To perform a chi-squared test on this data using Excel, follow these steps:
1. Enter the observed frequencies into a contingency table in an Excel worksheet.
2. In a separate range, calculate the expected frequency for each cell as (row total × column total) / grand total.
3. Select an empty cell and type “=CHISQ.TEST(actual_range, expected_range)”, where “actual_range” contains the observed frequencies and “expected_range” the expected frequencies (excluding the totals).
4. Press “Enter”. Excel returns the p-value of the test; if the chi-squared statistic itself is needed, it can be recovered with “=CHISQ.INV.RT(p_value, degrees_of_freedom)”.
In this example, the chi-squared statistic is 7.143, with 1 degree of freedom and a p-value of 0.0075.
Since the p-value is less than 0.05, we can reject the null hypothesis and conclude that there is a significant association between gender and smoking status in the population.
Odds Ratio:
Odds ratio (OR) is a statistical measure that compares the odds of an event occurring in one group
to the odds of the same event occurring in another group. It is commonly used in epidemiology,
medical research, and other fields to investigate the association between a risk factor or exposure and
a disease or outcome.
The odds ratio is calculated as the ratio of the odds of an event occurring in the exposed group to
the odds of the same event occurring in the unexposed group. The odds of an event occurring can be
calculated as the number of people who experience the event divided by the number of people who
do not experience the event.
For example, suppose a study is conducted to investigate the association between smoking and lung
cancer. The odds of developing lung cancer among smokers might be calculated as the number of
smokers who develop lung cancer divided by the number of smokers who do not develop lung cancer. The odds of developing lung cancer among non-smokers might be calculated in the same way.
The odds ratio is a useful measure because it allows researchers to quantify the strength of the association between a risk factor or exposure and a disease or outcome. An odds ratio greater than 1
indicates that the risk factor or exposure is associated with an increased risk of the disease or outcome, while an odds ratio less than 1 indicates that the risk factor or exposure is associated with a
decreased risk of the disease or outcome.
Here is an example of how to calculate the odds ratio in Excel:
Suppose we have the following data from a study investigating the association between smoking and
lung cancer:
                 Smokers    Non-Smokers    Total
Lung Cancer         50           10          60
No Lung Cancer     150          390         540
Total              200          400         600
To calculate the odds ratio of developing lung cancer for smokers compared to non-smokers, we
would follow these steps:
Step 1: Calculate the odds of developing lung cancer for smokers and non-smokers.
The odds of developing lung cancer for smokers is 50/150 = 0.33.
The odds of developing lung cancer for non-smokers is 10/390 ≈ 0.026.
Step 2: Calculate the odds ratio.
The odds ratio of developing lung cancer for smokers compared to non-smokers is 0.333/0.026 ≈ 13.
Therefore, the odds of developing lung cancer are about 13 times higher for smokers than for
non-smokers in this study.
To calculate odds ratio in Excel, you can use the following steps:
Step 1: Enter your data into Excel, including the number of cases and non-cases in each group.
Step 2: Calculate the odds of the outcome occurring in each group using the formula odds = cases/
non-cases.
Step 3: Calculate the odds ratio using the formula odds ratio = odds of exposed group/odds of unexposed group.
Step 4: Interpret the results based on the calculated odds ratio. If the odds ratio is greater than 1, it
indicates that the exposure is associated with an increased risk of the outcome. If the odds ratio is
less than 1, it indicates that the exposure is associated with a decreased risk of the outcome.
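The four steps above can be sketched directly in Python, using the counts from the lung-cancer table earlier in this section (the variable names are illustrative):

```python
# Odds ratio for the smoking / lung-cancer table in the text.
cases_exposed = 50        # smokers with lung cancer
noncases_exposed = 150    # smokers without lung cancer
cases_unexposed = 10      # non-smokers with lung cancer
noncases_unexposed = 390  # non-smokers without lung cancer

# Step 2: odds of the outcome in each group (cases / non-cases)
odds_exposed = cases_exposed / noncases_exposed        # 50/150 ≈ 0.333
odds_unexposed = cases_unexposed / noncases_unexposed  # 10/390 ≈ 0.026

# Step 3: odds ratio = odds of exposed group / odds of unexposed group
odds_ratio = odds_exposed / odds_unexposed

print(round(odds_ratio, 1))  # prints 13.0
```

Since the odds ratio is well above 1, the exposure (smoking) is associated with an increased risk of the outcome, as described in step 4.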
Regression Analysis:
Regression analysis is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It is commonly used in fields such as economics, social
sciences, and business to investigate the relationship between variables and to make predictions
about future values of the dependent variable.
The goal of regression analysis is to find the best-fitting line or curve that describes the relationship
between the dependent variable and one or more independent variables. The line or curve is called
a regression equation, and it can be used to predict the value of the dependent variable for a given
value of the independent variable(s).
There are two main types of regression analysis: simple linear regression and multiple linear regression. Simple linear regression involves only one independent variable, while multiple linear regression involves two or more independent variables.
The regression equation is typically represented in the form of Y = a + bX, where Y is the dependent
variable, X is the independent variable, a is the intercept (the value of Y when X is 0), and b is the
slope (the change in Y for each unit change in X).
To perform regression analysis, you would typically start by collecting your data and entering it into
a statistical software program, such as Excel or SPSS. Then, you would choose the appropriate regression model based on the number of independent variables and the type of relationship you expect
between the variables. Finally, you would analyze the results of the regression analysis to determine
the strength of the relationship between the variables and to make predictions about future values of
the dependent variable based on different values of the independent variable(s).
Here is an example of how to perform regression analysis in Excel:
Suppose we have the following data that shows the relationship between the number of hours studied
and the exam score for a group of students:
Hours Studied    Exam Score
2                60
3                65
4                75
5                80
6                85
7                90
To perform regression analysis on this data, we would follow these steps:
Step 1: Enter your data into Excel, with the independent variable (hours studied) in one column and
the dependent variable (exam score) in another column.
Step 2: Create a scatter plot of the data by selecting the data range and clicking on the “Insert” tab,
then selecting “Scatter” and choosing the “Scatter with Straight Lines and Markers” option.
Step 3: Calculate the regression equation by selecting the data range and clicking on the “Data Analysis” tab. Choose “Regression” from the list of analysis tools, and enter the range of your independent
and dependent variables. Make sure to check the “Labels” box if you have column headers in your
data. Select “Output Range” and enter the cell where you want the regression output to appear.
Step 4: Interpret the results of the regression analysis. The output will show the equation of the
regression line, as well as the R-squared value, which measures how well the regression line fits the
data. The R-squared value ranges from 0 to 1, with higher values indicating a better fit. You can also
use the regression equation to predict the exam score for a given number of hours studied.
In this example, the regression equation is Y = 48.19 + 6.14X, where Y is the exam score and X is the number of hours studied. The R-squared value is 0.98, indicating a strong positive relationship between hours studied and exam score.
That’s how you can perform regression analysis in Excel.
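As a cross-check on the Excel output, the same least-squares fit can be computed by hand. This short Python sketch reproduces the slope, intercept, and R-squared for the hours-studied data:

```python
# Ordinary least-squares fit for the hours-studied / exam-score data,
# reproducing what Excel's Regression tool reports.
xs = [2, 3, 4, 5, 6, 7]        # hours studied (independent variable X)
ys = [60, 65, 75, 80, 85, 90]  # exam score (dependent variable Y)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squares and cross-products about the means
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

slope = sxy / sxx                    # b in Y = a + bX
intercept = mean_y - slope * mean_x  # a in Y = a + bX

# R-squared: share of the variance in Y explained by the regression
ss_tot = sum((y - mean_y) ** 2 for y in ys)
ss_reg = slope * sxy
r_squared = ss_reg / ss_tot

print(round(intercept, 2), round(slope, 2), round(r_squared, 2))  # 48.19 6.14 0.98
```

The fitted equation Y = 48.19 + 6.14X can then be used to predict the exam score for any number of hours studied, e.g. about 79 for 5 hours.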
Image showing Data entered into Excel and scatterplot created
Image showing Data Analysis menu where Regression is chosen. Data Analysis tab can be found on
clicking Data tab.
Image showing the Regression Analysis window. The Input Y Range field is filled with the cell addresses containing the data to be plotted on the y-axis, and the Input X Range field with the cell addresses of the data for the x-axis. If labels are included in the selection, the box in front of Labels should be checked. It is ideal to display the result in a new worksheet by clicking the New Worksheet radio button.
Image showing the result displayed in a new worksheet.
Using the Descriptive Statistics function of Excel for data analysis:
Excel has a predefined function for applying descriptive statistics to a dataset. All the important components of descriptive statistics are available under the Descriptive Statistics option listed under the Data Analysis tab. The installation of the Data Analysis ToolPak in Excel was described in an earlier chapter; this add-in needs to be installed in order to use this function.
The first step in this process is data entry. The data that needs to be analyzed is entered into a column of the Excel spreadsheet, as shown in the image below.
Image showing Data entered into spreadsheet in a column
Image showing the Data Analysis tab, which is clicked to bring up the Data Analysis menu; Descriptive Statistics is selected and the OK button is pressed.
Image showing Descriptive statistics window
In the Descriptive Statistics window shown above, the mouse cursor is placed in the Input Range field. As soon as it starts to blink, the dataset that needs to be analyzed is selected. The user can select the dataset along with its header; if the header is included in the selection, the button in front of Labels in First Row is checked. The cursor is next placed in the Output Range field and clicked. When the cursor starts to blink, the cells where the results need to be displayed are selected, and their addresses appear in the field. The other boxes are selected as shown in the figure above. On clicking the OK button, the results are displayed starting at the cell selected as the output range.
Image showing the results of Descriptive Statistics
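For readers who want to verify Excel's output, the core measures reported by the Descriptive Statistics tool can be reproduced with Python's standard library. The dataset below is illustrative, not the one shown in the images:

```python
# Summary measures like those reported by Excel's Descriptive Statistics tool.
import statistics

data = [12, 18, 22, 28, 31, 35, 40]  # illustrative sample data

summary = {
    "Mean": statistics.mean(data),
    "Median": statistics.median(data),
    "Standard Deviation": statistics.stdev(data),  # sample st. dev., like STDEV.S
    "Sample Variance": statistics.variance(data),  # like VAR.S
    "Range": max(data) - min(data),
    "Minimum": min(data),
    "Maximum": max(data),
    "Sum": sum(data),
    "Count": len(data),
}

for name, value in summary.items():
    print(f"{name}: {value:.4g}")
```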
11
Chi-Square Test
The chi-square test is a statistical hypothesis test that is used to determine whether there is a significant difference between observed and expected frequencies of two or more categorical variables.
The test involves calculating the sum of the squared differences between the observed frequencies and
the expected frequencies, divided by the expected frequencies. The resulting statistic, called the chi-square statistic (χ²), is compared to a critical value based on the degrees of freedom and the desired
level of significance.
If the calculated chi-square value is greater than the critical value, then the null hypothesis (that there
is no significant difference between the observed and expected frequencies) is rejected, indicating that
there is a significant difference between the observed and expected frequencies. If the calculated chi-square value is less than the critical value, then the null hypothesis cannot be rejected.
The chi-square test is commonly used in fields such as social science, biology, and marketing research
to analyze survey data, to determine whether there are significant differences between groups or variables, and to test the goodness-of-fit of models to data.
There are two main types of chi-square tests: the chi-square test of independence and the chi-square
goodness-of-fit test.
1. Chi-Square Test of Independence:
The chi-square test of independence is used to determine whether there is a significant association
between two categorical variables. This test is used to examine whether the two variables are independent or not. The test involves comparing the observed frequency distribution to the expected frequency distribution. The expected frequency distribution is calculated based on the assumption that the
two variables are independent.
2. Chi-Square Goodness-of-Fit Test:
The chi-square goodness-of-fit test is used to determine whether sample data follows a specific theoretical distribution. The test involves comparing the observed frequency distribution to the expected frequency distribution. The expected frequency distribution is calculated based on a specific theoretical distribution, such as the normal distribution or the Poisson distribution. The test is used to determine whether the observed data fits the expected distribution.
In addition to these two main types, there are other variations of the chi-square test, such as the chi-square test for homogeneity and the McNemar test, which are used in specific situations where there
are more than two categorical variables involved.
Here is an example of using the chi-square test of independence with sample data:
Suppose we want to investigate the relationship between gender and favorite color among a group of
100 people. The data collected is summarized in the following table:
        Male    Female    Total
Red      10       20        30
Blue     20       25        45
Green    15       10        25
Total    45       55       100
To perform the chi-square test of independence in Excel, we can follow these steps:
1. Enter the data into an Excel worksheet. In this example, we will enter the data into cells A1:D5.
2. Calculate the expected values for each cell in the table. To do this, we will use the formula:
(row total x column total) / grand total
3. Calculate the chi-square statistic by using the formula:
Σ [(observed - expected)² / expected]
4. Determine the degrees of freedom (df) for the test. The df is calculated as:
(number of rows - 1) x (number of columns - 1)
5. Determine the p-value of the test using a chi-square distribution table or the Excel function CHISQ.DIST.RT.
Here is how to perform these steps in Excel:
1. Enter the data into an Excel worksheet. In this example, the labels occupy row 1 and column A, the observed counts are in cells B2:C4, and the row, column, and grand totals are in column D and row 5.
2. Calculate the expected value for each cell in a separate range, say F2:G4. Enter the following formula in cell F2 and copy it across to G2 and down to row 4:
=($D2*B$5)/$D$5
3. Calculate the chi-square contribution, (observed - expected)² / expected, for each cell in another range, say I2:J4. Enter the following formula in cell I2 and copy it across and down:
=(B2-F2)^2/F2
4. Sum the contributions to obtain the chi-square statistic. In cell I6, enter:
=SUM(I2:J4)
5. Determine the degrees of freedom (df) for the test. The df is calculated as:
(number of rows - 1) x (number of columns - 1)
In this case, df = (3-1) x (2-1) = 2.
6. Determine the p-value of the test using the Excel function CHISQ.DIST.RT. Enter the following formula in cell I7:
=CHISQ.DIST.RT(I6,2)
This will give you the p-value for the chi-square test.
In this example, the chi-square statistic works out to about 3.93, the degrees of freedom are 2, and the p-value is about 0.14. Since the p-value is greater than the significance level of 0.05, we cannot reject the null hypothesis; the data do not provide evidence of a significant association between gender and favorite color.
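The calculation can be checked with a short script. This Python sketch recomputes the expected counts and the chi-square statistic from the observed gender/colour table; for df = 2 the right-tail chi-square probability has the closed form e^(−χ²/2), matching Excel's CHISQ.DIST.RT(χ², 2):

```python
# Chi-square test of independence for the gender / favourite-colour table.
import math

observed = {
    "Red":   {"Male": 10, "Female": 20},
    "Blue":  {"Male": 20, "Female": 25},
    "Green": {"Male": 15, "Female": 10},
}

row_totals = {colour: sum(counts.values()) for colour, counts in observed.items()}
col_totals = {"Male": sum(c["Male"] for c in observed.values()),
              "Female": sum(c["Female"] for c in observed.values())}
grand = sum(row_totals.values())

# Sum of (observed - expected)^2 / expected over all cells
chi_sq = 0.0
for colour, counts in observed.items():
    for gender, obs in counts.items():
        expected = row_totals[colour] * col_totals[gender] / grand
        chi_sq += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (2 - 1)  # (rows - 1) x (columns - 1) = 2

# Right-tail probability; exact for df = 2
p_value = math.exp(-chi_sq / 2)

print(round(chi_sq, 2), df, round(p_value, 3))  # 3.93 2 0.14
```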
Example data for Chi-square Test:
One hundred male and 100 female volunteers were questioned about their smoking habits. They were classified into two groups (smokers and non-smokers). The intention of the study is to look for the presence or absence of an association between the gender variable and smoking. This can be achieved by performing the chi-square test of independence.
The first step would be to enter the data into the Spreadsheet as rows and columns.
The next step would be to calculate the sum of both the rows and columns. This can be performed in Excel using the SUM function, for example =SUM(B2:C2) for a row total.
Image showing the data entered into spreadsheet in rows and columns
Image showing the formula entered to calculate the total number of smokers both male and female
Image showing the total number of smokers and non-smokers calculated
Image showing column and row total entered
Image showing Expected value of smokers and non-smokers that need to be calculated.
Formula used is Expected value = (Row total x Column total)/ Overall total
Image showing formula entered to calculate the expected number of smokers in the population. On
pressing ENTER key the number 22.5 would be displayed.
Image showing formula entered to calculate the expected number of non-smokers. On pressing the
Enter Key the value would be displayed inside the cell as shown below.
Image showing the expected number of non-smokers displayed
Image showing expected value of both smokers and non-smokers calculated
In the next step the following formula should be used:
(Observed value - Expected value)² / Expected value
This value should be calculated for both smokers and non-smokers.
Image showing the calculation
Image showing calculation completed for smokers
Image showing calculations complete for both smokers and non-smokers. This can easily be completed by pulling the handle (red circle) in horizontal or vertical direction as the need may be.
Image showing χ² calculated using the formula entered
Image showing the χ² value displayed on pressing the Enter key
Image showing formula for calculating P value entered into a cell
Image showing P value displayed on pressing Enter key
The following formula is used to calculate df:
df = (Number of rows - 1) X (Number of columns - 1)
Before starting to analyze the result the Null hypothesis and Alternate hypothesis should be identified.
Null hypothesis: There is no association between gender and smoking status.
Alternative hypothesis: There is an association between gender and smoking status.
Result:
Null hypothesis is rejected if p < 0.05.
Null hypothesis is not rejected if p ≥ 0.05.
Since the p-value in this example is 0.027712, which is less than 0.05, the null hypothesis is rejected.
12
Exponential Smoothing
Exponential smoothing is a statistical method used for analyzing time series data. It is a forecasting
technique that is used to make future predictions based on past data. The method works by assigning exponentially decreasing weights to past observations, with more recent observations receiving higher
weights than older ones.
The basic idea of exponential smoothing is to estimate the next data point in a time series as a weighted average of all past data points, with the weights decreasing exponentially as the observations get
older. The weights are determined by a smoothing parameter, which controls the rate at which the
weights decrease. A smaller smoothing parameter gives more weight to past observations, while a
larger smoothing parameter gives more weight to recent observations.
Exponential smoothing is particularly useful when the time series exhibits a trend or a seasonality.
There are several variations of exponential smoothing, including simple exponential smoothing, double exponential smoothing, and triple exponential smoothing, also known as the Holt-Winters method. These variations differ in how they incorporate trend and seasonality components into the model.
Criteria for using Exponential smoothing:
Exponential smoothing is a statistical method that can be used for time series forecasting. Here are
some criteria for using exponential smoothing:
1. Stationarity: The time series data should be stationary, which means that its statistical properties
such as mean, variance, and autocorrelation remain constant over time. If the data is not stationary,
it may require pre-processing such as differencing or transformation before applying exponential
smoothing.
2. Absence of Outliers: Exponential smoothing assumes that the time series data does not have any
outliers. Outliers are extreme values that are significantly different from the other values in the data.
Outliers can affect the forecast accuracy and may need to be removed or corrected before applying
exponential smoothing.
3. Consistency of the data: Exponential smoothing assumes that the data is consistent over time. This
means that the pattern of the data should remain consistent and stable over time. If there are significant changes in the pattern of the data, such as sudden spikes or drops, it may require adjustments or
special considerations.
4. Sufficient historical data: Exponential smoothing requires a sufficient amount of historical data to
make accurate forecasts. The amount of historical data needed depends on the level of complexity of
the model and the frequency of the data.
5. Understanding of the underlying data: It is important to have a good understanding of the underlying data and its trends before applying exponential smoothing. This helps in choosing the appropriate
smoothing parameters and detecting any anomalies or outliers that may affect the forecast accuracy.
Performing Exponential smoothing using Excel:
Exponential smoothing can be performed in Excel using the built-in functions. Here are the steps to
perform exponential smoothing in Excel:
1. Enter the historical data into a column in Excel.
2. Calculate the initial value for the smoothed data. This can be done by taking the average of the first
few data points in the historical data.
3. Use the Exponential Smoothing function in Excel to calculate the smoothed values. The Exponential Smoothing function is included in the Data Analysis Toolpak add-in. To use this function, go to
the Data tab, click on Data Analysis, and select Exponential Smoothing from the list.
4. In the Exponential Smoothing dialog box, select the input range for the historical data, the output range for the smoothed data, and the smoothing parameter (alpha). The alpha value is a number
between 0 and 1 that determines the rate at which the weights decrease. A smaller alpha value gives
more weight to past observations, while a larger alpha value gives more weight to recent observations.
5. Click OK to calculate the smoothed data. The results will be displayed in the output range that was
selected in step 4.
6. Optionally, you can create a chart to visualize the historical data and the smoothed data.
Note: If the Data Analysis Toolpak is not installed in Excel, it can be installed by going to the File
tab, selecting Options, selecting Add-Ins, and clicking on the Excel Add-ins drop-down menu. Then,
select the Data Analysis Toolpak and click OK.
Here’s a sample dataset that you can use to perform exponential smoothing:
Time Period    Data
1              12
2              18
3              22
4              28
5              31
6              35
7              40
To use Excel for exponential smoothing, follow these steps:
1. Enter the sample data into an Excel spreadsheet, with one column for the time periods and another column for the data.
2. Calculate the initial value for the smoothed data. For simple exponential smoothing, the initial
smoothed value is simply the first data point in the dataset. In this example, the initial smoothed
value is 12.
3. Click on the “Data” tab in Excel and select “Data Analysis” in the “Analysis” group.
4. In the “Data Analysis” dialog box, select “Exponential Smoothing” and click “OK”.
5. In the “Exponential Smoothing” dialog box, enter the input range for the data (in this example,
A2:B8), the output range for the smoothed data (e.g., D2:D8), and the smoothing constant (alpha).
For simple exponential smoothing, alpha typically ranges from 0.1 to 0.3, and a value of 0.2 is commonly used. In this example, we will use alpha=0.2.
6. Click “OK” to generate the smoothed data. The smoothed data will be displayed in the output
range that was selected in step 5.
7. Optionally, you can create a chart to visualize the historical data and the smoothed data. To do
this, select the data range (e.g., A2:D8), click on the “Insert” tab in Excel, select the desired chart
type, and follow the chart wizard to customize the chart as desired.
That’s it! You have now performed exponential smoothing using Excel on a sample dataset. You can
modify the input data and smoothing constant to experiment with different scenarios and see how
the smoothed data changes.
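The smoothing recursion itself is short. This Python sketch applies simple exponential smoothing with alpha = 0.2 to the sample dataset; note that Excel's tool asks for the damping factor (1 − alpha) and lags its output by one period, so its column may differ slightly from the series computed here:

```python
# Simple exponential smoothing: s(1) = x(1), s(t) = alpha*x(t) + (1-alpha)*s(t-1).
alpha = 0.2
data = [12, 18, 22, 28, 31, 35, 40]  # sample dataset from the text

smoothed = [data[0]]  # initial smoothed value = first observation
for x in data[1:]:
    # New value blends the current observation with the previous smoothed value
    smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])

print([round(s, 2) for s in smoothed])
```

A smaller alpha produces a smoother, slower-reacting series; a larger alpha tracks recent observations more closely, exactly as described in step 4 of the dialog instructions.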
Image showing data entered
Image showing Analysis Tools menu from where Exponential Smoothing is chosen.
As shown in the above image, the cursor is placed in the input range field. When the cursor starts to blink, the data columns are chosen and their addresses are automatically entered into this field. If the labels were also selected, the box before the Labels option should be ticked; if not, it is left unchecked.
Note that Excel asks for the damping factor, which is 1 - alpha; with a smoothing constant of alpha = 0.2, the damping factor to enter is 0.8.
In the output range field the cursor is placed and clicked. When the cursor starts to blink, the cells where the user wants the results to be displayed are selected. On selection, the specific cell addresses are automatically entered into this field.
The chart output box should be checked if the user desires to create a chart. On clicking the OK button, the result and the chart will be displayed in the output cells selected.
Image showing the results of Exponential smoothing displayed
To interpret the results of exponential smoothing, you can compare the smoothed data to the
original data to see how well the smoothed data represents the underlying trend. In this example,
we can see that the smoothed data generally follows the upward trend of the original data, but with
less variability. This is because exponential smoothing places more weight on recent data points,
which smooths out any short-term fluctuations and highlights the underlying trend. The resulting
smoothed data can be useful for forecasting future values, as it captures the underlying trend while
filtering out any noise or randomness in the data.
Exponential smoothing is a widely used statistical technique that is applied to a dataset for the
following reasons:
1. Smoothing: Exponential smoothing is used to smooth out fluctuations in a time series dataset.
It helps to remove any short-term variations in the data and highlight the underlying trend. This
makes it easier to identify patterns and trends in the data, and provides a more accurate representation of the underlying behavior of the time series.
2. Forecasting: Exponential smoothing is also used for forecasting future values in a time series
dataset. By smoothing out the historical data and identifying the underlying trend, exponential
smoothing can be used to make predictions about future values. This is particularly useful when
there is uncertainty or variability in the data, as exponential smoothing can provide a more reliable
estimate of future values.
3. Comparing and evaluating models: Exponential smoothing can be used to compare and evaluate different forecasting models. By applying different smoothing parameters and comparing the
resulting smoothed data to the original data, it is possible to determine which model provides the
best fit for the data. This can help to improve the accuracy of future forecasts and reduce errors.
4. Decision making: Exponential smoothing can be used to make informed decisions based on
historical data. By analyzing trends and patterns in the data, it is possible to identify opportunities
for improvement and make data-driven decisions. This is particularly useful in industries such as
finance, supply chain management, and healthcare, where accurate forecasts and data-driven decision making are critical to success.
13
F-Test Two-Sample for Variances
The F-test two-sample for variances is a statistical test that is used to compare the variances of two
independent samples. The test is based on the F-distribution and is used to determine whether the
variances of the two samples are significantly different from each other.
The F-test two-sample for variances is commonly used in experimental studies where two groups are
compared to determine whether there is a significant difference in the variability of their measurements. For example, in a study comparing the effectiveness of two different drugs, the F-test two-sample for variances can be used to determine whether the variability of the response to the drugs is
significantly different between the two groups.
The null hypothesis of the F-test two-sample for variances is that the variances of the two populations
are equal. The alternative hypothesis is that the variances are not equal. The test statistic for the F-test
is calculated as the ratio of the sample variances:
F = s1^2 / s2^2
where s1^2 is the variance of the first sample and s2^2 is the variance of the second sample. The F-statistic is then compared to a critical value obtained from the F-distribution with degrees of freedom
equal to n1-1 and n2-1, where n1 and n2 are the sample sizes of the two groups.
If the calculated F-statistic is greater than the critical value, then the null hypothesis is rejected, and it
can be concluded that the variances of the two populations are significantly different. If the calculated
F-statistic is less than or equal to the critical value, then the null hypothesis cannot be rejected, and it
is assumed that the variances of the two populations are equal.
The F-test is a statistical test that is used to compare the variances of two populations or to test the significance of the overall fit of a multiple regression model. It is named after its inventor, Sir Ronald A. Fisher.
The F-test is based on the F-statistic, which is the ratio of two variances or sums of squares, and it
follows an F-distribution under the null hypothesis. The null hypothesis is that the two populations
have equal variances or that the multiple regression model does not provide a better fit than a simpler
model.
The F-test is commonly used in ANOVA (analysis of variance) to compare the means of two or more
groups, and it can also be used in regression analysis to test the significance of the overall model or the
contribution of individual predictors to the model.
In summary, the F-test is a powerful statistical tool that helps to determine whether the differences
between two groups or the overall fit of a model are statistically significant.
Here’s an example of a sample data set for the F-test two-sample for variances analysis:
Group 1: 5, 8, 7, 9, 6
Group 2: 4, 2, 3, 5, 1
To perform the F-test two-sample for variances analysis in Excel, you can follow these steps:
1. Open Microsoft Excel and enter the data for both groups into two columns. In this example, enter
the data for Group 1 into column A and the data for Group 2 into column B.
2. Calculate the sample variances for both groups using the VAR.S function. In cell C1, enter the
formula “=VAR.S(A1:A5)” and press Enter. This will calculate the sample variance for Group 1. In cell
D1, enter the formula “=VAR.S(B1:B5)” and press Enter. This will calculate the sample variance for
Group 2.
3. Calculate the F-test statistic by dividing the larger sample variance by the smaller sample variance.
In cell E1, enter the formula “=MAX(C1,D1)/MIN(C1,D1)” and press Enter. This will calculate the
F-test statistic for the two groups.
4. Determine the degrees of freedom for the F-test using the COUNT function. In cell F1, enter the
formula “=COUNT(A1:A5)-1” and press Enter. This will calculate the degrees of freedom for Group
1. In cell G1, enter the formula “=COUNT(B1:B5)-1” and press Enter. This will calculate the degrees
of freedom for Group 2.
5. Calculate the p-value for the F-test using the F.DIST.RT function. In cell H1, enter the formula “=F.DIST.RT(E1,F1,G1)” and press Enter. This will calculate the p-value for the F-test. Note that the first degrees-of-freedom argument should belong to the group with the larger variance; here both groups have the same size, so F1 and G1 are interchangeable.
6. Interpret the results by comparing the p-value to the significance level (e.g. 0.05). If the p-value is
less than the significance level, you can reject the null hypothesis and conclude that the variances of
the two groups are significantly different. If the p-value is greater than the significance level, you fail to
reject the null hypothesis and conclude that there is not enough evidence to support the claim that the
variances of the two groups are different.
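Steps 2 to 4 can be mirrored in a short Python sketch using the two groups above; it computes the sample variances, the F statistic (larger variance over smaller), and the degrees of freedom, with the p-value then coming from Excel's F.DIST.RT:

```python
# F-test two-sample for variances on the two groups from the text.
import statistics

group1 = [5, 8, 7, 9, 6]
group2 = [4, 2, 3, 5, 1]

var1 = statistics.variance(group1)  # sample variance, like Excel's VAR.S
var2 = statistics.variance(group2)

# Larger variance over smaller, as in step 3 above
f_stat = max(var1, var2) / min(var1, var2)
df1 = len(group1) - 1
df2 = len(group2) - 1

print(var1, var2, f_stat, df1, df2)  # 2.5 2.5 1.0 4 4
```

For this particular dataset both sample variances equal 2.5, so F = 1 and the test gives no evidence of a difference in variability between the groups.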
Mastering Statistical Analysis with Excel
Image showing data entered into spreadsheet
Image showing variance of first group calculated. Note the formula entered
Prof. Dr Balasubramanian Thiagarajan MS D.L.O.
Image showing variance for Group 2 calculated. Formula could be seen entered
Image showing variance of both groups calculated
Image showing F test statistic calculation formula entered. On pressing the Enter key the value would be displayed
Image showing F value calculated
Image showing calculation of degree of freedom entered
Example 2:
Here two groups of data are taken into consideration. Group A and Group B.
The data is entered into Excel spread sheet as shown below in two columns.
Image showing data entered into two columns
Calculation of sample variance for both these groups.
Formula used to calculate variance of Group A: =VAR.S(A2:A11)
Formula used to calculate variance of Group B: =VAR.S(B2:B11)
Calculate the F-statistic by dividing the larger sample variance by the smaller sample variance. In this case, Group B has the larger variance, so we would use the formula =VAR.S(B2:B11)/VAR.S(A2:A11).
Image showing formula for calculating variance of Group A. On pressing Enter key the data will be
displayed
Image showing variance of Group A entered and formula for calculating variance of Group B entered
Image showing variance of both Group A and B displayed in column D.
Image showing formula for calculating F statistic displayed. On pressing Enter Key the F value
would be displayed
Image showing F value displayed (Red circle)
Calculate the degrees of freedom for each group using the COUNT function minus one. In this case,
we would use COUNT(A2:A11)-1 for Group A and COUNT(B2:B11)-1 for Group B.
The F-distribution uses these two values directly: the numerator degrees of freedom belong to the group with the larger variance, and the denominator degrees of freedom to the other group. With ten observations per group, both equal COUNT(A2:A11)-1 = COUNT(B2:B11)-1 = 9.
Image showing Degrees of Freedom for both Group A and B displayed.
Image showing calculating degrees of freedom for F-distribution
Use the F.DIST.RT function to find the p-value for the F-statistic. In this case, we would use =F.DIST.RT(F-statistic, degrees of freedom numerator, degrees of freedom denominator).
Interpret the results by comparing the p-value to the significance level. If the p-value is less than the
significance level, then we reject the null hypothesis that the variances are equal. If the p-value is
greater than the significance level, then we fail to reject the null hypothesis.
Image showing formula to calculate F statistic entered
Image showing F statistic value displayed (red circle)
Image showing formula for calculating P value entered
Image showing P value displayed (green circle) when Enter Key is pressed.
Since the P value is greater than the significance level of 0.05, the null hypothesis cannot be rejected.
14
Fourier analysis
Fourier analysis is a mathematical technique for decomposing complex signals or functions into simpler components that are easier to understand and manipulate. The technique is named after Joseph Fourier, a French mathematician and physicist.
The basic idea of Fourier analysis is that any complex signal or function can be represented as a
sum of sine and cosine waves of different frequencies. This representation is known as a Fourier
series. The Fourier series provides a way to analyze the frequency content of a signal and extract
useful information from it.
Fourier analysis is used in a wide range of applications, including signal processing, data compression, and image analysis. It is a fundamental tool in many branches of science and engineering, with practical applications in fields such as telecommunications, acoustics, and optics.
In summary, Fourier analysis is a powerful mathematical technique that enables the analysis of
complex signals and functions by decomposing them into simpler components.
Using Excel to perform Fourier analysis:
Performing Fourier analysis using Excel involves using the built-in tools of the program to calculate
the Fourier coefficients of a time-domain signal, and then using those coefficients to reconstruct the
signal in the frequency domain.
Here are the steps to perform Fourier analysis in Excel:
1. Input your time-domain data into an Excel spreadsheet, with one column representing the time
values and another column representing the signal values.
2. Highlight the signal values column and select the “Data” tab from the Excel ribbon. Then select
“Data Analysis” and choose “Fourier Analysis” from the list of analysis tools.
3. In the Fourier Analysis dialog box that appears, select the input range for your signal data, as well as
the output range where you want the Fourier coefficients to be placed.
4. Click “OK” to run the Fourier Analysis tool. The output range will now contain the Fourier coefficients for your signal.
5. To examine the frequency content, convert each complex Fourier coefficient to a magnitude with the IMABS function, for example =IMABS(D2) where D2 holds the first coefficient, and copy the formula down alongside the output range. (To reconstruct the time-domain signal later, run the Fourier Analysis tool again on the coefficients with its "Inverse" box checked.)
6. Finally, plot the magnitudes using a line chart, with the frequency values on the x-axis and the amplitude values on the y-axis.
Note that Excel’s Fourier Analysis tool is limited to analyzing one-dimensional time-series data. If you
have multidimensional data or want more advanced Fourier analysis capabilities, you may need to use
specialized software or programming languages such as MATLAB or Python.
Here’s an example data set you can use for performing Fourier analysis in Excel:
Time (sec)
0
1
2
3
4
5
6
7
Signal
1
2
3
4
5
4
3
2
And here are the steps to perform Fourier analysis using Excel:
1. Input the time-domain data into an Excel spreadsheet, with one column representing the time
values and another column representing the signal values.
2. Highlight the signal values column and select the “Data” tab from the Excel ribbon. Then select
“Data Analysis” and choose “Fourier Analysis” from the list of analysis tools.
3. In the Fourier Analysis dialog box that appears, select the input range for your signal data (in this example, the signal values in B2:B9), as well as the output range where you want the Fourier coefficients to be placed (for example, D2:D9).
4. Note that the tool requires the number of data points to be a power of 2, which this 8-point example satisfies. The coefficients are returned as complex numbers, covering both positive and negative frequencies.
5. Click "OK" to run the Fourier Analysis tool. The output range will now contain the Fourier coefficients for your signal.
6. To examine the frequency content, convert each complex coefficient to a magnitude with the IMABS function. For example, enter =IMABS(D2) in a new column, where column D holds the Fourier coefficients you obtained in step 5, and copy the formula down.
7. Finally, plot the magnitudes in the frequency domain using a line chart, with the frequency values on the x-axis and the amplitude values on the y-axis.
That’s it! With these steps, you can perform Fourier analysis using Excel. Note that you can adjust the
size of the input and output ranges as needed for your specific data set.
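The 8-point example above can also be checked in Python, one of the tools suggested earlier for more advanced work. The sketch below uses a plain O(n²) discrete Fourier transform rather than a fast FFT, which keeps the code short and, unlike Excel's tool, places no power-of-2 restriction on the input length:

```python
import cmath

def dft(signal):
    # Discrete Fourier transform: X_k = sum_t x_t * exp(-2*pi*i*k*t/n).
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

signal = [1, 2, 3, 4, 5, 4, 3, 2]      # the 8-point example dataset
coeffs = dft(signal)
magnitudes = [abs(c) for c in coeffs]  # what =IMABS(...) returns in Excel
print([round(m, 3) for m in magnitudes])
```

Bin 0 holds the sum of the samples, and because the input is real the magnitudes are mirror-symmetric about the middle bin, just as the two-sided Excel output is.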
Step by step approach to fourier analysis using Excel:
First, let’s define the example dataset we’ll be using:
Time (s)
0
1
2
3
4
5
6
7
8
9
Signal Value
1.0
0.7
-0.3
-1.0
-0.5
0.5
1.0
0.5
-0.5
-1.0
This dataset represents a periodic signal with a frequency of approximately 0.2 Hz.
Here are the steps to perform Fourier analysis on this dataset using Excel:
1. Open a new Excel spreadsheet and enter the time and signal value data into two columns.
2. Excel has no FFT worksheet function, so use the Fourier Analysis tool instead (Data > Data Analysis > Fourier Analysis). Because the tool requires a power-of-2 number of points, pad the 10 signal values with zeros to 16 points before selecting them as the input range, and choose an output range for the coefficients.
3. In a further column, convert each complex coefficient to an amplitude with =IMABS(...) divided by the number of data points to normalize, and copy the formula down the column.
4. Create a line chart by highlighting the three columns of data and selecting “Insert” from the top
menu bar. Choose the line chart option with markers.
5. Right-click on the chart and select “Select Data” from the dropdown menu. Click on “Add” to add
a new series of data.
6. In the “Edit Series” dialog box that appears, enter a name for the series (such as “FFT”) and select
the column of FFT values you just calculated.
7. Click “OK” to close the dialog box and return to the chart. You should now see a second line plotted on the chart, representing the FFT of the signal.
8. Adjust the x-axis and y-axis scales to better view the FFT by right-clicking on the chart and selecting “Format Axis” from the dropdown menu.
9. Under the "Axis Options" tab, set the axis bounds so that only the first half of the frequency bins is displayed; bins above the Nyquist frequency (half the sampling rate, here 0.5 Hz) simply mirror the lower half.
10. Under the “Vertical Axis” tab, select the “Logarithmic scale” checkbox to enable logarithmic scaling on the y-axis.
11. Adjust the formatting of the chart as needed to improve readability.
That’s it! You’ve now performed Fourier analysis on the signal using Excel and visualized the results
with a chart.
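As noted earlier, a language such as Python is a natural next step for this kind of work. The sketch below (with an illustrative helper named `spectrum`, assuming this dataset's 1-second sampling interval) maps each Fourier bin to a frequency in hertz and reports the dominant one:

```python
import cmath

def spectrum(signal, dt):
    # Magnitude spectrum of a real signal, plus the frequency (Hz) of each
    # bin, given the sampling interval dt in seconds. Plain O(n^2) DFT.
    n = len(signal)
    coeffs = [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
              for k in range(n)]
    half = n // 2 + 1                          # bins 0 .. n/2 cover 0 .. Nyquist
    freqs = [k / (n * dt) for k in range(half)]
    mags = [abs(c) / n for c in coeffs[:half]] # /n normalizes, as in the text
    return freqs, mags

signal = [1.0, 0.7, -0.3, -1.0, -0.5, 0.5, 1.0, 0.5, -0.5, -1.0]
freqs, mags = spectrum(signal, dt=1.0)
peak = max(range(1, len(mags)), key=lambda k: mags[k])  # skip the DC bin
print("dominant frequency:", freqs[peak], "Hz")
```

Running this shows where the signal's energy actually concentrates, which is a useful cross-check on the chart produced in Excel.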
Image showing data entered into two columns. Clicking Data Analysis under the Data tab opens the Data Analysis window; Fourier Analysis is chosen and the OK button is clicked.
Image showing the Fourier Analysis window. Click in the Input range field and, when the cursor blinks, select the column of data to be analyzed; the cell addresses are entered automatically. If the selection includes a heading, check "Labels in First Row"; otherwise leave it unchecked. In the Output range field, choose the cells where the results should be displayed. On clicking the OK button the result is displayed.
Image showing the result displayed.
Image showing line graph created with the dataset
15
Histogram
A histogram is a graphical representation of the distribution of a dataset. It is a way of showing the frequency distribution of continuous data. In a histogram, the data is divided into a set of intervals or bins, and the number of observations that fall within each bin is plotted as a bar. The height of each bar corresponds to the frequency or number of observations in that bin.
Histograms are useful for quickly identifying the shape of the distribution of a dataset, as well as
identifying any outliers or unusual values. They are commonly used in data analysis, statistics, and
scientific research to visualize the distribution of continuous variables, such as age, height, weight, and
temperature. They can also be used to compare the distributions of two or more datasets.
Histograms play an important role in statistics as they are a common tool used to visualize the distribution of a dataset. They are useful in understanding the shape, center, and spread of a dataset, as well
as identifying outliers and unusual values.
Histograms can provide important insights into the underlying structure of the data, and can help
researchers to identify patterns or trends that may be present. They can also help to identify data that
is skewed or non-normal, which can impact the validity of statistical tests.
Histograms can also be used to compare the distribution of two or more datasets. By overlaying histograms of different datasets, researchers can quickly identify differences in the shape, center, and spread
of the data. This can be useful in determining whether two groups are significantly different from each
other, and can help to identify any factors that may be contributing to differences in the data.
Overall, histograms are a valuable tool for statisticians and researchers, as they provide a quick and
easy way to visualize the distribution of a dataset and identify important features of the data.
Advantages of Histogram:
Histograms have several advantages that make them a popular tool for visualizing data. Some of the
advantages of histograms include:
1. Easy to understand: Histograms are easy to read and interpret, even for those without a statistical
background. The bars represent the number of observations in each bin, providing a clear picture of
the distribution of the data.
2. Quick to create: Histograms are relatively quick to create, making them a useful tool for exploring
data and identifying patterns.
3. Visualize large datasets: Histograms can be used to visualize large datasets with many observations,
providing a clear picture of the overall distribution of the data.
4. Identify outliers: Histograms can help to identify outliers or unusual values that may be present in
the data. These outliers can be important to identify, as they can impact the validity of statistical tests
and models.
5. Compare distributions: Histograms can be used to compare the distribution of two or more datasets. By overlaying histograms of different datasets, researchers can quickly identify differences in the
shape, center, and spread of the data.
6. Identify patterns: Histograms can be used to identify patterns or trends in the data, which can be
useful in developing hypotheses or identifying areas for further investigation.
Overall, histograms are a valuable tool for visualizing data and can provide important insights into the
underlying structure of the data.
Steps to understand a histogram:
Here are some steps to understand and interpret a histogram:
1. Look at the horizontal axis: The horizontal axis of the histogram represents the range of values for
the variable being plotted. Each bar on the histogram represents a specific range of values, known as a
bin. The width of each bin is determined by the range of values for the variable being plotted.
2. Look at the vertical axis: The vertical axis of the histogram represents the frequency or count of observations in each bin. The height of each bar on the histogram represents the number of observations
that fall within each bin.
3. Identify the shape of the distribution: The shape of the histogram can provide important insights
into the distribution of the data. Histograms can have different shapes, including normal, skewed,
bimodal, or uniform. A normal distribution has a bell-shaped curve, while a skewed distribution has a
longer tail on one side.
4. Identify the center of the distribution: The center of the distribution can be identified by looking
for the bin with the highest frequency or count. This is often referred to as the mode or peak of the
distribution.
5. Identify the spread of the distribution: The spread of the distribution can be identified by looking at
the width of the histogram bars. A wider histogram indicates a larger range of values for the variable
being plotted, while a narrower histogram indicates a smaller range of values.
6. Look for outliers: Outliers are values that are significantly different from the rest of the data. They can be identified by looking for isolated bars that sit far away from the main body of the histogram.
By following these steps, one can easily understand and interpret a histogram to gain insights into
the underlying distribution of the data.
Types of data distribution as seen in a histogram:
Histograms can show different types of data distributions, depending on the shape of the bars. Here
are some common types of data distributions that can be seen in a histogram:
Normal distribution: A normal distribution is symmetrical and bell-shaped, with the highest frequency in the center and a gradual decrease in frequency on either side. In a histogram, a normal
distribution will have bars that are approximately the same height in the center and gradually decrease in height towards the edges.
Skewed distribution: A skewed distribution is not symmetrical and has a longer tail on one side.
There are two types of skewed distributions: positively skewed and negatively skewed. In a positively
skewed distribution, the tail is on the right side, and in a negatively skewed distribution, the tail is
on the left side. In a histogram, a skewed distribution will have bars that are higher on one side and
gradually decrease in height towards the other side.
Bimodal distribution: A bimodal distribution has two distinct peaks, indicating that there are two
groups of data within the dataset. In a histogram, a bimodal distribution will have two bars that are
roughly the same height and separated by a trough.
Uniform distribution: A uniform distribution has bars that are approximately the same height and
indicates that the values of the variable are evenly distributed across the range. In a histogram, a uniform distribution will have bars that are roughly the same height across the entire range of values.
Overall, histograms can show a variety of data distributions, and by understanding the shape of the
bars, one can gain insights into the underlying structure of the data.
Creating histogram using Excel:
Here’s a sample dataset for plotting a histogram:
1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7, 8, 8, 8, 8, 9, 10
Here are the steps to create a histogram in Excel:
1. Enter the data into a new worksheet in Excel.
2. Click on the “Insert” tab in the Excel ribbon.
3. Click on the “Histogram” button in the “Charts” section of the ribbon.
4. In the “Histogram” dialog box, select the range of data you want to use for the histogram.
5. Specify the bin range and bin width for the histogram. The bin range is the range of values you
want to include in each bin, and the bin width is the size of each bin. For example, you could specify
a bin range of 1 to 10 and a bin width of 1.
6. Choose whether to create the histogram in a new worksheet or embed it in the current worksheet.
Click “OK” to create the histogram.
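The same binning that Excel performs behind the scenes can be made visible with a few lines of Python. This text-mode sketch uses the sample dataset above with one bin per integer value (bin width 1):

```python
from collections import Counter

# The sample dataset from the text.
data = [1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7, 8, 8, 8, 8, 9, 10]

counts = Counter(data)  # frequency of each value: one bin per integer
for value in range(min(data), max(data) + 1):
    # Each '#' is one observation, so bar length equals bin frequency.
    print(f"{value:3d} | {'#' * counts[value]}")
```

The tallest bars (at 6 and 8) correspond to the tallest columns Excel draws, which makes it easy to verify the chart against the raw counts.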
Image showing data entered into columns
Image showing frequency of data range calculated using the formula as shown in the image above
Image showing the Data Analysis window, opened by clicking Data Analysis under the Data tab
Image showing the Histogram window, where the Input range is filled by selecting the cell range and the Bin range by selecting the bin cells. Since the column header was included in the selection, the Labels box is checked. On clicking OK the histogram is generated.
Image showing Histogram generated
Once you’ve created the histogram, you can modify the chart title, axis titles, and other formatting
options to customize the look of the chart.
Example dataset that resembles normal distribution:
1. Open a new Excel worksheet and enter the following formulas in cells A1 and A2 respectively:
=NORM.INV(RAND(),50,10)
=NORM.INV(RAND(),50,10)
This will generate two random values that follow a normal distribution with a mean of 50 and a standard deviation of 10.
Image showing two columns of Excel filled with random numbers using the formula highlighted in
green. Cells can be auto populated by dragging the handle downwards.
2. Select cells A1 and A2, and then drag the fill handle down to fill the formula down to as many
rows as you need. For example, if you want to generate 1000 values, drag the fill handle down to row
1000.
3. Select the entire column A, and then click on the “Insert” tab in the ribbon.
4. Click on the “Histogram” button in the “Charts” group.
5. In the “Histogram” dialog box, select “Column” chart type, and then click on the “Bins” field to
specify the number of bins you want. For example, you can set the number of bins to 20.
Image showing Histogram showing Normal distribution
6. Click on “OK” to create the histogram.
The resulting histogram will show the distribution of the random values generated by the NORM.INV function, which should resemble a normal distribution with a mean of 50 and a standard deviation of 10. You can adjust the parameters of the NORM.INV function to generate a different mean and standard deviation if desired.
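The same simulation can be reproduced outside Excel. This Python sketch is the analogue of filling a column with =NORM.INV(RAND(),50,10); the seed is fixed only so that repeated runs give the same numbers:

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is reproducible
# 1000 draws from a normal distribution, mean 50, standard deviation 10.
values = [random.gauss(50, 10) for _ in range(1000)]

print(round(statistics.mean(values), 2), round(statistics.stdev(values), 2))
```

With 1000 draws the sample mean and standard deviation land close to 50 and 10, which is why the histogram of such a column takes on the familiar bell shape.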
Positively skewed distribution:
Here’s a sample dataset that demonstrates a skewed distribution:
12, 16, 18, 20, 22, 24, 26, 28, 30, 40, 50, 60, 70, 80, 90, 100
This dataset has a positively skewed distribution because most of the values are on the lower end of
the scale, with only a few values on the higher end.
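Skewness can also be quantified rather than just eyeballed. The sketch below computes the moment-based skewness coefficient for the dataset above; a positive value confirms the longer right tail:

```python
import statistics

def skewness(xs):
    # Moment-based (population) skewness: E[(x - mean)^3] / stdev^3.
    # Positive means a longer right tail, negative a longer left tail.
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    n = len(xs)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

data = [12, 16, 18, 20, 22, 24, 26, 28, 30, 40, 50, 60, 70, 80, 90, 100]
print(round(skewness(data), 3))
```

A perfectly symmetric dataset gives a skewness of zero, so the sign of this number is a quick numerical check on what the histogram shows.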
To create a skewed distribution in Excel, you can follow these steps:
1. Open a new Excel spreadsheet and enter a list of values. These can be any set of numbers, but to
create a skewed distribution, it’s best to use a set of values that are not evenly distributed.
2. Select the data range by clicking and dragging over the cells containing your data.
3. Click the “Insert” tab in the top menu and select the “Column” chart type. Choose any chart style
you like.
4. Your chart will appear in the worksheet. Click on the chart to activate it.
5. Right-click on any of the data bars in the chart and select “Format Data Series” from the dropdown menu.
6. In the “Format Data Series” dialog box, click on the “Series Options” tab.
7. Adjust the “Gap Width” slider to reduce the spacing between the bars. This will create a narrower
histogram-like chart.
8. Check the “Logarithmic scale” box to change the chart scale to a logarithmic one, which is a common technique to make skewed distributions more visible.
9. Click “Close” to close the dialog box and view the chart with the new settings. You should now
have a histogram-style chart that shows a skewed distribution.
Image showing positively skewed data distribution
Here's an example dataset that exhibits a negatively skewed distribution:
5, 20, 40, 55, 65, 70, 75, 78, 80, 82, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95
This dataset is negatively skewed because most of the values are on the higher end of the scale, with only a few small values. When plotted on a histogram, the distribution appears skewed to the left, with a longer tail on the left side and the peak of the distribution on the right side.
Bimodal histogram:
Here's an example dataset that can be used to create a bimodal histogram in Excel:
20, 22, 25, 27, 30, 32, 35, 38, 40, 80, 82, 85, 88, 90, 92, 95, 98, 100, 105, 110, 115
The values cluster around two separate centers (roughly 30 and 95), which is what produces the two peaks of a bimodal histogram.
To create a bimodal histogram using Excel, follow these steps:
1. Open a new or existing Excel spreadsheet.
2. Enter the dataset in a column or row.
3. Select the dataset by clicking and dragging over the cells.
4. Click on the “Insert” tab in the Excel ribbon.
5. Click on the “Histogram” icon in the “Charts” group.
6. Select the “Histogram” chart type.
7. In the "Histogram" dialog box, set the number of bins so that the two clusters appear as separate peaks; around 8 to 10 bins works well for this dataset. (Too few bins would merge each cluster into a single bar and hide the bimodal shape.)
8. Click "OK" to create the histogram.
Your bimodal histogram should now be created in Excel. Note that you can customize the chart’s appearance by adjusting the chart elements, formatting, and axis labels.
Image showing Bimodal Histogram
A bimodal histogram is a type of histogram that displays two distinct peaks or modes in the data distribution. The features of a bimodal histogram are as follows:
1. Two distinct peaks: The bimodal histogram displays two separate peaks that represent the two
modes in the data distribution.
2. Symmetrical or skewed: The peaks in a bimodal histogram can be symmetrical or skewed, depending on the shape of the data distribution.
3. Central tendency: Bimodal histograms often indicate that there are two central tendencies or modes
in the data distribution. This means that the data may represent two different groups or populations.
4. Separation between modes: The separation between the two modes in the bimodal histogram indicates the degree of difference between the two groups or populations.
5. Normal or non-normal distribution: Bimodal histograms can be normal or non-normal in distribution, depending on the shape of the data.
6. Outliers: Outliers may be present in a bimodal histogram, which can affect the shape of the distribution and the location of the peaks.
Overall, a bimodal histogram provides a visual representation of the presence of two distinct groups
or populations within a dataset. It is a useful tool for identifying patterns and trends in data analysis
and can help to guide further statistical analysis.
Uniform distribution Histogram:
A uniform distribution histogram is a type of histogram that displays a data distribution that is evenly spread out across the entire range of values. In a uniform distribution, the probability of any given
value occurring is equal to the probability of any other value occurring within the same range. This
means that all values in the distribution have an equal chance of being selected.
The features of a uniform distribution histogram are as follows:
1. Rectangular shape: The histogram of a uniform distribution has a rectangular shape, indicating
that each value in the range is equally likely to occur.
2. Flat top: The top of the histogram is flat, indicating that there are no peaks or valleys in the distribution.
3. Equal probability: Each value in the range has an equal probability of occurring, resulting in a
uniform probability density function.
4. No outliers: Since all values have an equal probability of occurring, outliers are not present in a
uniform distribution.
Uniform distributions are often used as a baseline for comparison with other distributions. They are
also commonly used in simulations and modeling to represent random variables with equal probability of occurrence. Uniform distribution histograms can provide useful insights into the behavior of
certain data sets and can be used to make predictions or decisions based on the likelihood of certain
outcomes.
Here is an example of sample data that could be used to create a uniform distribution histogram:
1. Create a new Excel spreadsheet and enter "Value" in cell A1 as the column heading.
2. In cells A2 through A101, enter random numbers between 0 and 1, generated using the RAND() function.
To create a histogram of this data using Excel, follow these steps:
1. Select the data range (A1:A101 in this case).
2. Go to the “Insert” tab on the ribbon and click on “Histogram” in the “Charts” section.
3. Choose “Histogram” from the dropdown list.
4. Excel will create a default histogram with a bin size that it chooses automatically. You can make the bins more or less precise by right-clicking on the horizontal axis, choosing "Format Axis", and adjusting the bin width under "Axis Options".
5. You can also add a chart title and axis labels as needed to make the histogram easier to read.
Once you have created the histogram, you should be able to see that the data is uniformly distributed, with each bin containing roughly the same number of values. This type of histogram is useful for
visualizing data that has a range of values that are all equally likely to occur, such as the outcomes of
a dice roll or the arrival times of buses at a stop.
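The RAND()-based worksheet above has a direct Python counterpart. This sketch draws 1000 uniform values in [0, 1) and counts how many land in each of 10 equal-width bins; the roughly equal counts are the flat top of the uniform histogram:

```python
import random
from collections import Counter

random.seed(1)  # fixed seed for a reproducible run
# Analogue of filling A2:A101 with =RAND(), but with 1000 draws.
values = [random.random() for _ in range(1000)]

# Bin 0 is [0.0, 0.1), bin 1 is [0.1, 0.2), and so on.
bins = Counter(int(v * 10) for v in values)
for b in range(10):
    print(f"[{b / 10:.1f}, {(b + 1) / 10:.1f}) : {bins[b]}")
```

Each bin's expected count is 100, and the observed counts hover around that value, differing only by sampling noise.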
Image showing Uniform distribution Histogram
16
Moving Average
Moving Average is a statistical method that is used to analyze data points by creating a series of averages over a specified period of time. The moving average is calculated by taking the average of a set of data points over a specified period of time and then moving the window of time forward, creating a new average for the next period.
For example, if you were analyzing stock prices, you might use a moving average to smooth out the
fluctuations in price over a period of time. You would calculate the average price over a specific time
period, such as 30 days, and then move the window forward by one day, calculating a new average for
the next 30 days. This process continues for the entire data set, resulting in a series of average prices
that can help you identify trends and patterns in the data.
Moving averages can be simple or weighted, depending on how the data is weighted. Simple moving
averages give equal weight to all data points, while weighted moving averages give more weight to
recent data points. Moving averages are commonly used in finance, economics, and other fields to
analyze data over time.
Advantages of using Moving average:
There are several advantages of using moving averages in data analysis:
1. Smooths out fluctuations: One of the main advantages of moving averages is that it smooths out
fluctuations in data by removing short-term fluctuations, thereby making it easier to identify trends
and patterns in the data.
2. Reduces noise: Moving averages reduce noise in the data, which can make it easier to identify
underlying trends and patterns. This can be particularly useful in financial analysis, where short-term
price fluctuations can be difficult to interpret.
3. Provides a clearer picture of the trend: Moving averages can provide a clearer picture of the trend in
the data over time, making it easier to identify the direction of the trend and whether it is increasing,
decreasing, or remaining stable.
4. Simple to calculate: Moving averages are relatively simple to calculate and can be easily implemented in most software and programming languages.
5. Widely used: Moving averages are a widely used statistical tool, which means that there is a large
body of research and analysis available that can help you interpret the results of your analysis.
6. Works well with non-stationary data: Moving averages can work well with non-stationary data, which means they can be used to analyze data that does not have a constant mean or variance over time.
Calculating a moving average using Excel:
To calculate a moving average in Excel, the simplest approach is to use the AVERAGE function with a relative range that shifts as the formula is copied. Here are the steps:
1. Enter your data into a column in Excel (say column A, starting in cell A1).
2. Decide on the period you want to use for your moving average (e.g., 30 days).
3. Create a new column next to your data column and label it "Moving Average."
4. In the cell alongside the last data point of the first window, enter the formula for that window. For a 30-day period with data starting in A1, enter in cell B30:
=AVERAGE(A1:A30)
5. Copy the formula down to the rest of the cells in the Moving Average column. Because the cell references are relative, the averaged window shifts down by one row in each cell. (The OFFSET function can also be used to build the window dynamically, but relative references are simpler and less error-prone.)
6. The resulting values in the Moving Average column will show the moving average for the specified period.
Note: You can also use the built-in Moving Average tool in Excel's Analysis ToolPak, which simplifies the process. To use it, click on "Data" in the ribbon, select "Data Analysis," and then "Moving Average." (If "Data Analysis" does not appear on the Data tab, enable the Analysis ToolPak add-in first.)
While moving averages can be a useful tool in data analysis, there are several pitfalls that should be
taken into consideration:
1. Lag: Moving averages introduce a lag into the data, which means that the moving average may not
respond as quickly to changes in the data as other methods, such as exponential smoothing.
2. Sensitivity to outliers: Moving averages can be sensitive to outliers or extreme values in the data. If
there are significant outliers in the data, the moving average may not accurately reflect the underlying
trend.
3. Choice of period: The choice of the period used for the moving average can significantly impact the results. A short period may provide a more sensitive indicator of short-term changes, but it may also introduce more noise into the data, while a longer period may smooth out the data too much and mask important short-term changes.
4. May not capture sudden changes: Moving averages may not capture sudden changes or shocks in
the data. This is because moving averages are designed to smooth out the data over a period of time,
so sudden changes may take some time to be reflected in the moving average.
5. May not be appropriate for non-linear trends: Moving averages assume a linear trend in the data,
which means that they may not be appropriate for data with non-linear trends, such as exponential
or quadratic trends.
It’s important to consider these potential pitfalls when using moving averages in data analysis and to
use them in conjunction with other methods to gain a more complete understanding of the data.
Example:
Here’s an example data set that we can use to calculate a 5-day moving average:
Date        Price
2022-01-01 10
2022-01-02 12
2022-01-03 14
2022-01-04 16
2022-01-05 18
2022-01-06 20
2022-01-07 22
2022-01-08 24
2022-01-09 26
2022-01-10 28
To calculate a 5-day moving average using Excel, follow these steps:
Enter the above data into Excel in columns A and B.
Create a new column C and label it “Moving Average”.
In cell C6 (the first cell where the moving average will appear), enter the following formula:
=AVERAGE(B2:B6)
This will calculate the average of the first 5 prices.
Copy the formula down to the rest of the cells in the “Moving Average” column. You can do this by
selecting cell C6, hovering over the bottom-right corner until you see a black “+” symbol, and then
dragging down to the last row.
The values in the "Moving Average" column should now show the 5-day moving average for each day in the data set.
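The six moving-average values can be cross-checked outside Excel; this small Python sketch (ours, not part of the book's workflow) reproduces what the copied =AVERAGE(B2:B6) formula computes:

```python
# The chapter's 5-day moving-average example, recomputed in Python.
prices = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
window = 5
moving_avg = [sum(prices[i:i + window]) / window
              for i in range(len(prices) - window + 1)]
print(moving_avg)  # [14.0, 16.0, 18.0, 20.0, 22.0, 24.0]
```

The first value, 14.0, corresponds to cell C6 in the worksheet, and each later value to the formula copied one row further down.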
Note: As mentioned earlier, Excel also has a built-in “Moving Average” function that you can use. To
use this function, select the range of cells that contain your data (including the headers), then go to
“Data” in the ribbon, select “Data Analysis,” and then “Moving Average.” In the “Moving Average”
dialog box, enter the range of cells that contains your data in the “Input Range” field, the number
of periods you want to use in the “Interval” field, and then select the location where you want the
results to appear. Click “OK” to generate the moving average.
Image showing data entered in columns A and B. It also shows the location of the Moving Average function listed under the Data Analysis tab.
Image showing the price average calculated using the AVERAGE formula built into Excel. On pressing the Enter key, the value is entered into the cell. To fill the rest of the cells in column C, the user drags down the handle indicated by a square dot at the bottom-right corner of the cell; the subsequent cells are then filled automatically with their respective average values. This autofill feature is an excellent time-saver.
Image showing the Moving Average option listed in the Data Analysis window. The Data Analysis window can be opened by clicking on the Data tab; this sequence has already been described in earlier chapters. The Moving Average option is selected and the OK button is clicked, which opens the Moving Average dialog box.
Image showing the Moving Average dialog box.
Image showing the Moving Average dialog box. The cursor is placed in the Input Range field and clicked; when the cursor starts to blink, the cells containing the data values are selected, and their addresses are automatically entered into the field. If the header row was included in the selection, the "Labels in First Row" box should be checked. The cursor is then placed in the Output Range field and clicked; when it starts to blink, the area where the result should be displayed is selected, and the address of the selected cells is entered into the field. To create a chart for the data, the Chart Output box needs to be checked.
Image showing the moving average values displayed in the worksheet and plotted as a chart.
Manual calculation of Moving Average using Excel:
To manually calculate moving average using Excel, you can follow these steps:
1. Enter the data points into a column in Excel.
2. Decide on the number of periods you want to use for the moving average calculation. For example,
if you want to calculate a 3-period moving average, you would use the previous three data points in
the calculation.
3. Create a new column next to the data column and label it “Moving Average.”
4. In the first cell of the Moving Average column, enter the formula “=AVERAGE(A1:A3)” (assuming your data starts in cell A1). This will calculate the moving average for the first three periods.
5. Copy this formula down to the rest of the cells in the Moving Average column, adjusting the cell
references as needed. For example, in the second cell of the Moving Average column, you would use
the formula “=AVERAGE(A2:A4)” to calculate the moving average for the next three periods.
6. You should now have a column with the moving average values for each period.
Note that Excel can also compute moving averages with formulas, such as the AVERAGE function combined with the OFFSET function, or AVERAGE with relative cell references as shown above. These approaches can save you time and effort in calculating moving averages.
Moving average can be used in biostatistics for various purposes, such as:
1. Trend analysis: Moving average can be used to identify trends in time-series data, such as changes
in disease incidence or mortality rates over time. By calculating the moving average over a specific
time period, you can smooth out random fluctuations in the data and identify underlying trends.
2. Seasonal variations: Moving average can also be used to identify seasonal variations in biostatistics
data, such as seasonal allergies or flu incidence. By calculating the moving average over a period that
corresponds to the seasonal pattern, you can identify any changes or fluctuations that occur at the
same time each year.
3. Outlier detection: Moving average can also be used to identify outliers or unusual data points in
biostatistics data. By comparing individual data points to the moving average, you can identify any
data points that deviate significantly from the expected value and may require further investigation.
4. Smoothing data: Moving average can also be used to smooth out noisy data, such as gene expression data or protein expression levels. By calculating the moving average over a specific period, you
can reduce the impact of random measurement errors or other sources of noise in the data and identify underlying patterns or trends.
Overall, moving average can be a useful tool for biostatisticians to analyze and interpret time-series
data, identify trends, and make predictions about future outcomes.
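To make use 3 above concrete, here is a minimal sketch (the data, window size, and threshold are invented for illustration): each point is compared with the mean of its neighbours, and points that deviate by more than the threshold are flagged.

```python
# Flag indices whose value deviates from the mean of its neighbours
# (centre point excluded) by more than `threshold`.
def flag_outliers(values, window=3, threshold=15.0):
    half = window // 2
    flagged = []
    for i in range(half, len(values) - half):
        neighbours = values[i - half:i] + values[i + 1:i + half + 1]
        local_mean = sum(neighbours) / len(neighbours)
        if abs(values[i] - local_mean) > threshold:
            flagged.append(i)
    return flagged

weekly_cases = [12, 13, 12, 40, 13, 12, 11]   # hypothetical case counts
print(flag_outliers(weekly_cases))  # [3]: the spike at index 3
```

Excluding the centre point from the local mean keeps a large outlier from inflating its own reference average.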
Types of Moving Averages:
There are several types of moving averages that can be used in statistical analysis, including:
1. Simple Moving Average (SMA): This is the most basic form of moving average, calculated by taking the arithmetic mean of a specified number of data points over a given time period.
2. Weighted Moving Average (WMA): In this type of moving average, more weight is given to the
most recent data points, with decreasing weight assigned to earlier data points.
3. Exponential Moving Average (EMA): EMA is similar to WMA, but it places more weight on
the most recent data points and uses an exponential function to decrease the weight of earlier data
points.
4. Triangular Moving Average (TMA): TMA is a weighted moving average that places more weight
on the middle data points, with decreasing weight assigned to the first and last data points.
5. Adaptive Moving Average (AMA): AMA is a moving average that adjusts the smoothing factor
based on the volatility of the data, with higher smoothing factors used for less volatile data and lower
smoothing factors used for more volatile data.
6. Cumulative Moving Average (CMA): CMA is a moving average that calculates the average of all the data points up to a specific point in time, with equal weight assigned to each data point.
Each type of moving average has its own advantages and disadvantages, and the choice of moving
average depends on the specific data set and the analysis objectives.
Simple Moving Average: Already described in this chapter, with an example.
Weighted Moving Average:
A weighted moving average (WMA) is a type of moving average that assigns different weights to different data points in the time series. Unlike a simple moving average (SMA), where each data point
is given equal weight, the WMA places more weight on more recent data points, while gradually
reducing the weight of older data points.
To calculate the WMA, you need to follow these steps:
1. Determine the number of periods you want to use in the calculation. This will be the window size
or the number of data points you want to include in the moving average.
2. Assign weights to each of the data points in the time series. The most recent data point is assigned
the highest weight, and the weight gradually decreases for earlier data points. The weights must add
up to 1.0.
3. Multiply each data point by its corresponding weight.
4. Sum up the products of the data points and weights.
5. Divide the sum by the total weight.
Here is an example calculation of a WMA with a window size of 3 and weights of 0.5, 0.3, and 0.2:
Suppose you have the following data points:
10, 20, 30, 40, 50
The weights for the WMA are: 0.5, 0.3, and 0.2.
The first weighted average is calculated as follows:
(0.5 * 50) + (0.3 * 40) + (0.2 * 30) = 25 + 12 + 6 = 43
The second weighted average is calculated as follows:
(0.5 * 40) + (0.3 * 30) + (0.2 * 20) = 20 + 9 + 4 = 33
And so on.
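The two hand calculations above can be checked with a few lines of Python (ours; the weights are listed most-recent-first to match the worked example):

```python
# Weighted moving average over a 3-point window; the weights sum to 1,
# and the newest point is paired with the largest weight.
weights = [0.5, 0.3, 0.2]           # newest -> oldest
data = [10, 20, 30, 40, 50]

def wma(window):
    # window is ordered oldest -> newest, so reverse it for the pairing
    return sum(w * x for w, x in zip(weights, reversed(window)))

print(wma(data[2:5]))  # 30, 40, 50 -> 43.0
print(wma(data[1:4]))  # 20, 30, 40 -> 33.0
```

Because the weights already sum to 1.0, no final division by the total weight is needed.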
WMA can be a useful tool for smoothing out a time series and identifying underlying trends, especially when there is significant variation in the data over time.
Here’s an example of how to calculate the WMA of a set of data points using Excel:
Suppose you have the following data points:
10, 20, 30, 40, 50, 60
And you want to calculate the WMA with a window size of 3, where the most recent data point is
given a weight of 0.6, the second-most recent data point is given a weight of 0.3, and the third-most
recent data point is given a weight of 0.1.
Here are the steps to calculate the WMA in Excel:
1. Enter the data points in a column in an Excel worksheet.
2. In another column, calculate the weights for each data point. In this example, the most recent data
point (60) is given a weight of 0.6, the second-most recent data point (50) is given a weight of 0.3,
and the third-most recent data point (40) is given a weight of 0.1. To do this, enter the weights in a
separate column next to the data points.
3. In another column, multiply each data point by its corresponding weight. To do this, enter the
formula “=A1*B1” (where A1 is the data point and B1 is the weight) in the cell next to the first data
point, and then drag the formula down to the last data point.
4. In another cell, sum up the products of the data points and weights. To do this, enter the formula
“=SUM(C1:C3)” (where C1:C3 are the cells containing the products of the data points and weights).
5. In another cell, divide the sum by the total weight. To do this, enter the formula "=D1/SUM(B1:B3)" (where D1 is the cell containing the sum of the products and B1:B3 are the cells containing the weights).
6. The result is the WMA for the most recent data point.
7. To calculate the WMA for the next data point, repeat steps 3-6, but shift the window down by one
data point.
8. Repeat this process until you have calculated the WMA for all the data points.
Note that in Excel, you can use the “SUMPRODUCT” function to calculate the sum of the products
of the data points and weights in step 4. For example, if the data points are in column A and the
weights are in column B, the formula would be “=SUMPRODUCT(A1:A3,B1:B3)”.
Exponential Moving Average:
Exponential Moving Average (EMA) is a type of moving average that places greater weight on the
most recent data points in a time series. It is calculated by taking the weighted average of the previous n periods, where the weight of each period is determined by a smoothing factor or a smoothing
constant, which is usually represented by the symbol alpha (α).
The formula for calculating EMA is:
EMA = (Close - EMA_prev) x α + EMA_prev
Where:
Close: the closing price of the asset being tracked.
EMA_prev: the value of the EMA in the previous period.
α: the smoothing factor, which is calculated using the number of periods being considered.
The value of the smoothing factor determines how much weight is given to the most recent data
points in the time series. Generally, the smaller the value of alpha, the greater the weight given to the
older data points, and the smoother the moving average line.
EMA is commonly used in technical analysis of financial markets to identify trends and forecast
future price movements. It is also used in other fields where time series analysis is important, such as
engineering, economics, and epidemiology.
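The recurrence above is easy to run in code. In this sketch (ours), the smoothing factor uses the common convention α = 2/(n + 1); that choice is an assumption, since the text does not fix a formula for α:

```python
# Exponential moving average via the recurrence
#   EMA = (Close - EMA_prev) * alpha + EMA_prev
def ema(values, n):
    alpha = 2 / (n + 1)        # common convention (assumed, not from text)
    result = [values[0]]       # seed the series with the first observation
    for close in values[1:]:
        prev = result[-1]
        result.append((close - prev) * alpha + prev)
    return result

print(ema([10, 20, 30, 40], n=3))  # alpha = 0.5 -> [10, 15.0, 22.5, 31.25]
```

Note how each EMA value moves only partway toward the newest observation, which is exactly the smoothing behaviour the recurrence encodes.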
Triangular moving average:
Triangular Moving Average (TMA) is a type of moving average that places greater weight on the
middle portion of the time series. It is similar to other moving averages, but instead of using a simple
arithmetic mean or an exponential weighting function, it uses a triangular weighting function.
The TMA is calculated by taking the average of the data points within a specified number of periods,
and then applying a triangular weighting function to give greater weight to the data points in the
middle of the time series. The formula for calculating TMA is:
TMA = (w1 x P1) + (w2 x P2) + (w3 x P3) + ... + (wn x Pn)
where:
w1, w2, w3, ..., wn are the weights given to each period. These weights follow a triangular pattern,
with the middle period receiving the highest weight, and the weights tapering off towards the edges
of the time series.
P1, P2, P3, ..., Pn are the data points in the time series being averaged.
The TMA is used in technical analysis to smooth out price movements and identify trends in financial markets. It can also be used in other fields where time series analysis is important, such as weather forecasting and econometrics.
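The triangular weighting can be sketched as follows (ours; the normalisation step, which makes the weights sum to 1, is assumed rather than taken from the formula above):

```python
# Triangular weights rise to the middle of the window and taper off.
def triangular_weights(n):
    raw = [min(i + 1, n - i) for i in range(n)]   # n=5 -> 1, 2, 3, 2, 1
    total = sum(raw)
    return [w / total for w in raw]

def tma(window):
    return sum(w * x for w, x in zip(triangular_weights(len(window)), window))

print(triangular_weights(5))             # weights 1/9, 2/9, 3/9, 2/9, 1/9
print(round(tma([10, 20, 30, 40, 50]), 6))  # 30.0: the middle value dominates
```

For symmetric data like this example, the TMA coincides with the plain mean; the difference shows up when the data rises or falls unevenly across the window.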
Adaptive moving average:
Adaptive Moving Average (AMA) is a technical analysis indicator that is used to smooth out price
movements in financial markets. It is a variation of the traditional moving average (MA) that adjusts
its sensitivity based on market volatility.
AMA applies a smoothing factor that gives greater weight to recent price data when market volatility
is high, and less weight to old price data when volatility is low. The formula used to calculate AMA
involves a variable called the Efficiency Ratio (ER), which measures the strength of the current trend
in the market.
The AMA indicator is useful for identifying trends and potential trend reversals. It can be used in
conjunction with other technical indicators and chart patterns to make trading decisions. When the
AMA is rising, it indicates a bullish trend, and when it is falling, it indicates a bearish trend.
Overall, the Adaptive Moving Average is a useful tool for traders who want to adjust their trading
strategies to changes in market conditions. It helps to eliminate noise in the price data and provides a
clearer picture of the underlying trend.
Cumulative moving average:
A cumulative moving average (CMA) is a type of moving average that calculates the average of all the data points observed so far. Unlike a simple moving average, which only considers a fixed number of the most recent data points, a CMA takes every data point up to the current one into account, with equal weight assigned to each.
To calculate the CMA, you start by averaging the data points observed so far: add them up and divide by their count. This gives you the initial CMA value. From there, you can update the CMA with each new data point using the following formula:
CMA = (New Data Point + (n - 1) * Previous CMA) / n
In this formula, n is the number of data points observed so far (including the new one), "New Data Point" refers to the most recent data point, and "Previous CMA" refers to the CMA value after the previous data point. Using this recurrence, you can update the CMA as new data becomes available without re-summing the whole series.
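The update formula can be verified against the plain running mean with a few lines of Python (ours); the check below treats the period as the count of points seen so far:

```python
# Incremental CMA: fold in each new point without re-summing the series.
data = [10, 20, 30, 40]
cma = data[0]
for n, x in enumerate(data[1:], start=2):
    cma = (x + (n - 1) * cma) / n
    assert cma == sum(data[:n]) / n   # agrees with the direct average
print(cma)  # 25.0
```

The assertion inside the loop confirms that the recurrence and the direct average stay in lockstep at every step.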
17
Random Number Generation

Random numbers are numbers generated by a process that is unpredictable and non-reproducible. These numbers are used in a variety of applications, such as cryptography, simulations, and statistical analysis.
True random numbers are generated by natural phenomena, such as atmospheric noise, radioactive
decay, or thermal noise. These sources of randomness are considered to be truly random because they
are inherently unpredictable and unbiased.
Pseudo-random numbers, on the other hand, are generated by algorithms or mathematical formulas.
Although they appear to be random, they are actually deterministic and follow a pattern that can be
reproduced if the algorithm or formula is known.
Random numbers have many important applications, such as in cryptography to create secure keys
and in simulations to model complex systems. They are also used in statistical analysis to create random samples and to test the validity of statistical models.
Use of Random Numbers in statistics:
Random numbers are crucial in statistics for several reasons:
1. Sampling: Random numbers are used to select a random sample from a larger population. This
ensures that the sample is representative of the population, which is important for making accurate
inferences about the population as a whole.
2. Randomization: Random numbers are used in experimental design to randomize subjects into
treatment and control groups. This helps to eliminate bias and ensures that the results of the experiment are not due to any pre-existing differences between the groups.
3. Monte Carlo simulations: Monte Carlo simulations are used to model complex systems or processes
using random numbers. By generating random numbers, a simulation can create a range of possible
outcomes, which can help to estimate the probability of different outcomes.
4. Hypothesis testing: Random numbers are used to simulate the null distribution of a test statistic in
hypothesis testing. This allows us to determine the probability of observing a particular result if the
null hypothesis is true.
In all of these applications, random numbers are used to ensure that the results are unbiased and
statistically valid. Without random numbers, statistical analyses would be prone to bias and the results
would be less reliable.
There are different ways to generate random numbers, depending on the type of randomness required.
Here are some methods for generating random numbers:
1. Pseudo-random number generators (PRNGs): PRNGs are deterministic algorithms that generate a
sequence of numbers that appear to be random. They are commonly used in computer programming
and statistical simulations. PRNGs require a seed value to initiate the sequence, which can be a randomly generated value or a fixed value.
2. Hardware random number generators: These devices generate truly random numbers by measuring
natural phenomena, such as radioactive decay, thermal noise, or atmospheric noise. They are often
used in cryptography, where true randomness is required for security.
3. Sampling from a distribution: Random numbers can be generated by sampling from a known distribution, such as a uniform distribution or a normal distribution. This can be done using mathematical
functions or using pre-existing libraries in programming languages like Python or R.
4. Lottery machines: Lottery machines use mechanical devices to generate random numbers. These
are often used in lottery drawings and other games of chance.
It is important to note that while PRNGs can generate sequences of numbers that appear to be random, they are ultimately deterministic and can be predicted if the algorithm or seed value is known.
For applications that require true randomness, hardware random number generators or sampling
from a distribution are preferred.
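The determinism of PRNGs is easy to demonstrate with Python's random module (a sketch of ours): seeding two generators identically yields identical "random" sequences.

```python
import random

# Two generators initialised with the same seed value.
rng1 = random.Random(42)
rng2 = random.Random(42)
seq1 = [rng1.randint(1, 100) for _ in range(5)]
seq2 = [rng2.randint(1, 100) for _ in range(5)]
print(seq1 == seq2)  # True: a PRNG is deterministic given its seed
```

This is exactly why PRNGs are acceptable for simulations (where reproducibility is often a feature) but not, on their own, for cryptographic keys.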
Generating Random Numbers using Excel:
Excel has a built-in function for generating random numbers called “RAND”. Here’s how to use it:
1. Open Excel and select the cell where you want to generate a random number.
2. Type “=RAND()” (without quotes) in the cell and press Enter.
3. Excel will generate a random number between 0 and 1 in the selected cell.
4. If you want to generate a random number within a specific range, such as between 1 and 100, you can use the following formula: "=RAND()*(100-1)+1" (without quotes). This will generate a random decimal number between 1 and 100. (To generate random integers, Excel also provides the RANDBETWEEN function, e.g., "=RANDBETWEEN(1,100)".)
5. If you want to generate multiple random numbers at once, you can use the “Fill” feature in Excel.
Simply select the cell with the first random number, click and drag the fill handle (the small square at
the bottom right corner of the cell) to the desired number of cells, and release. Excel will automatically
fill in each cell with a new random number.
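The two formulas above have direct analogues in Python (ours, for illustration): random() behaves like RAND(), and the same scaling maps its output onto another range.

```python
import random

u = random.random()          # analogue of =RAND(): uniform in [0, 1)
scaled = u * (100 - 1) + 1   # analogue of =RAND()*(100-1)+1
print(0 <= u < 1)            # True
print(1 <= scaled < 100)     # True
```

The scaling works because multiplying a [0, 1) value by the width of the target range and adding the lower bound stretches and shifts the interval.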
Note that the RAND function generates pseudo-random numbers, which are deterministic and can be reproduced if the algorithm and seed are known. If you need truly random numbers for security or other sensitive applications, you should consider using a dedicated random number generator or a more sophisticated algorithm.
Image showing the formula entered in the first cell. On pressing the Enter key, a random value between 0 and 1 appears. Note the small square at the lower-right margin of the cell; it can be used as a handle and pulled down to autofill the lower cells with random numbers between 0 and 1.
Image showing the cells filled with random numbers between 0 and 1.
Image showing the formula entered to generate random numbers between 1 and 100.
Image showing a series of random numbers between 1 and 100 generated and filled in to cells when
the autofill handle is pulled downwards.
Generating random numbers using the built-in random number generator in Excel:
The Data Analysis dialog box should be opened first: the Data Analysis button appears on the ribbon when the Data tab is clicked. Clicking it opens the Data Analysis window. In the Data Analysis window, choose "Random Number Generation" and press the OK button. This opens the Random Number Generation dialog box.
Image showing the Data Analysis button and window.
Image showing the fields in the Random Number Generation window filled in. On clicking the OK button, random numbers are generated.
Image showing Random numbers generated
Random numbers are used in statistics to ensure that the analysis and conclusions drawn from the
data are based on valid and reliable methods. Here are some reasons why random numbers are important in statistics:
1. Reducing bias: Random sampling helps to reduce bias in the selection of samples from a population. By using a random sampling technique, every member of the population has an equal chance of
being selected, which helps to ensure that the sample is representative of the population.
2. Enhancing accuracy: Random numbers are used in statistical models to generate simulations of
complex systems and processes. These simulations can help to provide more accurate predictions of
future outcomes or behavior, which can be useful in fields such as finance, economics, and weather
forecasting.
3. Minimizing error: Random assignment of participants to experimental groups helps to minimize
the risk of error in experimental design. By randomly assigning participants, researchers can ensure
that any observed differences between groups are due to the experimental manipulation and not due
to other variables that could confound the results.
4. Providing a baseline for comparison: Random numbers can be used to create a null distribution,
which provides a baseline for comparison with observed data. This helps to determine whether the
observed results are statistically significant or could have occurred by chance.
In summary, random numbers are a critical component of statistical analysis because they help to
ensure that the results are based on valid and reliable methods that reduce bias, enhance accuracy,
minimize error, and provide a baseline for comparison.
Random numbers are used in sampling to ensure that the sample is representative of the population
and to reduce the risk of bias in the selection of participants. Here are the general steps for using
random numbers in sampling:
1. Define the population: Identify the population from which you want to select a sample. This could
be a group of people, objects, or data points that share common characteristics.
2. Determine the sample size: Decide on the number of participants you want to include in the sample. The sample size should be large enough to be representative of the population, but not so large
that it becomes impractical or expensive to collect data.
3. Generate random numbers: Use a random number generator to generate a set of random numbers. The range of the numbers should correspond to the size of the population. For example, if the
population size is 100, the random numbers should range from 1 to 100.
4. Assign the random numbers to the population: Assign the random numbers to each member of
the population. This can be done by sorting the population by the random numbers or by using the
random numbers to select a sample from the population.
5. Select the sample: Use the random numbers to select the participants for the sample. For example, if you want a sample size of 20, you would select the 20 participants in the population with the corresponding random numbers.
6. Collect data: Collect data from the selected participants and analyze the results.
By using random numbers in sampling, you can ensure that every member of the population has an
equal chance of being selected, which helps to reduce bias and increase the representativeness of the
sample.
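The sampling procedure above can be sketched with Python's random module (ours; population labels 1-100 and a sample size of 20 match the running example):

```python
import random

population = list(range(1, 101))        # members labelled 1 to 100
sample = random.sample(population, 20)  # 20 members, no repeats

print(len(sample))                          # 20
print(len(set(sample)) == len(sample))      # True: no member picked twice
print(all(1 <= m <= 100 for m in sample))   # True
```

random.sample draws without replacement, which matches the sampling steps above: each member can be selected at most once, and each has an equal chance of inclusion.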
18
Rank and Percentile

Rank and percentile are two measures used in statistics to describe the relative position of a data point in a dataset.
Rank refers to the position of a data point in a sorted list of all the values in the dataset. For example,
if we have a set of numbers {10, 5, 7, 3, 8}, the rank of the value 7 would be 3 because it is the third
value when we sort the list in ascending order.
Percentile, on the other hand, refers to the percentage of values in a dataset that are below a certain
value. For example, if a student scores in the 90th percentile on a test, it means that they scored higher
than 90% of the other students who took the test.
To calculate the percentile of a data point in a dataset, we first need to find its rank, and then use the
following formula:
Percentile = (Rank / n) x 100
Where n is the total number of values in the dataset.
For example, if the rank of a value is 20 in a dataset of 100 values, its percentile would be (20/100) x
100 = 20. This means that 20% of the values in the dataset are below this particular value.
Rank and percentile are important concepts in statistics that are used to describe the distribution of
data and to make comparisons between different datasets.
Rank refers to the position of a data point in a dataset when it is sorted in either ascending or descending order. The rank of a data point can be used to determine its relative position in the dataset
and can be useful in making comparisons between different datasets.
Percentile is a measure that divides a dataset into 100 equal parts. Each percentile represents the percentage of data points that fall below that value. For example, the 75th percentile represents the value
below which 75% of the data points fall. Percentiles are useful for understanding the spread of data
and can be used to compare different datasets.
In summary, rank and percentile are important statistical measures that are used to describe the distribution of data and to make comparisons between different datasets. They provide a standardized way
of comparing data points and can be useful in a variety of statistical analyses.
To calculate the rank and percentile of a dataset, you can follow these steps:
1. Sort the dataset in either ascending or descending order.
2. Assign a rank to each data point based on its position in the sorted dataset. The smallest value gets a
rank of 1, the second smallest value gets a rank of 2, and so on.
3. To calculate the percentile of a particular data point, use the following formula:
percentile = (number of data points below the given point / total number of data points) x 100%
For example, if a dataset has 20 data points and you want to calculate the percentile of a data point
that falls at rank 10, then:
percentile = (9 / 20) x 100% = 45%
This means that 45% of the data points in the dataset fall below the given data point.
4. Alternatively, you can use Excel or other statistical software to calculate the rank and percentile of a
dataset automatically.
Note that when calculating the percentile, it is important to use the correct formula depending on
whether you want to calculate a specific percentile or a range of percentiles. For example, to calculate
the median (50th percentile) of a dataset, you would use a different formula than if you wanted to
calculate the 25th percentile or the 75th percentile.
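For readers who prefer to check the arithmetic outside Excel, the manual procedure above can be sketched in a few lines of Python (the dataset is the same one used in the Excel example below):

```python
# A sketch of the manual rank-and-percentile procedure described above.
data = [40, 10, 25, 32, 18, 27, 35, 22, 30, 15]

sorted_data = sorted(data)          # step 1: sort in ascending order
n = len(sorted_data)

for rank, value in enumerate(sorted_data, start=1):   # step 2: assign ranks
    below = rank - 1                # data points strictly below this one
    percentile = below / n * 100    # step 3: percentile formula
    print(f"value={value:3d}  rank={rank:2d}  percentile={percentile:.0f}%")
```

For instance, the value 25 receives rank 5, and since 4 of the 10 data points fall below it, its percentile is 40%.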
Here’s an example dataset:
10, 15, 18, 22, 25, 27, 30, 32, 35, 40
To calculate the rank and percentile of this dataset in Excel, you can follow these steps:
1. Sort the dataset in ascending order by selecting the data and then clicking on the “Sort & Filter”
button in the “Home” tab of the Excel ribbon. Choose “Sort Smallest to Largest” to sort the data in
ascending order.
2. Enter the following formula in cell B1 to assign a rank to each data point:
=RANK.AVG(A1,$A$1:$A$10,1)
This formula uses the RANK.AVG function to assign a rank to each data point in the dataset. The first
argument (A1) is the cell containing the first data point, the second argument ($A$1:$A$10) is the
range containing the entire dataset, and the third argument (1) specifies that the function should
rank the data in ascending order.
3. Press enter and then drag the formula down to apply it to the entire column B. The resulting values
in column B represent the rank of each data point in the dataset.
4. To find the value at a given percentile (e.g., the median), enter the following formula in
cell C1:
=PERCENTILE.INC($A$1:$A$10,0.5)
This formula uses the PERCENTILE.INC function to return the 50th percentile (i.e., the median)
of the dataset. The first argument ($A$1:$A$10) is the range containing the entire dataset, and the
second argument (0.5) specifies the percentile to be returned.
5. To calculate the percentile rank of a value that falls between two data points (e.g., the value
20, which falls between 18 and 22 in the dataset), enter the following formula in cell C2:
=(COUNTIF($A$1:$A$10,"<"&20)+0.5)/COUNT($A$1:$A$10)*100
This formula counts the data points that fall below the value 20 (here 3: the values 10, 15, and
18), adds 0.5 to account for the fact that the value falls between two data points, divides by the
total number of data points (which is 10), and then multiplies by 100 to convert the result to a percentage.
6. Press enter. To apply the formula to other values, replace the hard-coded 20 with a reference to the cell containing the value of interest and drag the formula down; the resulting values in column C represent the percentile rank of each value.
Note that there are different ways to calculate the percentile in Excel, and the formulas used may
vary depending on the specific requirements of your analysis.
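As a cross-check on Excel’s PERCENTILE.INC, Python’s standard library offers the same linear-interpolation convention through statistics.quantiles with method="inclusive". A quick sketch using the example dataset:

```python
# Excel's PERCENTILE.INC interpolates linearly between order statistics;
# statistics.quantiles(..., method="inclusive") follows the same convention,
# so it can be used to cross-check results computed in Excel.
import statistics

data = [10, 15, 18, 22, 25, 27, 30, 32, 35, 40]

median = statistics.quantiles(data, n=2, method="inclusive")[0]
quartiles = statistics.quantiles(data, n=4, method="inclusive")

print(median)     # 26.0, matching =PERCENTILE.INC(range, 0.5)
print(quartiles)  # [19.0, 26.0, 31.5] -> 25th, 50th, and 75th percentiles
```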
Sorting the data is a must before proceeding further, and this can easily be done with Excel’s
data-sorting feature. Ideally, the dataset should be arranged from the minimum to the maximum
value (smallest to largest).
Image showing Data entered
Image showing the Sort & Filter tab in Excel. This can be accessed under the Data tab.
Image showing the formula entered to calculate the RANK.AVG function
Image showing Rank of individual values filled up. This can be done using the autofill feature of
Excel
Image showing the formula to calculate the percentile entered
Another way of calculating rank and percentile in Excel is to use the Data Analysis feature, which
has a built-in Rank and Percentile function.
Step 1:
Data should be entered into an Excel spreadsheet.
Step 2:
Data Tab is clicked. This reveals the Data Analysis tab.
Step 3:
Data Analysis tab is clicked to open up the various functions under this category. In this window
Rank and Percentile is chosen. On clicking OK button Rank and Percentile window opens.
Image showing Data Analysis window with Rank and Percentile function selected
Image showing Rank and percentile window
In the Rank and Percentile window, the user should click on the input field. When the cursor starts
to blink, the cells containing the dataset to be analysed should be selected. The moment the
dataset is selected, the cell addresses appear in the input field. If the first row of the dataset contains a label, then the box before “First row contains label” should be checked.
Then the cursor is placed over the output range field, and the cells where the results need to be
displayed are selected; the same addresses appear in the field. If the user desires to display
the result in a separate worksheet, then that radio button needs to be selected. On clicking the OK
button, the result is displayed.
Image showing Rank and percentile result displayed.
19
Regression
Regression analysis is a statistical method used to examine the relationship between a dependent
variable and one or more independent variables. The goal of regression analysis is to determine how
much the independent variables influence the dependent variable and to use this information to make
predictions about the dependent variable.
There are two main types of regression analysis: simple linear regression and multiple linear regression. In simple linear regression, there is only one independent variable, and the relationship between
that variable and the dependent variable is modeled with a straight line. In multiple linear regression,
there are multiple independent variables, and the relationship between those variables and the dependent variable is modeled with a linear equation.
Regression analysis is commonly used in various fields, including economics, finance, engineering,
and social sciences, to model relationships between variables and make predictions about future outcomes.
Simple Linear Regression:
Simple linear regression is a statistical method used to analyze the relationship between two variables,
typically a dependent variable and an independent variable, where the relationship between the two
variables can be represented by a straight line. The goal of simple linear regression is to find the line
that best fits the data, so that the relationship between the two variables can be accurately modeled
and used to make predictions.
The basic equation for simple linear regression is:
y = b0 + b1*x
where y is the dependent variable, x is the independent variable, b0 is the y-intercept, and b1 is the
slope of the line.
To determine the values of b0 and b1, the regression analysis method uses a technique called least
squares regression. This involves finding the line that minimizes the sum of the squared differences
between the actual y-values and the predicted y-values based on the line.
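The least-squares estimates follow directly from the data: the slope is the sum of the products of the x- and y-deviations from their means divided by the sum of the squared x-deviations, and the intercept is then recovered from the means. A short Python sketch, using the advertising-spend and sales data from the Excel example later in this chapter:

```python
# Least-squares estimates for simple linear regression, y = b0 + b1*x,
# computed from the deviation-from-the-mean formulas.
x = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]            # advertising spend
y = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]  # sales

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# slope = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
b1_num = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
b1_den = sum((xi - x_mean) ** 2 for xi in x)
b1 = b1_num / b1_den
b0 = y_mean - b1 * x_mean        # intercept = ybar - slope * xbar

print(b0, b1)           # this fit is exact: b0 = 0.0, b1 = 10.0
print(b0 + b1 * 110)    # predicted sales for a spend of 110 -> 1100.0
```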
Simple linear regression can be used to model various relationships between two variables, such as the
relationship between a person’s height and weight, or the relationship between an employee’s experience and salary.
Simple linear regression can be used to analyze the relationship between two variables where one variable is dependent on the other variable, and where the relationship can be represented by a straight
line. Here are some scenarios where simple linear regression might be used:
1. Predicting sales based on advertising spend: A business might use simple linear regression to model
the relationship between their advertising spend and their sales. By analyzing the data, they can determine the impact of advertising on sales and use that information to make predictions about future
sales based on their advertising spend.
2. Examining the relationship between education and income: A researcher might use simple linear
regression to examine the relationship between a person’s level of education and their income. By analyzing the data, they can determine whether there is a correlation between education and income, and
if so, how strong the relationship is.
3. Analyzing the effect of temperature on crop yield: An agriculture scientist might use simple linear
regression to analyze the effect of temperature on crop yield. By analyzing the data, they can determine whether there is a correlation between temperature and crop yield, and if so, how strong the
relationship is.
4. Predicting employee performance based on experience: An HR professional might use simple linear
regression to model the relationship between an employee’s experience and their job performance. By
analyzing the data, they can determine whether there is a correlation between experience and performance, and use that information to make predictions about the performance of future employees
based on their experience level.
These are just a few examples of scenarios where simple linear regression might be used. In general,
simple linear regression can be a useful tool whenever there is a relationship between two variables
that can be modeled with a straight line.
Here is an example of sample data for performing simple linear regression using Excel:
Advertising Spend    Sales
10                   100
20                   200
30                   300
40                   400
50                   500
60                   600
70                   700
80                   800
90                   900
100                  1000
Here are the steps to perform simple linear regression using Excel:
1. Open Microsoft Excel and create a new workbook.
2. Enter the sample data into two columns, with the independent variable in one column and the
dependent variable in the other.
3. Select the data range by highlighting both columns.
4. Click on the “Insert” tab in the ribbon menu and then click on the “Scatter” chart type.
5. Excel will create a scatter plot of the data. Right-click on any data point and select “Add Trendline”.
6. In the “Add Trendline” dialog box, select the “Linear” trendline option and check the box that says
“Display equation on chart”.
7. Click on “Close” to close the dialog box.
8. The chart will now display the trendline equation and the R-squared value, which represents how
well the trendline fits the data.
9. To use the trendline equation to make predictions, enter a new value for the independent variable
into a cell in the worksheet.
10. In another cell, use the trendline equation to calculate the predicted value for the dependent variable based on the new independent variable value.
That’s it! By following these steps, you can use Excel to perform simple linear regression and make
predictions based on the relationship between two variables.
Image showing Data entered
Image showing the submenu under the Insert tab. Click on the Scatter chart type (purple circle). The
chart will be displayed as shown.
Image showing a data point selected and right-clicked. In the ensuing submenu, Add Trendline is chosen.
Image showing Linear Trendline selected
Regression can also be performed using the Data Analysis feature available in Excel. This feature requires
the user to install an add-in. The installation process of the Data Analysis add-in has been explained in
previous chapters.
Using the Data Analysis feature to plot regression:
In this process, the Data Analysis feature is used to perform regression analysis. The Data Analysis
tab is listed under the Data tab and becomes evident on clicking it. On clicking the Data
Analysis tab, the Data Analysis window opens up. In the Data Analysis window, Regression is chosen
and the OK button is clicked. This opens up the Regression Analysis window.
In the Regression Analysis window, the cursor is placed over the Input Y Range field. When the cursor
starts to blink, the cells containing the sales data are selected, including the column header. The
selected cell addresses are automatically entered into the Input Y Range field. Since the column headers are included in the selection, the box in front of Labels is checked.
The cursor is then placed in the Input X Range field and the mouse is clicked. When the cursor
starts to blink, the column containing Advertising Spend is selected, and the addresses of these
cells are automatically entered into the Input X Range field. The output options are selected next: the user has the option of selecting cells in the same sheet or creating a new sheet to
display the results. The result is displayed accordingly.
Image showing Regression Analysis window
Image showing summary output displayed
Image showing trend line
Multiple Linear Regression:
Multiple linear regression is a statistical method used to analyze the relationship between a dependent variable and two or more independent variables. It extends the idea of simple linear regression
to the case where there are multiple independent variables that may influence the dependent variable.
In multiple linear regression, the relationship between the dependent variable and the independent
variables is represented by a linear equation with multiple coefficients. The equation takes the form:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
where y is the dependent variable, x1, x2, ..., xn are the independent variables, and b0, b1, b2, ..., bn
are the coefficients that represent the relationship between the variables.
To determine the values of the coefficients, multiple linear regression uses a technique called least
squares regression, which finds the line that minimizes the sum of the squared differences between
the actual y-values and the predicted y-values based on the line.
Multiple linear regression can be used to model complex relationships between multiple variables,
and to make predictions based on those relationships. It is commonly used in various fields such as
economics, finance, and social sciences, where multiple factors may influence an outcome.
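The normal-equations approach behind least squares can be sketched in pure Python: form (XᵀX)b = Xᵀy and solve the small linear system. The dataset below is hypothetical, constructed so that the true coefficients are known exactly; the chapter’s own sample data cannot be used here, because its three predictors move in perfect lockstep and the normal equations become singular.

```python
# A pure-Python sketch of multiple linear regression: form the normal
# equations (X'X) b = X'y and solve them by Gaussian elimination.
# The data are hypothetical, constructed so that y = 2 + 3*x1 + 0.5*x2.

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit(rows, y):
    """Least-squares coefficients b0, b1, ..., bn for y = b0 + b1*x1 + ..."""
    X = [[1.0] + list(row) for row in rows]   # prepend the intercept column
    p = len(X[0])
    XtX = [[sum(xi[a] * xi[c] for xi in X) for c in range(p)] for a in range(p)]
    Xty = [sum(xi[a] * yi for xi, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)

rows = [(1, 10), (2, 8), (3, 15), (4, 4), (5, 20), (6, 12)]
y = [2 + 3 * x1 + 0.5 * x2 for x1, x2 in rows]
coeffs = fit(rows, y)
print([round(c, 6) for c in coeffs])   # recovers [2.0, 3.0, 0.5]
```

Excel’s Regression tool solves the same least-squares problem; this sketch only makes the underlying computation visible.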
Multiple linear regression can be used in a variety of scenarios where there are multiple independent
variables that may be influencing a dependent variable. Here are some examples:
1. Predicting real estate prices: Multiple linear regression can be used to model the relationship
between various factors, such as location, square footage, number of bedrooms and bathrooms, and
proximity to amenities, and the price of a home.
2. Analyzing customer satisfaction: Multiple linear regression can be used to analyze the relationship
between various factors, such as wait time, quality of service, and price, and overall customer satisfaction in a retail or service setting.
3. Predicting academic performance: Multiple linear regression can be used to model the relationship between various factors, such as attendance, study habits, and time management, and academic
performance in a college or university setting.
4. Forecasting stock prices: Multiple linear regression can be used to model the relationship between
various factors, such as earnings, dividends, interest rates, and economic indicators, and the price of
a stock.
These are just a few examples of scenarios where multiple linear regression might be used. In general, multiple linear regression can be a useful tool whenever there are multiple independent variables
that may be influencing a dependent variable, and when the relationships between those variables
can be modeled with a linear equation.
Here is an example of sample data for performing multiple linear regression analysis using Excel:
Sales    Advertising Spend    Store Size    Number of Employees
100      10                   1000          5
200      20                   1500          6
300      30                   2000          7
400      40                   2500          8
500      50                   3000          9
600      60                   3500          10
700      70                   4000          11
800      80                   4500          12
900      90                   5000          13
1000     100                  5500          14
Here are the steps to perform multiple linear regression using Excel:
1. Open Microsoft Excel and create a new workbook.
2. Enter the sample data into four columns, with the dependent variable (Sales) in one column and
the independent variables (Advertising Spend, Store Size, and Number of Employees) in the other
columns.
3. Select the data range by highlighting all four columns.
4. Click on the “Data” tab in the ribbon menu and then click on the “Data Analysis” button.
5. If you don’t see “Data Analysis” in the list, you may need to install it. To do this, click on “File” >
“Options” > “Add-ins” > “Manage: Excel Add-ins” > “Go” > check “Analysis ToolPak” and click “OK”.
6. In the “Data Analysis” dialog box, select “Regression” from the list and click “OK”.
7. In the “Regression” dialog box, enter the Input X Range for the independent variables (Advertising Spend, Store Size, and Number of Employees) and the Input Y Range for the dependent variable
(Sales).
8. Check the box next to “Labels” if your data has column labels.
9. Select the “Output Range” to determine where the regression analysis results will appear.
10. Check the boxes next to “Residuals” and “Line Fit Plots” to get additional regression diagnostics.
11. Click “OK” to run the regression analysis.
12. Excel will generate a new table with the regression coefficients, the standard error of the coefficients, the t-statistics, the p-values, and the R-squared value.
13. To interpret the results, examine the coefficients for each independent variable to see how they
relate to the dependent variable. For example, if the coefficient for Advertising Spend is positive and
statistically significant, it suggests that increasing advertising spend will lead to an increase in sales.
14. To use the multiple linear regression equation to make predictions, enter new values for the independent variables into a new row in the worksheet, and then use the regression equation to calculate
the predicted value for the dependent variable based on those values.
That’s it! By following these steps, you can use Excel to perform multiple linear regression and analyze the relationships between multiple independent variables and a dependent variable.
Image showing Result displayed
20
Sampling
In statistics, sampling refers to the process of selecting a subset of individuals or units from a larger
population. The purpose of sampling is to gather information about the population by studying a
smaller, more manageable group of individuals or units.
Sampling is commonly used in research to make inferences about a population based on the characteristics of a smaller sample. This is because it is usually impossible or impractical to study an entire
population. Sampling allows researchers to study a representative subset of the population and make
inferences about the larger population based on the results of the sample.
There are different types of sampling methods, including:
1. Random sampling: where individuals or units are selected randomly from the population, with each
individual or unit having an equal chance of being selected.
2. Stratified sampling: where the population is divided into strata (subgroups) based on a particular
characteristic, and individuals or units are randomly selected from each stratum.
3. Cluster sampling: where the population is divided into clusters (groups) and a random sample of
clusters is selected, with all individuals or units in the selected clusters being included in the sample.
4. Convenience sampling: where individuals or units are selected based on their availability or convenience, rather than randomly.
The type of sampling method used will depend on the research question, the characteristics of the
population, and the resources available. Proper sampling techniques are crucial for ensuring that the
sample is representative of the population and that the results of the study can be generalized to the
larger population.
Random Sampling:
Random sampling is a sampling technique where each member of the population has an equal chance
of being selected for the sample. It is a common method used in statistics to obtain a representative
sample from a larger population.
Performing Random Sampling using Excel:
To perform random sampling using Excel, you can use the RAND function and a formula that selects
random numbers within a range. Here are the steps to follow:
1. Open a new Excel worksheet and enter the list of population members in a column.
2. Decide on the sample size you want to select and enter that number in a cell (e.g., cell A1).
3. In the next column, enter the formula “=RAND()” in the first row of the column.
4. Copy the formula down the column to generate a random number for each population member.
5. Sort the data by the random number column by selecting the entire data range, clicking on “Data”
on the Excel ribbon, and then choosing “Sort”.
6. In the “Sort” dialog box, choose the column containing the random numbers as the sorting criteria
and select “Smallest to Largest” as the sort order.
7. Select the first n rows of the sorted data, where n is the sample size you want to select (e.g., if you
want a sample size of 50, select the first 50 rows).
8. The selected rows represent your random sample from the population.
Note that this method assumes that the population members are listed in a random order. If the population members are not listed in a random order, you may need to shuffle the list before generating the
random numbers. Additionally, if you have a large population, you may need to use a more efficient
method for generating random samples, such as a random number generator in a statistical software
package.
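The RAND-and-sort procedure above amounts to attaching a random key to each member and taking the first n rows after sorting. A minimal Python sketch (the student IDs are hypothetical):

```python
# A sketch of the RAND-and-sort random-sampling procedure: attach a random
# key to every population member, sort by the key, take the first n rows.
import random

random.seed(42)          # fixed seed so the sketch is repeatable

population = [f"Student {i}" for i in range(1, 101)]   # hypothetical IDs
sample_size = 20

keyed = sorted((random.random(), member) for member in population)  # the =RAND() column, sorted
sample = [member for _, member in keyed[:sample_size]]              # first 20 rows

print(len(sample))   # 20 distinct students, each equally likely to be chosen
```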
Example:
Here’s an example of sample data and the steps to sample the data using Excel:
Suppose we have a population of 100 students and we want to randomly select a sample of 20 students
for a survey. The population data is stored in a column named “Student ID” in cells A2:A101, and we
want to store the sample data in a new worksheet.
Here are the steps to randomly sample the data using Excel:
1. Open a new worksheet to hold the sample and enter the column header. In this example, we will
use “Student ID” as the only column.
2. In the population worksheet, enter the formula “=RAND()” in cell B2, next to the first Student
ID. This generates a random number between 0 and 1.
3. Copy the formula down to cell B101 so that each of the 100 students has a random number.
4. Sort the population by the random-number column. Select the range A2:B101, then click on the
“Data” tab on the Excel ribbon and choose “Sort”.
5. In the “Sort” dialog box, choose “Column B” as the sort-by criteria and select “Smallest to
Largest” as the sort order. Click “OK” to sort the data.
6. The first 20 rows of the sorted data represent your random sample. Copy the Student IDs from
these rows and paste them into the new worksheet to store your sample data.
7. You can now analyze the sample data to draw conclusions about the population.
Note: Excel’s RAND() function recalculates each time you make a change to the worksheet, so the
random numbers will change each time you make a change to the worksheet. If you want to keep the
same random numbers, you can copy and paste the column of random numbers as values, using the
“Paste Special” feature.
Image showing use of sort function in Excel
Using Data Analysis tool of Excel for sampling data:
The Data Analysis tab under the Data tab is clicked. It opens up the Data Analysis window.
Image showing Data Analysis window open where Sampling is chosen
On clicking the OK button, the Sampling window opens up.
Image showing the sampling window
In the Input Range field, the cursor is placed and the mouse is clicked. When the cursor starts to
blink, the column that contains the data is selected. As soon as the column is selected, the
corresponding cell addresses are automatically entered into the Input Range field. If the first cell
contains a label and it has been included in the selection, then the box in front of Labels should be checked.
The next field is the sampling method. If the user desires random selection, then the Random radio
button should be selected. If the user desires periodic sampling, then the Periodic radio button is selected; this opens up a field where the user can specify a number. If
the user specifies 3, then every third element in the data set will be included in the sample.
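The periodic option amounts to systematic sampling, which is easy to sketch in Python (the data set here is hypothetical):

```python
# Periodic (systematic) sampling as described above: with a period of 3,
# every third element of the data set is included in the sample.
data = list(range(1, 21))          # hypothetical data set: 1..20
period = 3
sample = data[period - 1::period]  # elements at positions 3, 6, 9, ...
print(sample)                      # [3, 6, 9, 12, 15, 18]
```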
The next field is the number of samples, where the user can specify how many samples need to be taken.
In the output options portion of the Sampling window, the user has the choice of displaying the sampled data in the same worksheet or in a different worksheet.
Image showing five randomly selected values from the data set displayed in a separate spreadsheet.
Sampling errors:
Sampling errors are errors that occur when a sample of data is used to make inferences about a larger
population. These errors arise due to the fact that a sample is only a subset of the population and
therefore may not perfectly represent the population.
Sampling errors can arise due to a variety of factors, including the size of the sample, the method
used to select the sample, and the variability of the population. For example, if a small sample size
is used, the sample may not be representative of the population, and therefore the inferences drawn
from the sample may be inaccurate. Similarly, if a biased sampling method is used, such as only sampling from a particular region or demographic group, the sample may not accurately represent the
population as a whole.
It’s important to note that sampling errors are different from non-sampling errors, which are errors
that arise from factors other than the sampling process, such as errors in measurement or data entry.
How to reduce sampling errors?
To reduce sampling errors, researchers can take several steps, including:
1. Increasing the sample size: A larger sample size will typically provide a more representative sample
of the population, and reduce the margin of error in estimates.
2. Using a random sampling method: Random sampling methods, such as simple random sampling
or stratified random sampling, can help ensure that every member of the population has an equal
chance of being selected for the sample.
3. Using a diverse sample: To ensure that the sample accurately reflects the population, it’s important
to use a sample that is diverse in terms of age, gender, race, and other relevant variables.
4. Reducing non-response bias: Non-response bias occurs when certain members of the population
are less likely to respond to the survey, and can lead to an unrepresentative sample. Researchers can
reduce non-response bias by using strategies such as offering incentives or following up with non-respondents.
5. Conducting a pilot study: A pilot study can help identify potential issues with the sampling method or survey instrument before conducting the main study.
By taking these steps, researchers can reduce the likelihood of sampling errors and improve the accuracy of their estimates about the population.
Performing stratified sampling using Excel involves several steps:
1. Determine the population: Define the population you want to sample from and identify the relevant strata. Strata are groups within the population that share similar characteristics.
2. Determine the sample size: Determine the desired sample size for each stratum based on the proportion of the population that each stratum represents.
3. Create a data table: Create a table in Excel that includes the population data and the stratum labels.
4. Calculate the stratum size: Calculate the size of each stratum by multiplying the population size by
the proportion of the population that each stratum represents.
5. Randomly select samples: Use the “RAND” function in Excel to generate a random number for
each row of data. Sort the data table by the random number column and select the desired number of
rows for each stratum.
6. Calculate sampling weights: Calculate the sampling weights for each stratum by dividing the desired sample size by the actual sample size for each stratum.
7. Calculate estimates: Calculate estimates for the population by weighting the data for each stratum
based on the sampling weights and aggregating the results.
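The steps above can be sketched in Python: group members by stratum, then draw the per-stratum sample size at random within each stratum. The department sizes and the 20% sampling fraction follow the worked example below; everything else is illustrative.

```python
# Stratified sampling: group the population by stratum and draw a random
# sample within each stratum. Sizes follow the Sales/Marketing/Finance
# example (200, 300, 500 employees; 20% sampled from each department).
import random

random.seed(1)       # fixed seed so the sketch is repeatable

population = ([("Sales", i) for i in range(200)]
              + [("Marketing", i) for i in range(300)]
              + [("Finance", i) for i in range(500)])

fraction = 0.20
strata = {}
for dept, emp in population:                  # group members by stratum
    strata.setdefault(dept, []).append(emp)

sample = {dept: random.sample(members, round(len(members) * fraction))
          for dept, members in strata.items()}   # 20% drawn per stratum

print({dept: len(chosen) for dept, chosen in sample.items()})
# {'Sales': 40, 'Marketing': 60, 'Finance': 100}
```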
Here’s an example of how to perform stratified sampling using Excel:
Suppose you want to conduct a survey on the opinions of employees in a company with 3 departments: Sales, Marketing, and Finance. The population of each department is 200, 300, and 500,
respectively. You want to sample 20% of employees from each department.
1. Define the population: The population is the employees in the company, and the relevant strata are
the Sales, Marketing, and Finance departments.
2. Determine the sample size: The desired sample size for each stratum is 20% of the population size
for each department: 40 employees for Sales, 60 employees for Marketing, and 100 employees for
Finance.
3. Create a data table: Create a table in Excel with columns for employee ID, department, and survey
response.
4. Calculate the stratum size: Calculate the size of each stratum by multiplying the total population
size by the proportion of the population that each stratum represents: 200 for Sales, 300 for
Marketing, and 500 for Finance. The 20% sample targets are then 40, 60, and 100 employees, respectively.
5. Randomly select samples: Use the “RAND” function in Excel to generate a random number for
each row of data. Sort the data table by the random number column and select the desired number of
rows for each stratum: 40 rows for Sales, 60 rows for Marketing, and 100 rows for Finance.
6. Calculate sampling weights: Calculate the sampling weights for each stratum by dividing the desired sample size by the actual sample size for each stratum. If each stratum yields its full target
(40, 60, and 100 responses), every weight is 1.0; the weights differ from 1.0 only when the achieved sample falls short of or exceeds the target.
7. Calculate estimates: Calculate estimates for the population by weighting the data for each stratum
based on the sampling weights and aggregating the results. For example, to calculate the average
opinion of employees in the company, you would calculate the weighted average of the survey responses, where the weights are the sampling weights for each stratum.
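The weighted estimate in step 7 can be sketched as follows; the per-stratum mean survey scores here are hypothetical:

```python
# Weighted population estimate from a stratified sample: weight each
# stratum's mean by that stratum's share of the population.
pop_sizes = {"Sales": 200, "Marketing": 300, "Finance": 500}
stratum_means = {"Sales": 4.2, "Marketing": 3.8, "Finance": 3.5}  # hypothetical mean scores

total = sum(pop_sizes.values())
estimate = sum(pop_sizes[s] / total * stratum_means[s] for s in pop_sizes)
print(round(estimate, 3))   # (200*4.2 + 300*3.8 + 500*3.5) / 1000 = 3.73
```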
Here is a sample data set for stratified sampling:
ID    Gender    Age    Stratum
1     Male      30     A
2     Female    35     A
3     Male      40     A
4     Female    45     A
5     Male      50     B
6     Female    55     B
7     Male      60     B
8     Female    65     B
9     Male      70     C
10    Female    75     C
11    Male      80     C
12    Female    85     C
In this example, we have a population of 12 individuals. The population is stratified into 3 strata
based on age, with Stratum A representing individuals between 30 and 45 years old, Stratum B representing individuals between 50 and 65 years old, and Stratum C representing individuals between
70 and 85 years old. We want to select a sample of 6 individuals from the population using stratified
sampling.
To perform stratified sampling using Excel, follow these steps:
1. Determine the sample size for each stratum: In this example, we want to sample 2 individuals from
each stratum, since we want a total sample size of 6.
2. Create a new column for the stratum weights: In Excel, add a new column next to the Stratum
column and label it “Stratum Weight”. Enter the desired sample size for each stratum in this column.
In this example, enter 2 for each row in the Stratum Weight column.
3. Create a new column for the random number: Add a new column next to the Stratum Weight column and label it “Random Number”. Use the RAND function to generate a random number for each
row in this column. To do this, enter “=RAND()” in the first cell of the Random Number column
and drag it down to generate a random number for each row.
Mastering Statistical Analysis with Excel
4. Sort the data by stratum and random number: Sort the data by the Stratum column first, and then
by the Random Number column. To do this, select the entire data set and go to the “Data” tab in
the ribbon, then click “Sort”. Select “Stratum” as the first sort column and “Random Number” as the
second sort column.
5. Select the sample: Select the desired number of rows for each stratum based on the sample size for
each stratum. To do this, simply select the top 2 rows for Stratum A, the top 2 rows for Stratum B,
and the top 2 rows for Stratum C.
In this example, the selected sample would be:
ID   Gender   Age   Stratum   Stratum Weight   Random Number
1    Male     30    A         2                0.319557373
2    Female   35    A         2                0.523163533
5    Male     50    B         2                0.548813503
6    Female   55    B         2                0.715189366
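The select-the-top-rows-per-stratum procedure above can also be expressed in Python. This is a sketch of the same logic (a stand-in for Excel's RAND-and-sort steps, not a reproduction of Excel's random numbers):

```python
import random

# (ID, Gender, Age, Stratum) rows from the example population
rows = [
    (1, "Male", 30, "A"), (2, "Female", 35, "A"), (3, "Male", 40, "A"),
    (4, "Female", 45, "A"), (5, "Male", 50, "B"), (6, "Female", 55, "B"),
    (7, "Male", 60, "B"), (8, "Female", 65, "B"), (9, "Male", 70, "C"),
    (10, "Female", 75, "C"), (11, "Male", 80, "C"), (12, "Female", 85, "C"),
]

def stratified_sample(rows, per_stratum=2, seed=42):
    rng = random.Random(seed)
    # Attach a random number to each row (the =RAND() column)
    keyed = [(rng.random(), row) for row in rows]
    sample = []
    for stratum in sorted({row[3] for row in rows}):
        # Sort within each stratum by the random number and keep the top rows
        in_stratum = sorted(k for k in keyed if k[1][3] == stratum)
        sample.extend(row for _, row in in_stratum[:per_stratum])
    return sample

sample = stratified_sample(rows)
print(sample)  # two randomly chosen rows from each of strata A, B, C
```

Whatever the random numbers turn out to be, the result always contains exactly two rows per stratum, which is the point of stratified sampling.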
Image showing data entered
Image showing the Stratum Weight column filled and a Random Number column created with the formula entered
Image showing the Random Number column filled using the autofill feature by dragging the autofill handle
Image showing the data sorted and stratified
Cluster sampling:
Cluster sampling is a type of sampling method in which the population is divided into clusters, and
a random sample of clusters is selected for analysis. Then, data is collected from all individuals in the
selected clusters. This method is often used when the population is large and widely dispersed, making it impractical or impossible to sample each individual in the population.
In cluster sampling, the clusters are usually formed based on some natural grouping or geographical
location. For example, if a researcher wants to study the prevalence of a particular disease in a country, they may divide the country into regions and randomly select a few regions for sampling. Then,
they would collect data on the disease from all individuals in the selected regions.
Cluster sampling can be used in various fields, including social sciences, epidemiology, and market
research. It is often used in situations where it is difficult or impractical to obtain a complete list of
the population or to sample each individual. Cluster sampling can also be more cost-effective than
other sampling methods, as it requires fewer resources to sample clusters rather than individuals.
One potential drawback of cluster sampling is that it can introduce additional sources of variation in
the sample, as individuals within the same cluster may be more similar to each other than individuals in different clusters. Therefore, it is important to carefully select clusters that are representative of
the population and to account for the clustering effect in the analysis of the data.
Here is an example of how to perform cluster sampling using Excel:
1. Identify the population: In this example, let’s assume the population is all students in a university.
2. Define the clusters: The clusters could be the various departments within the university, such as
engineering, business, humanities, etc.
3. Determine the sample size: The sample size depends on the desired level of precision and confidence. Let’s assume a sample size of 100.
4. Randomly select clusters: Using Excel’s random number generator function, select 10 clusters from
the list of departments.
5. Sample all individuals within the selected clusters: Once the clusters have been selected, sample all
individuals within those clusters. For example, if the engineering department is selected, sample all
students within that department.
To perform these steps in Excel, follow these instructions:
1. Create a list of all clusters: In this case, list all the departments in the university in one column.
2. Use Excel’s random number generator to select clusters: In a separate column, use the RAND()
function to generate a random number for each department. Then, use the RANK() function to rank
the departments from smallest to largest random number. Finally, select the top 10 departments
from the list.
3. Sample all individuals within selected clusters: Once the 10 clusters have been selected, sample all
individuals within those clusters. This can be done manually or by using the random number generator function again.
4. Analyze the data: Once the sample has been collected, analyze the data as appropriate for the research question or hypothesis.
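The cluster-sampling steps above can be sketched in Python as well. The department names and student rosters below are hypothetical placeholders; the RAND-and-RANK step is mirrored by attaching a random number to each cluster and keeping the top ten:

```python
import random

# Hypothetical list of departments (clusters) in the university
departments = [f"Dept {i}" for i in range(1, 26)]  # 25 departments

# Hypothetical roster: every student in every department
students = {d: [f"{d}-student{j}" for j in range(1, 31)] for d in departments}

def cluster_sample(departments, students, n_clusters=10, seed=1):
    rng = random.Random(seed)
    # Attach a random number to each department, rank, and keep the top n
    keyed = sorted((rng.random(), d) for d in departments)
    selected = [d for _, d in keyed[:n_clusters]]
    # Sample ALL individuals within each selected cluster
    return [s for d in selected for s in students[d]]

sample = cluster_sample(departments, students)
print(len(sample))  # 10 clusters x 30 students = 300
```

Note that the sample size is determined by which clusters are drawn, since every member of a selected cluster is included.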
Note that there are other sampling techniques available, and the appropriate technique depends on
the research question and the population being studied.
Convenience sampling:
Convenience sampling is a non-probability sampling technique where participants are selected based
on their availability, accessibility, and willingness to participate in a study. In convenience sampling,
the researcher selects participants who are easy to reach or who happen to be in the right place at the
right time, such as those who are nearby or those who are friends or acquaintances of the researcher.
Convenience sampling is commonly used in exploratory or preliminary studies where the goal is to
gather initial data quickly and inexpensively. However, convenience sampling is not a representative
sampling method, and the sample may not be representative of the larger population. Therefore, the
results of a convenience sample cannot be generalized to the larger population, and the findings may
be biased and unreliable.
Convenience sampling is generally not considered a rigorous sampling method, and it is not recommended for studies where generalization to a larger population is important. Other probability
sampling methods, such as simple random sampling or stratified sampling, are preferred when generalization is required.
21
T-Test: Two sample assuming equal variances
A t-test is a statistical hypothesis test used to determine whether there is a significant difference between the means of two groups. It is particularly useful when the sample size is small (typically less than 30) or when the population standard deviation is unknown.
The t-test: Two sample assuming equal variances is a type of t-test used when the variances of the two
groups being compared are assumed to be equal. This assumption is known as the homogeneity of
variance assumption. When the variances are equal, the t-test uses a pooled variance estimate, which
is the weighted average of the two sample variances, to calculate the test statistic.
The formula for the t-test: Two sample assuming equal variances is:
t = (x1 - x2) / (s * sqrt(2/n))
where:
. t is the test statistic
. x1 and x2 are the sample means of the two groups being compared
. s is the pooled standard deviation
. n is the sample size of each group (this form of the formula assumes equal sample sizes)
To perform a t-test: Two sample assuming equal variances in Excel, you can use the TTEST function.
The syntax for the TTEST function is:
=TTEST(array1,array2,tails,type)
where:
. array1 is the first group of data
. array2 is the second group of data
. tails is the number of tails for the test (1 or 2)
. type is the type of t-test (1 for paired data, 2 for two-sample equal variance, and 3 for two-sample
unequal variance)
The TTEST function returns the p-value: the probability of observing a difference in means at least as large as the one in the data if the two group means were in fact equal. If the p-value is less than the chosen significance level (usually 0.05), then the null hypothesis (that the means are equal) is rejected in favor of the alternative hypothesis (that the means are different).
Indications:
The T-Test: Two sample assuming equal variances is used when there are two groups being compared,
and the researcher assumes that the variances of the two groups are equal. This assumption means that
the two groups being compared have the same variability or spread of scores. The test is used to determine if there is a significant difference between the means of the two groups.
The T-Test: Two sample assuming equal variances can be used in a variety of situations, such as:
1. Clinical trials: The test can be used to determine if there is a significant difference between the effectiveness of two treatments or interventions.
2. Business research: The test can be used to determine if there is a significant difference in performance or productivity between two groups of employees, or if there is a significant difference in customer satisfaction between two products.
3. Educational research: The test can be used to determine if there is a significant difference in test
scores between two groups of students or if there is a significant difference in the effectiveness of two
teaching methods.
4. Social research: The test can be used to determine if there is a significant difference in attitudes or
opinions between two groups of people.
In general, the T-Test: Two sample assuming equal variances is used when there are two groups being
compared, and the researcher assumes that the variances of the two groups are equal. If the assumption of equal variances is violated, then a different test, such as Welch's t-test or the Mann-Whitney U test, may be more appropriate.
Sample data:
Let’s consider the following example to illustrate the use of T-Test: Two sample assuming equal variances:
Suppose we want to determine if there is a significant difference in the mean test scores of two groups
of students, Group A and Group B. We randomly select 10 students from each group and record their
test scores as follows:
Group A: 78, 83, 70, 76, 85, 72, 90, 79, 88, 80
Group B: 75, 82, 68, 73, 81, 69, 85, 76, 84, 77
Steps to perform T-Test: Two sample assuming equal variances in Excel:
1. Enter the data into two separate columns in an Excel worksheet.
2. Calculate the mean and standard deviation of each group using the AVERAGE and STDEV.S functions. In our example, the mean and standard deviation of Group A are 80.1 and 6.52, respectively, and the mean and standard deviation of Group B are 77.0 and 5.96, respectively.
3. Calculate the pooled standard deviation using the following formula:
s = sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
where:
. s is the pooled standard deviation
. n1 and n2 are the sample sizes of the two groups
. s1 and s2 are the sample standard deviations of the two groups
In our example, the pooled standard deviation is:
s = sqrt(((10 - 1) * 6.52^2 + (10 - 1) * 5.96^2) / (10 + 10 - 2)) = 6.25
4. Use the TTEST function to calculate the test statistic and p-value. The syntax for the TTEST function is:
=TTEST(array1,array2,tails,type)
In our example, the formula would be:
=TTEST(A2:A11,B2:B11,2,2)
where:
. A2:A11 is the range of data for Group A
. B2:B11 is the range of data for Group B
. 2 indicates a two-tailed test
. 2 indicates a T-Test: Two sample assuming equal variances
The TTEST function returns a p-value of approximately 0.28, which is greater than the commonly used significance level of 0.05.
5. Interpret the results. In our example, since the p-value is greater than 0.05, we fail to reject the null hypothesis and conclude that there is no statistically significant difference in the mean test scores of Group A and Group B.
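For readers who want to check the arithmetic outside Excel, the same pooled-variance t-statistic can be computed with a short Python sketch using only the standard library:

```python
from math import sqrt
from statistics import mean, stdev  # stdev is the sample SD, like STDEV.S

group_a = [78, 83, 70, 76, 85, 72, 90, 79, 88, 80]
group_b = [75, 82, 68, 73, 81, 69, 85, 76, 84, 77]

n1, n2 = len(group_a), len(group_b)
x1, x2 = mean(group_a), mean(group_b)
s1, s2 = stdev(group_a), stdev(group_b)

# Pooled standard deviation: weighted average of the two sample variances
s_pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# Equivalent to t = (x1 - x2) / (s * sqrt(2/n)) when n1 == n2
t = (x1 - x2) / (s_pooled * sqrt(1 / n1 + 1 / n2))

print(round(x2, 1), round(s_pooled, 2), round(t, 2))
```

Here t evaluates to about 1.11 with 18 degrees of freedom; since the two-tailed 5% critical value for 18 degrees of freedom is about 2.10, this t-statistic is not significant at the 0.05 level.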
Image showing Data entered into columns
Image showing Average of first column calculated by entering the formula
Image showing average for both columns calculated
Image showing formula for calculating standard deviation of Group A entered
Image showing standard deviation value of Group A displayed on pressing Enter Key
Image showing Standard deviation of Both groups displayed
Image showing formula for calculating T test entered
Image showing T test result displayed (highlighted in green)
The same calculation can also be performed using Excel's Data Analysis tool, which appears under the Data tab. Clicking Data Analysis opens the Data Analysis window; in this window, choose "t-Test: Two-Sample Assuming Equal Variances" and click OK. This opens the t-Test: Two-Sample Assuming Equal Variances window.
Click in the Variable 1 Range field and, when the cursor starts to blink, select the first column (Group A) including its header. The cell addresses are entered into the field automatically.
Next, click in the Variable 2 Range field and select the second column containing the Group B values, again including the header. The cell addresses are entered into the field on selection. Because the labels were included in the selections, tick the box in front of Labels. Under the output options, check the New Worksheet radio button. Clicking OK displays the result.
Image showing T test window
Image showing result displayed
22
T-Test: Paired two sample for
means
A paired two-sample t-test is a statistical test used to compare the means of two dependent or related samples. In this test, the same group of subjects or items is tested twice, and the results are compared to determine whether there is a significant difference between the means of the two sets of measurements.
For example, suppose a researcher is interested in comparing the effectiveness of two different medications for treating a particular condition. Rather than assigning different groups of subjects to receive
each medication, the researcher could use a paired two-sample t-test to compare the effects of the two
medications on the same group of subjects.
The test is called “paired” because each observation in one sample is matched with a corresponding
observation in the other sample. The matching is done to ensure that any differences observed between the two samples are not due to individual differences in the subjects, but rather to the treatment
being compared.
The paired two-sample t-test assumes that the differences between the paired observations are normally distributed. If this assumption is met, the t-test can be used to determine whether the difference between the means of the two samples is statistically significant.
The paired two-sample t-test is used to compare the means of two related groups or samples. It is
typically used when the samples are paired, meaning that each observation in one sample is uniquely paired with a corresponding observation in the other sample. Here are some examples of when a
paired two-sample t-test might be appropriate:
1. Before and after measurements: If you measure a group of individuals before and after a treatment,
you can use a paired two-sample t-test to determine whether the treatment had a significant effect.
2. Matched pairs: If you have two groups of individuals that are matched on a particular variable (e.g.,
age, gender, or BMI), you can use a paired two-sample t-test to compare their means on another variable of interest.
3. Repeated measures: If you measure the same individuals multiple times over a period of time, you
can use a paired two-sample t-test to determine whether there is a significant change in the variable of
interest over time.
In general, a paired two-sample t-test is appropriate when you want to compare the means of two
related groups or samples and you have reason to believe that the differences between the two groups
are normally distributed. Additionally, because the observations are paired, the two samples are necessarily the same size, and the data should be continuous or at least approximately normally distributed.
Here’s an example of a paired two-sample t-test:
Suppose you are conducting a study to determine whether a new exercise program is effective at reducing blood pressure. You recruit 20 participants with high blood pressure and measure their blood
pressure before and after the exercise program. The data is shown below:
Participant   Before Exercise   After Exercise   Difference
1             140               130              -10
2             132               126              -6
3             145               138              -7
4             160               155              -5
5             148               145              -3
6             152               148              -4
7             136               132              -4
8             142               135              -7
9             130               124              -6
10            144               139              -5
11            158               154              -4
12            147               141              -6
13            138               135              -3
14            146               142              -4
15            151               148              -3
16            136               132              -4
17            145               139              -6
18            132               127              -5
19            139               133              -6
20            158               153              -5
To perform a paired two-sample t-test on this data using Excel, follow these steps:
Enter the data into Excel in two columns, one for the before exercise measurements and one for the
after exercise measurements. Be sure to include a column for the difference between the two measurements.
Calculate the mean and standard deviation of the differences in blood pressure. To do this, use the
AVERAGE and STDEV.S functions in Excel.
Calculate the t-statistic using the formula: t = (mean difference) / (standard deviation of the differences / sqrt(n)), where n is the number of pairs of measurements.
Determine the degrees of freedom (df) using the formula: df = n - 1.
Determine the p-value using Excel's T.DIST.2T function on the absolute value of the t-statistic. This returns the two-tailed probability of observing a t-value as extreme as the one calculated in step 3, given the degrees of freedom calculated in step 4.
Finally, interpret the results. If the p-value is less than your chosen level of significance (e.g., 0.05),
then you can reject the null hypothesis and conclude that the exercise program had a significant
effect on blood pressure.
Note that Excel also has a built-in function for performing paired two-sample t-tests called T.TEST. It takes four arguments: the two data ranges, the number of tails, and the test type (use type 1 for a paired test). However, it is still important to understand the underlying calculations and formulas to ensure that you are using the correct statistical test for your data.
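The manual calculation described above can be mirrored in a short Python sketch using the blood-pressure data from the table:

```python
from math import sqrt
from statistics import mean, stdev

before = [140, 132, 145, 160, 148, 152, 136, 142, 130, 144,
          158, 147, 138, 146, 151, 136, 145, 132, 139, 158]
after = [130, 126, 138, 155, 145, 148, 132, 135, 124, 139,
         154, 141, 135, 142, 148, 132, 139, 127, 133, 153]

# Differences (after - before), as in the table above
diffs = [a - b for a, b in zip(after, before)]

n = len(diffs)
d_bar = mean(diffs)            # mean difference
s_d = stdev(diffs)             # standard deviation of the differences
t = d_bar / (s_d / sqrt(n))    # paired t-statistic
df = n - 1

print(round(d_bar, 2), round(t, 2), df)
```

The t-statistic comes out to about -13.6 with 19 degrees of freedom, far beyond any conventional critical value, so the reduction in blood pressure is statistically significant.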
Image showing data entered
Image showing T calculated
23
T-Test: Two-sample assuming
Unequal variances
A t-test is a statistical hypothesis test used to compare the means of two groups or samples. Specifically, a two-sample t-test assuming unequal variances is used when the variances of the two groups are
assumed to be different.
In this type of t-test, we first calculate the sample means and standard deviations for both groups, and
then calculate the t-statistic by taking the difference between the two means and dividing it by a measure of the variability of the data, known as the standard error. The standard error takes into account
both the sample sizes and the sample variances of the two groups.
The t-statistic is then compared to a critical value from a t-distribution. A conservative choice of degrees of freedom is the smaller of the two sample sizes minus one (Excel's Data Analysis tool uses the more precise Welch-Satterthwaite approximation). If the t-statistic is larger than the critical value, we reject the null hypothesis that the means of the two groups are equal, and conclude that there is a significant difference between the means of the two groups. Otherwise, we fail to reject the null hypothesis.
The two-sample t-test assuming unequal variances is a commonly used statistical test in many fields,
including social sciences, business, and engineering, among others.
Indications:
The two-sample t-test assuming unequal variances is typically used when we have two independent
samples and we want to compare their means. Specifically, it is used when:
1. The two samples are independent of each other, meaning that there is no relationship between the
individuals in one sample and the individuals in the other sample.
2. The populations from which the samples are drawn are normally distributed, or the sample sizes are
sufficiently large (i.e., greater than 30) so that the Central Limit Theorem applies.
3. The variances of the two populations are not assumed to be equal. This is a key difference from the
two-sample t-test assuming equal variances, which assumes that the variances of the two populations
are equal.
4. The data are measured on at least an interval scale, meaning that the differences between values
have meaning and the scale has a meaningful zero point.
The two-sample t-test assuming unequal variances is commonly used in research studies to test hypotheses about the differences between two groups on a continuous outcome variable. For example, it
may be used to compare the mean test scores of students from two different schools or to compare the
mean blood pressure levels of patients in two different treatment groups.
Sample data for using T-Test: Two-sample assuming Unequal variances:
Suppose we want to compare the average weight of apples produced by two different orchards. We
collect a random sample of 10 apples from each orchard and weigh them. The data are as follows:
Orchard A: 160, 170, 175, 155, 165, 180, 170, 185, 165, 175
Orchard B: 150, 155, 165, 145, 170, 155, 165, 160, 175, 170
To perform a two-sample t-test assuming unequal variances using Excel, we can follow these steps:
1. Enter the data into two separate columns in an Excel worksheet.
2. Calculate the sample means and standard deviations for each group using the AVERAGE and STDEV.S functions.
3. Calculate the degrees of freedom for the t-test. A conservative approximation is the smaller of the two sample sizes minus one; the Welch-Satterthwaite formula gives a more precise value.
4. Calculate the standard error of the difference between the means using the formula: sqrt[(s1^2/n1)
+ (s2^2/n2)], where s1 and s2 are the sample standard deviations, and n1 and n2 are the sample sizes.
5. Calculate the t-statistic using the formula: (x1 - x2) / SE, where x1 and x2 are the sample means, and
SE is the standard error of the difference between the means.
6. Calculate the p-value for the t-statistic using the T.DIST.2T function in Excel. This function calculates the probability of getting a t-value as extreme or more extreme than the observed t-value assuming a two-tailed distribution.
7. Finally, compare the p-value to the level of significance (e.g., α = 0.05) to determine whether to
reject or fail to reject the null hypothesis that the means of the two groups are equal.
Note that Excel also has a built-in function for performing a two-sample t-test assuming unequal variances, the T.TEST function. It takes the two data ranges, the number of tails, and the test type (use type 3 for two-sample unequal variance), and returns the p-value for the test.
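The same steps can be sketched in Python; this mirrors the manual calculation, including the Welch-Satterthwaite degrees of freedom that Excel's Data Analysis tool reports:

```python
from math import sqrt
from statistics import mean, stdev

orchard_a = [160, 170, 175, 155, 165, 180, 170, 185, 165, 175]
orchard_b = [150, 155, 165, 145, 170, 155, 165, 160, 175, 170]

n1, n2 = len(orchard_a), len(orchard_b)
x1, x2 = mean(orchard_a), mean(orchard_b)
v1, v2 = stdev(orchard_a) ** 2, stdev(orchard_b) ** 2  # sample variances

# Standard error of the difference: sqrt(s1^2/n1 + s2^2/n2)
se = sqrt(v1 / n1 + v2 / n2)
t = (x1 - x2) / se

# Welch-Satterthwaite degrees of freedom
df = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

print(round(t, 2), round(df, 1))
```

Here t is about 2.14 with roughly 17.9 degrees of freedom; the two-tailed 5% critical value is about 2.10, so the difference in mean apple weight is (narrowly) significant at the 0.05 level.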
Image showing data entered
Image showing t Test Two sample assuming unequal variances
Image showing the result displayed
24
z-Test: Two Sample for Means
The z-test for two sample means is a statistical hypothesis test that is used to compare the means of
two independent samples to determine whether they come from the same population or different
populations. It is based on the standard normal distribution and assumes that the population variances
are known.
The test works by calculating the difference between the means of the two samples and dividing it by
the standard error of the mean. This gives a z-score which is compared to the critical value from the
standard normal distribution. If the z-score is larger than the critical value, then we reject the null hypothesis that the means are the same and conclude that the samples come from different populations.
The null hypothesis in this test is that there is no significant difference between the means of the two
populations, while the alternative hypothesis is that there is a significant difference between them. The
level of significance is typically set at 0.05 or 0.01.
The z-test for two sample means is commonly used in business, economics, and social sciences to
compare the means of two groups or populations, for example, to determine whether a new marketing strategy has a significant impact on sales or to compare the effectiveness of two different medical
treatments.
A scenario where the z-test for two sample means can be used is in a clinical trial to compare the effectiveness of two different medications for a particular health condition.
Suppose that there are two groups of patients, one group receives medication A and the other group
receives medication B. The mean improvement in the health condition for each group is measured
after a specified time period, and we want to determine whether there is a significant difference in the
effectiveness of the two medications.
To conduct the z-test for two sample means, we would first calculate the difference between the mean
improvement in the health condition for the two groups. We would then calculate the standard error of the mean, assuming that the population variances are known. Finally, we would calculate the
z-score and compare it to the critical value from the standard normal distribution at a specified level
of significance, such as 0.05.
If the calculated z-score is larger than the critical value, we would reject the null hypothesis that the
mean improvement in the health condition for the two groups is the same, and conclude that there is a
significant difference in the effectiveness of the two medications. This information can then be used to
make a decision on which medication is more effective and should be recommended to patients.
Here’s a sample dataset that you can use to perform a two-sample Z-test for means:
Group 1: {2, 5, 8, 12, 15}
Group 2: {6, 9, 11, 13, 17}
Assuming that these two groups are independent and that the population standard deviations are known (here approximated by the sample standard deviations for illustration), you can use a two-sample z-test to determine whether the means of these two groups are significantly different from each other.
Here are the steps to perform the test in Excel:
1. Enter the data for both groups into separate columns in an Excel worksheet.
2. Calculate the mean, standard deviation, and sample size for each group using the appropriate Excel
formulas.
3. Calculate the difference between the means of the two groups.
4. Calculate the standard error of the difference using the formula:
Standard Error = SQRT((S1^2 / n1) + (S2^2 / n2))
where S1 and S2 are the sample standard deviations of Group 1 and Group 2, and n1 and n2 are the sample sizes of Group 1 and Group 2, respectively.
5. Calculate the Z-score using the formula:
Z = (X1 - X2) / SE
where X1 and X2 are the means of Group 1 and Group 2, respectively, and SE is the standard error of the difference calculated in step 4.
6. Determine the p-value associated with the Z-score using the standard normal distribution. For a two-tailed test, you can use the Excel formula =2*(1-NORM.S.DIST(ABS(z),TRUE)).
7. Determine the level of significance (alpha) for the test. This is typically set to 0.05.
8. Compare the p-value to the level of significance. If the p-value is less than alpha, then the difference between the means of the two groups is statistically significant. If the p-value is greater than
alpha, then there is no significant difference between the means of the two groups.
Note that these steps assume that the data is normally distributed and that the sample sizes are sufficiently large (typically n > 30). If the data is not normally distributed or the sample sizes are small,
then a different test may be more appropriate.
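The z-test steps can be mirrored in a short Python sketch (treating the sample standard deviations as stand-ins for the population values, as in the Excel walkthrough):

```python
from math import erf, sqrt
from statistics import mean, stdev

group1 = [2, 5, 8, 12, 15]
group2 = [6, 9, 11, 13, 17]

n1, n2 = len(group1), len(group2)
x1, x2 = mean(group1), mean(group2)
s1, s2 = stdev(group1), stdev(group2)

se = sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
z = (x1 - x2) / se

# Two-tailed p-value from the standard normal CDF
phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
p = 2 * (1 - phi(abs(z)))

print(round(z, 2), round(p, 3))
```

Here z is about -0.94 and the two-tailed p-value is roughly 0.35, well above 0.05, so with this small illustrative dataset there is no significant difference between the group means.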
Image showing data entered and mean of Group 1 data is calculated using the formula entered
Image showing mean values of both groups calculated 8.4 and 11.2 respectively
Image showing Standard deviation for both groups calculated using STDEV.S function
Image showing standard deviation for both groups calculated.
Image showing Z Test:Two sample for Means screen
Click in the Variable 1 input field and, when the cursor starts to blink, select the cells containing the Group 1 data. On selection, their addresses are entered into the Variable 1 field automatically.
Next, click in the Variable 2 input field and select the values under Group 2; the cell addresses appear in the Variable 2 field automatically. If the column labels are included in the selection, the box in front of Labels should be checked. The alpha value can be left at its default.
Under the output options, choose the cells where Excel should display the results; their addresses are entered into the field automatically. Clicking the OK button displays the results.
Image showing the result displayed
25
Pivot Table
Introduction:
Pivot tables in Excel can be a powerful tool for statistical analysis because they allow you to summarize and analyze large amounts of data quickly and easily. Here are some steps to follow:
1. First, organize your data in a table with columns and rows. Make sure that the data is clean and
there are no missing values.
2. Select the data range that you want to analyze, including the column headers.
3. Go to the “Insert” tab on the Excel ribbon and click on “PivotTable”. A dialog box will appear
where you can select the data range and choose where to place the pivot table.
4. In the PivotTable Field List, drag and drop the variables that you want to analyze into the “Rows”
and “Values” sections. For example, if you want to analyze the average salary of employees by department, drag the “Department” variable into the “Rows” section and the “Salary” variable into the
“Values” section.
5. You can also add filters and columns to your pivot table by dragging and dropping variables into
the “Filters” and “Columns” sections.
6. Once you have set up your pivot table, you can use the features in the “Design” and “Analyze”
tabs on the Excel ribbon to format and analyze your data. For example, you can use the “Summarize
Values By” option to choose how to summarize your data (e.g. sum, average, count, etc.), and you
can use the “Group” feature to group your data by specific intervals (e.g. group salaries by $10,000
increments).
Overall, pivot tables in Excel are a versatile tool that can be used to explore and analyze data from
various perspectives. By using pivot tables in statistical analysis, you can gain insights and make
data-driven decisions.
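The drag-and-drop steps above can also be cross-checked outside Excel. As a point of comparison (not part of Excel itself), pandas' pivot_table mirrors them, with index playing the role of the Rows area and values/aggfunc the Values area. The employee data here is hypothetical:

```python
import pandas as pd

# Hypothetical employee data, mirroring the department/salary example above
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT", "HR"],
    "Salary": [50000, 60000, 70000, 80000, 45000],
})

# "Department" plays the role of the Rows area, "Salary" the Values area,
# summarized by average (Excel's "Summarize Values By" -> Average)
summary = pd.pivot_table(df, index="Department", values="Salary", aggfunc="mean")
print(summary)
```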
Power of Pivot:
A PivotTable is an interactive tool that provides a quick overview of large data sets. It allows you to
analyze numerical data in detail and answer unexpected questions about the data. PivotTables are
Prof Dr Balasubramanian Thiagarajan MS D.L.O
specifically designed to perform the following tasks:
- Query large amounts of data in various user-friendly ways.
- Summarize numeric data by categories and subcategories, subtotal and aggregate data, and create custom calculations and formulas.
- Expand and collapse levels of data to focus on specific results, and drill down to details from the summary data for areas of interest.
- Pivot rows to columns or columns to rows to see different summaries of the source data.
- Filter, sort, group, and apply conditional formatting to display the most useful and interesting subset of data, enabling you to focus on specific information.
- Present concise, visually appealing, and annotated reports online or in print.
Ways to query large data using Pivot table:
Pivot tables are a powerful feature of Excel that allow you to analyze and summarize large amounts of data quickly and easily. Here are some ways you can use pivot tables to query large amounts of data:
1. Filter data: You can filter data by selecting the fields that you want to include or exclude from the pivot table. For example, if you have sales data for multiple products, you can filter the data to show only the sales data for a specific product.
2. Group data: You can group data by categories such as date, month, or year. This allows you to view the data in a more meaningful way and helps you identify patterns and trends.
3. Calculate summary data: Pivot tables allow you to calculate summary data such as sums, averages, and counts for different categories. For example, you can calculate the total sales for each product category.
4. Drill down to details: You can drill down to the details of the data by double-clicking on a cell in the pivot table. This allows you to see the underlying data that makes up the summary.
5. Create calculated fields: You can create calculated fields based on existing data to perform more complex calculations. For example, you can calculate the percentage of sales for each product category.
6. Use slicers: Slicers are visual controls that allow you to filter data in a pivot table. They make it easy to select specific data to display in the pivot table.
Overall, pivot tables are a great tool for querying large amounts of data in Excel. By using the various
features available in pivot tables, you can quickly analyze and summarize large amounts of data and
gain valuable insights.
Data filtering:
To filter data using pivot table in Excel, follow these steps:
1. Select the data you want to analyze in the pivot table.
2. Go to the “Insert” tab in the Excel ribbon, and click on the “PivotTable” button in the “Tables”
group.
3. In the “Create PivotTable” dialog box, select the range of data you want to include in the pivot
table.
4. Choose where you want the pivot table to be located, and click “OK”.
5. The pivot table will be created in a new sheet. You will see the field list on the right side of the
sheet, which includes all the columns from the data you selected.
6. To filter data, drag one or more columns from the field list to the “Filters” area at the bottom of the
field list.
7. Click the drop-down arrow next to the filter you want to apply, and select the criteria you want to
filter by.
8. Click “OK” to apply the filter.
9. The pivot table will be updated to show only the data that meets the criteria you selected.
You can also use the “Value Filters” option to filter data based on numerical values, such as greater
than or less than a certain number. To do this, select the column you want to filter, and choose “Value Filters” from the drop-down menu.
Overall, filtering data using pivot table is a quick and easy way to analyze data and identify trends or
patterns. By filtering data, you can focus on specific aspects of the data and gain insights that may
not be apparent from the raw data.
Here is a sample dataset that we can use to demonstrate data filtering using pivot table in Excel:
| Region  | Country | Sales |
|---------|---------|-------|
| Asia    | China   | 100   |
| Asia    | Japan   | 200   |
| Asia    | Korea   | 150   |
| Europe  | France  | 300   |
| Europe  | Germany | 250   |
| Europe  | Italy   | 200   |
| America | USA     | 400   |
| America | Canada  | 150   |
| America | Mexico  | 250   |
To filter this data using pivot table, follow these steps:
Select the data in Excel and go to the “Insert” tab in the ribbon.
Click on the “PivotTable” button in the “Tables” group, and select the range of data you want to analyze.
Choose where you want the pivot table to be located, and click “OK”.
In the “PivotTable Fields” pane on the right side of the sheet, drag the “Region” field to the “Filters”
area at the bottom of the pane.
The filter dropdown will appear in the pivot table. Click the dropdown arrow and select the checkbox
next to the region you want to filter by. For example, if you want to see only data for the “Europe”
region, select the “Europe” checkbox.
Click “OK” to apply the filter.
The pivot table will now show only the data for the selected region.
You can also filter data using other fields, such as “Country” or “Sales”. To do this, simply drag the
desired field to the “Filters” area, and select the criteria you want to filter by.
Overall, data filtering using pivot table is a powerful tool for analyzing and summarizing large
amounts of data in Excel. By using filters, you can focus on specific aspects of the data and gain insights that may not be apparent from the raw data.
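For comparison, the same region filter can be sketched in pandas (assumed here as an outside-Excel cross-check), using the Region/Country/Sales sample dataset from this section:

```python
import pandas as pd

# The Region/Country/Sales sample dataset from this section
df = pd.DataFrame({
    "Region": ["Asia"] * 3 + ["Europe"] * 3 + ["America"] * 3,
    "Country": ["China", "Japan", "Korea", "France", "Germany", "Italy",
                "USA", "Canada", "Mexico"],
    "Sales": [100, 200, 150, 300, 250, 200, 400, 150, 250],
})

# Filtering to the "Europe" region, as the report filter does in the pivot table
europe = df[df["Region"] == "Europe"]
total_europe_sales = europe["Sales"].sum()
print(total_europe_sales)
```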
Grouping data:
To group data using a pivot table, follow these steps:
1. Select the range of data that you want to summarize using a pivot table.
2. In the “Insert” tab on the Excel ribbon, click on “PivotTable” to create a new pivot table.
3. In the “Create PivotTable” dialog box, select the range of data that you want to use for your pivot
table, and choose where you want to place the pivot table (e.g., in a new worksheet or in an existing
one).
4. In the PivotTable Fields task pane, drag the column headings that you want to group into the
“Rows” or “Columns” section of the pivot table.
5. Right-click on the column header that you want to group, and choose “Group” from the context
menu.
6. In the “Grouping” dialog box, specify the range of values that you want to group together. For
example, you might group a series of dates by month, quarter, or year.
7. Click “OK” to close the dialog box and create your new grouping.
8. Repeat steps 5-7 for any additional columns that you want to group.
9. To summarize your data, drag the column headings that you want to summarize into the “Values”
section of the pivot table. You can choose to summarize data using a variety of functions, such as
sum, count, average, or max/min.
10. Format your pivot table as desired, and save your work.
That’s it! With these steps, you can group data using a pivot table and easily summarize your data in a
variety of ways.
Here’s an example of how to group data using a pivot table in Excel:
Start with a dataset that you want to summarize. For example, let’s say you have a list of sales transactions, with columns for date, product, and sales amount.
| Date     | Product  | Sales Amount |
|----------|----------|--------------|
| 1/1/2023 | Widget A | $100         |
| 1/2/2023 | Widget A | $150         |
| 1/2/2023 | Widget B | $75          |
| 1/3/2023 | Widget A | $125         |
| 1/4/2023 | Widget B | $50          |
1. Select the entire dataset (including headers) and click on “Insert” in the Excel ribbon, and then
click on “PivotTable” to create a new pivot table.
2. In the “Create PivotTable” dialog box, make sure that the range of your data is correct and choose
where you want to place your pivot table (e.g., in a new worksheet).
3. In the PivotTable Fields task pane on the right, drag the “Date” column to the “Rows” section, and
drag the “Product” column to the “Columns” section.
4. Drag the “Sales Amount” column to the “Values” section. By default, Excel will summarize the
values using the “Sum” function.
5. Now, let’s group the dates by month. Right-click on any date in the “Rows” section and select
“Group” from the context menu.
6. In the “Grouping” dialog box, select “Months” and uncheck all the other options. Click “OK” to
close the dialog box and apply the grouping.
7. Your pivot table should now show the sales amount by product and by month:
|     | Widget A | Widget B |
|-----|----------|----------|
| Jan | $375     | $125     |
8. You can further customize your pivot table by adding additional fields or changing the summary
function for the sales amount. For example, you could add a “Region” field to the “Filters” section
and filter the pivot table by region.
That’s it! With these steps, you can easily group data using a pivot table and summarize your data in a
variety of ways.
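The month grouping above can be cross-checked outside Excel; this pandas sketch (an assumption, not an Excel feature) reproduces the group-by-months pivot from the Widget dataset:

```python
import pandas as pd

# The sales-transaction dataset from the grouping example above
df = pd.DataFrame({
    "Date": pd.to_datetime(["1/1/2023", "1/2/2023", "1/2/2023",
                            "1/3/2023", "1/4/2023"], format="%m/%d/%Y"),
    "Product": ["Widget A", "Widget A", "Widget B", "Widget A", "Widget B"],
    "Sales Amount": [100, 150, 75, 125, 50],
})

# Grouping dates by month (the "Group -> Months" step) and pivoting
# products to columns (the Columns area)
df["Month"] = df["Date"].dt.to_period("M")
monthly = pd.pivot_table(df, index="Month", columns="Product",
                         values="Sales Amount", aggfunc="sum")
print(monthly)
```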
Data Summary:
Data summary refers to the process of analyzing and synthesizing large amounts of data into a
condensed and informative format that allows for quick and easy interpretation. It involves using
statistical and analytical tools to identify patterns, trends, and relationships within the data, and then
presenting these findings in a way that is easily understandable and actionable.
The purpose of data summary is to provide insights into the underlying patterns and trends in the
data, allowing individuals and organizations to make informed decisions and take appropriate actions. Data summary may include measures such as averages, medians, standard deviations, correlations, and regression analyses, among others.
Common methods for summarizing data include creating charts, tables, and graphs, as well as using
pivot tables, dashboards, and reports. The ultimate goal of data summary is to provide a clear and
concise understanding of the data, allowing individuals and organizations to make data-driven decisions and achieve their objectives.
Here’s an example of some sample data you could use to create a pivot table in Excel:
| Region | Country | Salesperson | Product | Quantity | Revenue |
|--------|---------|-------------|---------|----------|---------|
| East   | USA     | John        | Apples  | 100      | 500     |
| East   | USA     | John        | Bananas | 50       | 250     |
| East   | USA     | Sarah       | Apples  | 75       | 375     |
| East   | USA     | Sarah       | Bananas | 25       | 125     |
| East   | Canada  | Jacques     | Apples  | 150      | 750     |
| East   | Canada  | Jacques     | Bananas | 75       | 375     |
| West   | USA     | David       | Apples  | 120      | 600     |
| West   | USA     | David       | Bananas | 80       | 400     |
| West   | Canada  | Juan        | Apples  | 90       | 450     |
| West   | Canada  | Juan        | Bananas | 30       | 150     |
Using this data, you could create a pivot table that summarizes the total revenue by region and product. To do this, you would:
1. Select the data and click on the “Insert” tab in the Excel ribbon.
2. Click on the “PivotTable” button and select the location where you want the pivot table to appear.
3. In the “Create PivotTable” dialog box, make sure the “Select a table or range” option is selected and
that the range matches your data.
4. Choose to create the pivot table in a new worksheet or in the existing worksheet.
5. In the new worksheet, drag the “Region” field to the “Rows” area, and the “Product” field to the
“Columns” area.
6. Drag the “Revenue” field to the “Values” area.
7. The pivot table will automatically sum the revenue by region and product, and display the results
in the cells.
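The same region-by-product revenue summary can be sketched with pandas as a cross-check (assumed here, not part of Excel), using the dataset above:

```python
import pandas as pd

# Region, Product, and Revenue columns from the fruit-sales dataset above
df = pd.DataFrame({
    "Region": ["East"] * 6 + ["West"] * 4,
    "Product": ["Apples", "Bananas"] * 5,
    "Revenue": [500, 250, 375, 125, 750, 375, 600, 400, 450, 150],
})

# Region -> Rows, Product -> Columns, Revenue -> Values (summed)
summary = pd.pivot_table(df, index="Region", columns="Product",
                         values="Revenue", aggfunc="sum")
print(summary)
```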
Drilling down to details:
Drilling down to details in a pivot table means expanding the view of the data to show more granular
information. This allows you to see the underlying details that make up the summary data displayed
in the pivot table.
To drill down to details in a pivot table in Excel, follow these steps:
1. Click on the cell containing the value you want to drill down on.
2. Right-click on the cell and select “Show Details” from the context menu.
3. Excel will create a new sheet containing the detailed data that makes up the selected value. This
sheet will contain a table of all the individual data points that were used to calculate the summary
value in the pivot table.
You can also use the “Drill Down” feature in Excel to see details for an entire row or column. To do
this, follow these steps:
1. Click on the row or column label that you want to drill down on.
2. Right-click on the label and select “Drill Down” from the context menu.
3. Excel will create a new sheet containing the detailed data for the selected row or column. This
sheet will contain a table of all the individual data points that make up the row or column.
By drilling down to details in a pivot table, you can gain a deeper understanding of the underlying
data and identify patterns or trends that may not be immediately visible in the summary view. This
can help you make more informed decisions and take more targeted actions based on the data.
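What "Show Details" does can be pictured as selecting the source rows behind a summary cell. Here is a pandas sketch of that idea (an illustration under that assumption, not Excel's actual mechanism), using the fruit-sales dataset from this chapter:

```python
import pandas as pd

# The fruit-sales dataset from the data-summary example
df = pd.DataFrame({
    "Region": ["East"] * 6 + ["West"] * 4,
    "Salesperson": ["John", "John", "Sarah", "Sarah", "Jacques", "Jacques",
                    "David", "David", "Juan", "Juan"],
    "Product": ["Apples", "Bananas"] * 5,
    "Revenue": [500, 250, 375, 125, 750, 375, 600, 400, 450, 150],
})

# "Show Details" on the East/Apples summary cell amounts to selecting
# the individual rows that were aggregated into it
mask = (df["Region"] == "East") & (df["Product"] == "Apples")
details = df[mask]
print(details)                    # the underlying transactions
print(details["Revenue"].sum())   # the value shown in the summary cell
```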
Creating calculated fields:
You can create calculated fields in a pivot table to perform calculations based on existing data fields.
Calculated fields are useful when you need to perform calculations that are not available in the original data source.
To create a calculated field in a pivot table in Excel, follow these steps:
1. Select any cell in the pivot table.
2. Go to the “PivotTable Analyze” or “Options” tab in the Excel ribbon and click on “Fields, Items, &
Sets” (in older versions of Excel, this may be labeled “Formulas”).
3. Select “Calculated Field” from the drop-down menu.
4. In the “Name” field, enter a name for the calculated field.
5. In the “Formula” field, enter the formula for the calculation you want to perform using the available fields and operators. For example, if you want to calculate the average price per unit, you could
enter “Revenue / Quantity” as the formula.
6. Click “Add” to add the calculated field to the pivot table.
7. The calculated field will appear as a new field in the “Values” area of the pivot table, and will be
calculated based on the formula you entered.
Note that calculated fields are only available for the current pivot table, and are not saved with the
original data source. If you want to use the calculated field in another pivot table or worksheet, you
will need to create it again in that location.
Also, keep in mind that the syntax for calculated fields may differ slightly depending on the version
of Excel you are using. Consult Excel’s documentation for specific instructions and examples for your
version.
Here is an example dataset that you can use to create a calculated field in a pivot table:
| Product Category | Sales | Cost |
|------------------|-------|------|
| Electronics      | 500   | 300  |
| Clothing         | 400   | 200  |
| Beauty           | 300   | 150  |
| Electronics      | 600   | 400  |
| Clothing         | 700   | 350  |
| Beauty           | 400   | 200  |
In this example, we have a dataset that contains sales and cost data for three product categories:
electronics, clothing, and beauty. You can use this dataset to create a pivot table that summarizes the
sales and cost data by product category.
To create a calculated field, let’s say you want to add a new field to calculate the profit margin for each
product category. You can add a calculated field called “Profit Margin” using the following formula:
= (Sales - Cost) / Sales
This formula calculates the profit margin as a percentage by subtracting the cost from the sales and
dividing by the sales.
You can add this calculated field to your pivot table by selecting the “Fields, Items & Sets” option in
the “PivotTable Analyze” tab and choosing “Calculated Field”. Then enter the formula above and click
“Add”. This will add a new column to your pivot table that displays the profit margin for each product
category.
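Note that Excel evaluates a calculated field on the summed totals of each category rather than row by row. This pandas sketch (assumed here, for cross-checking outside Excel) applies the same formula to the category totals:

```python
import pandas as pd

# The sales/cost dataset from the calculated-field example
df = pd.DataFrame({
    "Product Category": ["Electronics", "Clothing", "Beauty"] * 2,
    "Sales": [500, 400, 300, 600, 700, 400],
    "Cost": [300, 200, 150, 400, 350, 200],
})

# Summarize Sales and Cost by category first, then apply the
# calculated-field formula = (Sales - Cost) / Sales to the totals
summary = df.groupby("Product Category")[["Sales", "Cost"]].sum()
summary["Profit Margin"] = (summary["Sales"] - summary["Cost"]) / summary["Sales"]
print(summary)
```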
Data Filtering using Pivot Table:
Image showing Data entered
Image showing the Insert tab, which on being clicked exposes the PivotTable button. From the PivotTable drop-down menu, From Table/Range is chosen
Image showing the PivotTable from Table or Range dialog box. In the Table/Range field, the addresses of the cells containing the data are entered.
Image showing the PivotTable Fields pane. In this example, data filtering is going to be performed. If the user wishes to filter sales data by country, then the Country header should be dragged to the Filters field and Sales to the Values field, as shown.
Image showing the result of the filter applied. If the sum of sales for Canada is needed, Canada should be chosen from the list of countries displayed on clicking the down arrow, and the sum of sales is displayed as shown
By placing a tick mark in Select Multiple Items, the user can apply the filter to multiple countries. In this image three countries (Canada, France, and Germany) have been chosen.
Image showing the sum of sales of the selected countries displayed
Grouping data using Pivot tables:
Image showing the PivotTable Fields dialog box. To group the dataset, the user drags the Country header into the Columns box as shown, the Region header into the Rows box, and the Sales header into the Values box.
Image showing sales data grouped
Image showing the data that needs to be grouped. It describes sales of products on given dates
Image showing the PivotTable Fields dialog. The Date field is dragged into Rows and the Product field into Columns. Sales is dragged into the Values field.
Image showing grouping of the sum of sales by month
Image showing the right-click submenu where Group is chosen
Image showing the Grouping dialog box where Months is chosen. On clicking the OK button the data is grouped and displayed month-wise.
Image showing sales figures displayed in a month-wise manner
Data summary using Pivot table:
Image showing fruit sales data entered region-wise. This dataset also breaks down sales by salesperson.
Image showing the PivotTable Fields pane. Product is dragged into the Columns field, Region into the Rows field, and Revenue into the Values field
Image showing Revenue arranged region- and product-wise.
Drilling down to details using Pivot table:
Image showing drilling down to details at work. The cell whose details need to be expanded is right-clicked, and Show Details is clicked in the context menu.
Image showing the result of detailed drilling of a data field.
Pivot Slicers:
Slicers are a visual filtering feature in pivot tables that allow you to filter your data in a more user-friendly way. They are essentially a set of interactive buttons that you can use to quickly filter your
pivot table by certain criteria, such as dates, categories, or regions.
When you add a slicer to a pivot table, you can select one or more values in the slicer to filter the data
in the pivot table. This makes it easy to focus on specific subsets of data without having to manually
adjust the filters in the pivot table. Slicers are particularly useful when you are working with large
data sets that have many different dimensions and need to be able to filter and analyze your data
quickly and easily.
Here’s an example of how you can use a slicer in a pivot table: Let’s say you have a pivot table that
shows sales by product category and by region, and you want to filter the data to only show sales for
a specific region. You can add a slicer for the region field, which will display a list of regions as buttons. Then, you can simply select the region that you want to filter by, and the pivot table will update
to show only the data for that region.
Overall, slicers are a convenient and user-friendly way to filter your pivot table data, and can make it
easier to analyze and understand complex data sets.
Here is an example dataset and how you can use slicers to filter data in a pivot table:
Let’s say you have a sales data set that looks like this:
| Region | Product | Sales |
|--------|---------|-------|
| East   | A       | 100   |
| East   | B       | 150   |
| West   | A       | 200   |
| West   | B       | 250   |
You can create a pivot table from this data set to summarize the sales data by region and by product.
To create the pivot table, you can follow these steps:
Select the entire data set and click on the “PivotTable” button in the “Insert” tab of the ribbon.
In the “Create PivotTable” dialog box, select the range of cells containing the data set and choose
where to place the pivot table (e.g., a new worksheet or an existing one).
In the “PivotTable Fields” pane, drag the “Region” and “Product” fields to the “Rows” section, and
drag the “Sales” field to the “Values” section.
Now you have a pivot table that shows the total sales for each product in each region.
Image showing dataset entered into Excel columns
Image showing Pivot table creation process by clicking on Insert tab and subsequently Pivot table tab
Image showing the PivotTable dialog box. The cursor is placed inside the Table/Range field and clicked. When it starts to blink, the entire dataset, including the header, is selected, and the respective cell addresses immediately appear in this field. The location of the pivot table is chosen as a new worksheet, and on clicking the OK button the pivot table is created.
Next, you can add a slicer to filter the data in the pivot table by region. To do this, you can follow
these steps:
1. Select any cell within the pivot table.
2. In the “PivotTable Analyze” tab of the ribbon, click on the “Insert Slicer” button.
3. In the “Insert Slicers” dialog box, select the “Region” field and click “OK”.
4. A new slicer box will appear on the worksheet. You can resize and move the slicer box to a convenient location.
5. Now you have a slicer that allows you to filter the pivot table data by region. You can select one or
more regions in the slicer to show only the data for the selected regions in the pivot table. For example, if you select “East” in the slicer, the pivot table will update to show only the sales data for the East
region.
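Conceptually, a slicer selection is just a filter applied before the summary is computed. Here is a pandas sketch of that idea (an outside-Excel illustration, under that assumption), using the small slicer dataset above:

```python
import pandas as pd

# The small Region/Product/Sales dataset from the slicer example
df = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Product": ["A", "B", "A", "B"],
    "Sales": [100, 150, 200, 250],
})

# Selecting "East" in the slicer is equivalent to filtering before pivoting
selected_regions = ["East"]
filtered = df[df["Region"].isin(selected_regions)]
pivot = pd.pivot_table(filtered, index=["Region", "Product"],
                       values="Sales", aggfunc="sum")
print(pivot)
```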
Image showing the PivotTable Fields pane. The Region and Product headers are dragged into the Rows field, as shown by the pink arrow, and Sales is pulled into the Values field, as shown by the blue arrow.
Image showing PivotTable created
Image showing Insert slicer tab
Image showing slicer for Region introduced
Image showing the effects of clicking on the East region. Results specific to this region are displayed on the left side
Image showing the result of clicking on West under Region category
Overall, slicers are a powerful feature of pivot tables that allow you to filter and analyze your data
quickly and easily. By using slicers, you can create more interactive and dynamic pivot tables that
make it easy to explore your data in different ways.
26
Data Cleaning
Cleaning up the data refers to the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset. This process can include removing duplicate data, correcting formatting errors, filling in missing values, standardizing data, and removing irrelevant or incorrect data.
Data cleaning is an important step in data analysis because it ensures that the data is accurate and consistent, which can lead to more reliable insights and conclusions. If data is not cleaned properly, it can
lead to misleading results, errors, and inaccuracies in any analysis or modeling that is done with the
data.
Overall, cleaning up the data is a crucial step in preparing data for analysis or modeling, and helps to
ensure that the results of the analysis are reliable and trustworthy.
Excel can be a powerful tool for cleaning up data. Here are some steps you can follow to clean up your
data in Excel:
1. Open your data in Excel: You can either import your data from a CSV, TXT, or other file format, or
copy and paste your data directly into Excel.
2. Identify the issues with your data: Take a look at your data and identify any issues that need to be
addressed. For example, you might have missing data, inconsistent formatting, or duplicates.
3. Remove duplicates: Use the “Remove Duplicates” feature in Excel to remove any duplicate rows in
your data.
4. Fill in missing data: Use Excel’s “Fill” feature to fill in any missing data based on the existing data in
your spreadsheet. For example, if you have a column for “State” but some rows are missing this information, you can use the “Fill” feature to fill in the missing state based on the state listed in the row
above.
5. Format data consistently: Use Excel’s formatting tools to ensure that your data is formatted consistently. For example, you might want to ensure that all dates are formatted in the same way, or that all
currency values have the same number of decimal places.
6. Use formulas to clean up data: Excel has a wide range of formulas that can help you clean up your
data. For example, you can use the “TRIM” formula to remove any extra spaces from your data, or
the “LOWER” formula to convert all text to lowercase.
7. Use filters to identify and remove data: Excel’s filters can help you identify specific data that needs
to be removed or edited. For example, you might use a filter to identify all rows where a certain column contains a specific word, and then delete those rows.
8. Save your cleaned-up data: Once you have finished cleaning up your data, be sure to save it in a
format that is easy to work with, such as a CSV or Excel file.
By following these steps, you can use Excel to clean up your data quickly and easily, and ensure that
your data is accurate and consistent.
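The same cleaning steps can be sketched outside Excel; the pandas calls below (an assumption, not Excel features) correspond roughly to steps 3, 4, and 6, applied to a small hypothetical table:

```python
import pandas as pd

# A small, messy example table (hypothetical) illustrating the steps above
df = pd.DataFrame({
    "Name": ["  John ", "jane", "  John ", "Alice"],
    "State": ["NY", None, "NY", "CA"],
})

df = df.drop_duplicates()                        # step 3: remove duplicate rows
df["State"] = df["State"].ffill()                # step 4: fill missing values from the row above
df["Name"] = df["Name"].str.strip().str.lower()  # step 6: TRIM and LOWER equivalents
print(df)
```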
Identifying data inconsistencies:
There are several ways to identify issues with data. Here are a few common methods:
1. Reviewing the data: Take a close look at the data to identify any inconsistencies, errors, or missing
values. Look for patterns or trends that may indicate issues with the data.
2. Using data visualization: Create charts or graphs to visualize the data and identify any anomalies
or outliers. Data visualization can also help you identify patterns or trends that may be difficult to
see in the raw data.
3. Running statistical analysis: Conduct statistical analysis on the data to identify any significant
differences or relationships. This can help you identify any issues with the data, such as outliers or
missing values.
4. Comparing the data to external sources: Compare the data to external sources, such as industry
benchmarks or government data. This can help you identify any issues with the data, such as inconsistencies or inaccuracies.
5. Using automated data cleaning tools: There are several automated data cleaning tools available
that can help you identify issues with the data, such as missing values or inconsistent formatting.
Overall, it’s important to take a thorough and systematic approach to identifying issues with data. By
using a combination of methods, you can ensure that your data is accurate and reliable.
Auto data cleaning tools in excel:
Excel has several built-in features and tools that can help automate the data cleaning process. Here
are a few examples:
1. Remove Duplicates: Excel’s “Remove Duplicates” feature can automatically identify and remove
any duplicate rows in your dataset.
2. Text to Columns: Excel’s “Text to Columns” feature can split data in a column into separate columns based on a delimiter, such as a comma or a space.
3. Conditional Formatting: Excel’s “Conditional Formatting” feature can automatically highlight cells
that meet certain criteria, such as cells that contain errors or cells that are outside a certain range.
4. Data Validation: Excel’s “Data Validation” feature can help ensure that data is entered correctly by
setting rules for data entry, such as requiring a certain format or restricting the range of values.
5. Excel Formulas: Excel has a wide range of formulas that can help automate data cleaning tasks.
For example, you can use the “TRIM” formula to remove extra spaces in data, or the “IF” formula to
replace missing values with a default value.
6. Pivot Tables: Excel’s “Pivot Table” feature can help summarize and analyze large datasets, making it
easier to identify issues with the data.
Overall, these tools and features can help automate many of the data cleaning tasks in Excel, saving
time and improving the accuracy and consistency of your data.
Removing duplicate data:
Here’s an example dataset that includes some duplicate rows:
| Name   | Age | Gender | Occupation |
|--------|-----|--------|------------|
| John   | 30  | Male   | Engineer   |
| Jane   | 25  | Female | Accountant |
| John   | 30  | Male   | Engineer   |
| Alice  | 35  | Female | Lawyer     |
| Peter  | 28  | Male   | Doctor     |
| Rachel | 32  | Female | Engineer   |
To remove the duplicate rows in this dataset, you can follow these steps:
1. Select the entire dataset by clicking on the top-left corner of the table, or by pressing “Ctrl+A” on
your keyboard.
2. Go to the “Data” tab in the Excel ribbon and click on the “Remove Duplicates” button.
3. In the “Remove Duplicates” dialog box, make sure that all columns are selected (in this case, all
four columns should be selected).
4. Click the “OK” button to remove the duplicate rows.
5. Excel will display a message showing how many duplicate rows were removed. Click “OK” to close
the message.
6. The duplicate rows will be removed from the dataset, and you should be left with only the unique
rows.
In this example, the duplicate row with the values "John, 30, Male, Engineer" is removed, leaving only one row with that combination of values.
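As a cross-check, the same deduplication can be sketched with pandas (assumed here, not an Excel feature), using the dataset from this example:

```python
import pandas as pd

# The Name/Age/Gender/Occupation dataset from this example
df = pd.DataFrame({
    "Name": ["John", "Jane", "John", "Alice", "Peter", "Rachel"],
    "Age": [30, 25, 30, 35, 28, 32],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Occupation": ["Engineer", "Accountant", "Engineer", "Lawyer",
                   "Doctor", "Engineer"],
})

# Equivalent of Data -> Remove Duplicates with all columns checked
deduped = df.drop_duplicates()
print(len(df) - len(deduped), "duplicate row(s) removed")
```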
Image showing Dataset containing duplicate data
Image showing the Remove Duplicates dialog box, which is displayed on clicking the Remove Duplicates button under the Data tab.
Image showing confirmation dialog box confirming duplicate data has been removed
Locating blank data:
To locate blank data in Excel, you can use the “Go To Special” feature. Here are the steps:
Select the range of cells where you want to find blank data.
1. Go to the “Home” tab in the Excel ribbon and click on the “Find & Select” button in the “Editing”
group.
2. In the drop-down menu, select “Go To Special.”
3. In the “Go To Special” dialog box, select the “Blanks” option and click “OK.”
4. Excel will select all the blank cells in the range you specified.
5. To highlight the blank cells, you can use the “Conditional Formatting” feature. First, select the
blank cells as described above. Then, go to the “Home” tab in the Excel ribbon and click on “Conditional Formatting” in the “Styles” group. Select “Highlight Cell Rules” and then “Blank Cells.” Choose
a formatting option, such as a color, and click “OK.”
Now, all the blank cells in the selected range will be highlighted. You can use this information to fill
in the missing data, delete the rows or columns with blank cells, or take other actions as needed.
Here’s an example dataset that contains some empty cells:
| Name   | Age | Gender | Occupation |
|--------|-----|--------|------------|
| John   | 30  | Male   | Engineer   |
| Jane   | 35  | Female | Accountant |
| Peter  | 28  | Male   | Lawyer     |
| Rachel |     | Female | Engineer   |
Image showing dataset containing blank data
Image showing the location of Find and select tab and Go to Special menu
Image showing Go To Special Dialog box where in Blanks radio button has been chosen.
Image showing Blank cells highlighted
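The "Go To Special > Blanks" idea, locating every empty cell in a range, can be sketched in Python. This is a hypothetical illustration mirroring the chapter's dataset, in which Rachel's age is missing.

```python
# Sketch of locating blank cells, analogous to Go To Special > Blanks.
data = [
    {"Name": "John",   "Age": 30,   "Gender": "Male",   "Occupation": "Engineer"},
    {"Name": "Jane",   "Age": 35,   "Gender": "Female", "Occupation": "Accountant"},
    {"Name": "Peter",  "Age": 28,   "Gender": "Male",   "Occupation": "Lawyer"},
    {"Name": "Rachel", "Age": None, "Gender": "Female", "Occupation": "Engineer"},
]

# Collect (spreadsheet row number, column name) for every blank cell.
blanks = [
    (i, col)
    for i, row in enumerate(data, start=2)   # start=2: row 1 holds the headers
    for col, value in row.items()
    if value is None or value == ""
]
print(blanks)   # each entry points at one blank cell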
To remove the empty cells in this dataset, you can follow these steps:
1. Select the entire dataset by clicking on the top-left corner of the table, or by pressing “Ctrl+A” on
your keyboard.
2. Go to the “Home” tab in the Excel ribbon and click on the “Find & Select” button in the “Editing”
group.
3. In the drop-down menu, select “Go To Special.”
4. In the “Go To Special” dialog box, select the “Blanks” option and click “OK.”
5. Excel will select all the empty cells in the dataset.
6. Right-click on one of the selected cells and choose “Delete” from the context menu.
7. In the “Delete” dialog box, choose “Shift cells up” or “Shift cells left” to close the gaps left by the deleted cells. Alternatively, choose “Entire row” or “Entire column” if you want to remove every row or column that contains a blank.
8. Click “OK” to delete the empty cells.
Image showing Delete menu visible when the empty cell is right clicked.
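Deleting every row that contains a blank, as in the "Entire row" option above, amounts to keeping only the complete records. A minimal sketch, with hypothetical data:

```python
# Sketch of removing incomplete records, similar to selecting blanks with
# Go To Special and then deleting the entire row for each one.
data = [
    {"Name": "John",  "Age": 30},
    {"Name": "Jane",  "Age": None},   # blank age: this row will be dropped
    {"Name": "Peter", "Age": 28},
]

# Keep only rows in which every cell holds a value.
complete = [
    row for row in data
    if all(v is not None and v != "" for v in row.values())
]
print(complete)
```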
Summary of steps involved in data cleaning:
Data cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies in
data. Here are the general steps involved in data cleaning:
1. Define the problem: Determine what data you need to clean and what the purpose of the cleaned
data will be.
2. Data collection: Gather all the available data from different sources that you need to clean.
3. Data inspection: This step involves visualizing the data and examining it for errors, inconsistencies, and missing values.
4. Data cleaning: This step involves the actual process of cleaning the data, which can include various
techniques such as:
Removing duplicates: Identify and remove duplicate data points.
Handling missing data: Decide how to handle missing values, either by filling in the missing values,
or removing the incomplete data.
Standardizing data: Ensure that data is consistently formatted, such as capitalizing text, or converting
dates to a specific format.
Correcting errors: Identify and correct errors, such as typos or misspellings.
Removing outliers: Identify and remove data points that are significantly different from the rest of
the data.
5. Data transformation: This step involves transforming the cleaned data into a format that can be
used for analysis. For example, converting categorical data into numerical data, or creating new variables based on existing ones.
6. Data integration: Combine cleaned data sets from different sources into a single dataset.
7. Data validation: Validate the cleaned data to ensure that it is accurate and reliable for analysis.
8. Documentation: Document the entire data cleaning process, including the steps taken and the
decisions made, to ensure that the cleaned data can be reproduced and understood by others.
Standardizing data using Excel:
Standardizing data in Excel involves converting data into a consistent format, such as converting text
to lowercase or uppercase, or converting dates to a specific format. Here’s an example dataset and the
steps to standardize it in Excel:
Let’s say we have a dataset of employee names, job titles, and salaries, and we want to standardize the
job titles to all be in uppercase letters.
Employee Name	Job Title	Salary
John Doe	manager	$80,000
Jane Smith	technician	$50,000
Bob Johnson	Analyst	$70,000
Sarah Lee	Developer	$90,000
1. Select the column that you want to standardize. In this case, we want to standardize the Job Title column.
2. Note that, unlike Word, Excel’s Font dialog box has no “Uppercase” effect, so case cannot be changed through formatting alone; a formula is needed.
3. Insert a helper column next to the Job Title column and enter the formula “=UPPER(B2)” in its first cell (assuming the job titles start in cell B2).
4. Drag the fill handle down to apply the formula to the remaining rows.
5. Copy the helper column, use “Paste Special” > “Values” to overwrite the original Job Title entries, and then delete the helper column.
6. The Job Title column will now be standardized to uppercase.
Employee Name	Job Title	Salary
John Doe	MANAGER	$80,000
Jane Smith	TECHNICIAN	$50,000
Bob Johnson	ANALYST	$70,000
Sarah Lee	DEVELOPER	$90,000
You can also use Excel functions like UPPER or PROPER to standardize text data. For example, to
standardize the employee names to all be in uppercase, you can use the UPPER function in a new
column:
1. Insert a new column next to the Employee Name column.
2. In the first cell of the new column, enter the formula “=UPPER(A2)” (assuming the first data row
is in row 2).
3. Drag the formula down to apply it to all the cells in the new column.
4. The new column will now contain the standardized employee names in uppercase.
Employee Name	Standardized Name	Job Title	Salary
John Doe	JOHN DOE	MANAGER	$80,000
Jane Smith	JANE SMITH	TECHNICIAN	$50,000
Bob Johnson	BOB JOHNSON	ANALYST	$70,000
Sarah Lee	SARAH LEE	DEVELOPER	$90,000
Note: When standardizing data, it’s important to be consistent and document the changes made.
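The effect of the UPPER function on this table can be sketched in Python; `str.upper()` plays the role of `=UPPER()`:

```python
# Sketch of standardizing text case, mirroring =UPPER() in a helper column.
employees = [
    {"Employee Name": "John Doe",    "Job Title": "manager",    "Salary": "$80,000"},
    {"Employee Name": "Jane Smith",  "Job Title": "technician", "Salary": "$50,000"},
    {"Employee Name": "Bob Johnson", "Job Title": "Analyst",    "Salary": "$70,000"},
    {"Employee Name": "Sarah Lee",   "Job Title": "Developer",  "Salary": "$90,000"},
]

for row in employees:
    row["Job Title"] = row["Job Title"].upper()            # like =UPPER(B2)
    row["Standardized Name"] = row["Employee Name"].upper()

print([row["Job Title"] for row in employees])
```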
Image showing Data entered into Excel sheet
Image showing the formula for creating upper case entered into column D. On pressing the Enter key, a new column is created with all upper-case alphabets.
Image showing a new column created with all upper case alphabets
Correcting Errors and Misspellings:
Correcting errors and misspellings in a database using Excel can be done using the following steps:
1. Open the Excel file that contains the database you want to correct.
2. Identify the column or columns that contain the errors or misspellings.
3. Select the cells that contain the errors or misspellings.
4. Right-click on the selection and choose “Replace” from the context menu.
5. In the “Find what” field, enter the error or misspelling that you want to correct.
6. In the “Replace with” field, enter the correct spelling or information.
7. Click “Replace All” to replace all instances of the error or misspelling in the selected cells.
8. Review the cells to ensure that the corrections have been made.
9. Save the Excel file with the corrections.
10. If the corrections need to be made to the entire database, repeat steps 2-9 for each column that
contains errors or misspellings.
By following these steps, you can correct errors and misspellings in a database using Excel.
Sample database with misspelling:
Image showing database with misspelling errors
Image showing Replace menu
Image showing Find and Replace menu where Find and Replace fields have been filled
Image showing the corrections executed
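A batch "Replace All" pass can be sketched in Python with a corrections dictionary. The misspellings below are hypothetical examples, not taken from the pictured database.

```python
# Sketch of batch find-and-replace corrections, like "Replace All" in Excel.
corrections = {"Enginer": "Engineer", "Acountant": "Accountant"}  # hypothetical typos

cells = ["Enginer", "Acountant", "Lawyer", "Enginer"]

# Replace each cell that matches a known misspelling; leave others untouched.
fixed = [corrections.get(value, value) for value in cells]
print(fixed)
```

A dictionary lookup per cell makes it easy to apply many corrections in one pass, where Excel's dialog handles one find/replace pair at a time.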
Removing outliers:
Removing outliers is a common task in data analysis, and there are several methods to do so. Here
are some common methods:
1. Z-score method: Calculate the z-score for each data point, and if the z-score is greater than a
certain threshold (typically 3 or 2.5), the data point is considered an outlier and removed from the
dataset.
2. Percentile-based method: Remove the data points that fall outside a certain percentile range, such
as the 5th and 95th percentile.
3. Tukey’s method: Calculate the interquartile range (IQR) for the dataset, and then remove any data
points that fall more than 1.5 times the IQR below the first quartile or above the third quartile.
4. Visual inspection: Plot the data and visually inspect for any points that are far away from the majority of the data points. This method is subjective and depends on the data and the person analyzing
it.
It is important to note that removing outliers can have a significant impact on the data and its statistical properties. Therefore, it is essential to carefully consider which method to use and to justify any
outlier removal to ensure that the analysis is still valid.
Sample dataset:
Car Price
20000
25000
30000
35000
40000
45000
50000
55000
60000
65000
70000
75000
80000
85000
90000
95000
100000
105000
110000
120000
Suppose that we suspect that there may be some outliers in this dataset. Here are the steps to identify
and remove them using Excel:
1. Calculate the mean and standard deviation of the dataset. To do this, enter the following formulas in separate cells (in this example, cells B1 and B2):
Mean: =AVERAGE(A2:A21)
Standard Deviation: =STDEV(A2:A21)
Note that we are assuming that the data starts in cell A2 and ends in cell A21. Adjust these cell references as necessary.
2. Calculate the z-scores for each data point. To do this, enter the following formula in the first cell
next to the first data point:
=(A2-$B$1)/$B$2
Here, $B$1 and $B$2 refer to the mean and standard deviation calculated in step 1, respectively. The
dollar signs around these cell references make them absolute, so they will not change when we copy
the formula to other cells. Copy this formula to the cells next to all other data points.
3. Identify potential outliers. In general, any data point with a z-score greater than 3 or less than -3 can be considered an outlier. In this example, we will use a threshold of 2.5 to be more lenient. To highlight the potential outliers, select the range of cells with the z-scores, go to the “Home” tab, and click “Conditional Formatting” > “Highlight Cell Rules” > “Greater Than”, entering the value 2.5. Add a second rule with “Less Than” and the value -2.5 to catch low outliers. Choose a format that will make the cells stand out, such as a red fill.
4. Inspect the potential outliers. Check if the values identified as potential outliers are reasonable
given the context of the data. If any values seem suspicious, investigate further to determine if they
are valid data points or if they are errors.
5. Remove the outliers. Once you have identified the outliers that you want to remove, delete the rows that correspond to those data points from the dataset.
In this example, let’s say that we single out the values 20000 and 120000 for closer inspection (note that in this fairly uniform dataset neither value actually exceeds the 2.5 z-score threshold, which is why visual inspection matters). Upon inspection, we determine that 20000 is a valid data point representing a very cheap car, but 120000 seems suspicious and may be an error. We will therefore remove the row containing 120000 from the dataset.
After completing these steps, the dataset should look like this:
Car Price
25000
30000
35000
40000
45000
50000
55000
60000
65000
70000
75000
80000
85000
90000
95000
100000
105000
110000
Image showing data of car price entered into Excel
Image showing mean and standard deviation calculated
Image showing Z value calculated
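The z-score procedure above can be sketched with Python's standard `statistics` module, using the same car-price data and the 2.5 threshold. This is an illustrative sketch; note that with this particular data no point crosses the threshold, so the suspicious 120000 value comes from inspection rather than the z-score alone.

```python
# Sketch of the z-score outlier method applied to the car-price data.
import statistics

prices = [20000 + 5000 * i for i in range(19)] + [120000]  # 20000..110000, then 120000

mean = statistics.mean(prices)   # like =AVERAGE(A2:A21)
sd = statistics.stdev(prices)    # like =STDEV(A2:A21), the sample standard deviation

z_scores = [(p - mean) / sd for p in prices]
outliers = [p for p, z in zip(prices, z_scores) if abs(z) > 2.5]
kept = [p for p, z in zip(prices, z_scores) if abs(z) <= 2.5]

print(outliers)   # empty here: no |z| exceeds 2.5 in this fairly uniform data
```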
Percentile based method for identifying outliers:
Percentile-based methods are useful for identifying and removing outliers from a dataset. Here’s how
you can use this method:
1. Determine the percentiles to use: Start by selecting the percentiles to use for identifying outliers. A
common approach is to use the 95th and 5th percentiles as the upper and lower bounds, respectively. This means that any values above the 95th percentile or below the 5th percentile are considered
outliers.
2. Calculate the percentiles: Calculate the selected percentiles for the dataset. For example, if you are
using the 95th and 5th percentiles, you would calculate the 95th percentile and the 5th percentile for
your dataset.
3. Identify outliers: Identify any data points that fall above the 95th percentile or below the 5th percentile. These are considered outliers and should be removed from the dataset.
4. Remove outliers: Remove the identified outliers from the dataset. Depending on the nature of the
data and the analysis you’re conducting, you may choose to replace the outliers with more reasonable
values or simply remove them.
5. Recalculate percentiles: After removing the outliers, recalculate the percentiles for the remaining
data to ensure that they are still within the desired range.
6. Repeat if necessary: If you identify new outliers after removing the first set, repeat the process until
you have removed all the outliers.
It’s important to note that while percentile-based methods are useful for identifying outliers, they
may not be appropriate for all datasets. Additionally, it’s important to carefully consider the implications of removing outliers from your data, as this can have a significant impact on your analysis.
Here’s an example dataset that we can use to demonstrate the percentile-based method:
Data Point
10
12
15
20
25
30
35
40
45
50
55
60
70
80
90
100
To identify outliers in this dataset using a percentile-based method in Excel, follow these steps:
1. Calculate the percentiles: Use the PERCENTILE function in Excel to calculate the 5th and 95th percentiles. For example, to calculate the 5th percentile, use the formula =PERCENTILE(A2:A17,0.05), where A2:A17 is the range of cells containing the 16 data points. Repeat this process with 0.95 for the 95th percentile.
2. Identify outliers: Any data points that fall below the 5th percentile or above the 95th percentile are considered outliers. In this case, any data point below 11.5 or above 92.5 would be considered an outlier.
3. Remove outliers: Remove the identified outliers from the dataset. You can either delete the rows
containing the outliers or replace them with more reasonable values.
4. Recalculate percentiles: After removing the outliers, recalculate the percentiles to ensure that they
are still within the desired range.
5. Repeat if necessary: If you identify new outliers after removing the first set, repeat the process until
you have removed all the outliers.
Note that in step 1, you can adjust the percentile values to suit your needs. For example, you may
want to use the 1st and 99th percentiles instead of the 5th and 95th percentiles for a more stringent
outlier detection.
Image showing data entered and Percentile calculation formula entered
Image showing outlier’s marked in brown.
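The percentile calculation above can be reproduced with Python's `statistics.quantiles`; the "inclusive" method matches Excel's PERCENTILE interpolation, giving the same 11.5 and 92.5 cut points.

```python
# Sketch of the percentile-based outlier method on the sample data points.
import statistics

data = [10, 12, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 70, 80, 90, 100]

# quantiles(..., n=20) returns the 5%, 10%, ..., 95% cut points;
# method="inclusive" matches Excel's PERCENTILE / PERCENTILE.INC.
cuts = statistics.quantiles(data, n=20, method="inclusive")
p5, p95 = cuts[0], cuts[-1]
print(p5, p95)   # 11.5 92.5, as in the worked example

outliers = [x for x in data if x < p5 or x > p95]
print(outliers)  # [10, 100]
```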
Data Transformation:
Data transformation in statistics refers to the process of converting data from one form or scale to
another while preserving the essential information or relationships between variables. It involves
applying mathematical or statistical functions to the original data to obtain a new set of values that
better satisfy the assumptions of the statistical analysis or modeling technique being used.
Data transformation can be useful in several ways, such as:
1. Normalizing the data: transforming the data to follow a normal distribution, which is often a requirement for many statistical methods.
2. Reducing skewness and outliers: transforming the data to reduce the effects of extreme values or
outliers, which can distort the analysis or modeling results.
3. Linearizing relationships: transforming the data to make the relationship between variables more
linear, which can simplify the analysis or modeling and improve the accuracy of the results.
Examples of common data transformation methods include logarithmic, exponential, square root,
and inverse transformations. The choice of transformation method depends on the nature of the data
and the goals of the analysis or modeling.
Here is a sample dataset for data transformation using Excel, along with the steps used to perform the transformation. Let’s consider a hypothetical dataset of exam scores of 20 students, ranging from 50 to 100.
Student ID	Exam Score
1	75
2	80
3	85
4	70
5	90
6	60
7	95
8	65
9	70
10	80
11	50
12	85
13	75
14	90
15	60
16	70
17	75
18	95
19	80
20	90
To transform this dataset, let’s apply a logarithmic transformation to the exam scores. The steps to
perform this transformation in Excel are:
1. Open Microsoft Excel and import the dataset into a new workbook.
2. Create a new column next to the “Exam Score” column and label it “Log Score”.
3. In the first cell of the “Log Score” column, enter the formula “=LOG(B2)”, where B2 is the cell containing the first exam score (Excel’s LOG function returns the base-10 logarithm by default).
4. Copy the formula to the remaining cells in the “Log Score” column by selecting the first cell and
dragging the fill handle (the small square at the bottom-right corner of the cell) down to the last cell.
5. The transformed dataset is now ready for analysis. You can use the new “Log Score” column as a
replacement for the original “Exam Score” column in your statistical analysis or modeling.
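The same logarithmic transformation can be sketched in Python; `math.log10` corresponds to Excel's `=LOG(B2)` with its default base of 10.

```python
# Sketch of the log transformation of the 20 exam scores.
import math

exam_scores = [75, 80, 85, 70, 90, 60, 95, 65, 70, 80,
               50, 85, 75, 90, 60, 70, 75, 95, 80, 90]

# Base-10 log of each score, rounded for readability.
log_scores = [round(math.log10(s), 4) for s in exam_scores]
print(log_scores[:3])   # first three transformed values
```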
Image showing data entered and Formula for calculating Log score keyed in
Image showing the Log score calculated on pressing the Enter key. Cells below can be filled automatically by dragging down the fill handle (the small square in the lower-right corner of the cell).
Image showing Log score column filled with data
Data Integration:
Data integration in statistics refers to the process of combining data from multiple sources into a single, unified dataset for analysis. It involves merging or joining datasets that share common variables
or observations to create a larger dataset that provides a more comprehensive view of the phenomena
being studied.
Data integration can be useful in several ways, such as:
1. Enhancing data quality: combining data from different sources can help improve data quality by
filling in missing values, correcting errors, and reducing redundancy.
2. Enabling more comprehensive analyses: integrating data from multiple sources can provide a more
complete picture of the phenomena being studied, allowing for more comprehensive and accurate
analyses.
3. Supporting decision-making: integrating data can help decision-makers make more informed decisions by providing a more complete understanding of the relevant variables and their relationships.
Examples of common data integration techniques include merging datasets, joining datasets, and
appending datasets. The choice of technique depends on the nature of the data and the goals of the
analysis. Statistical software packages like R and Python have built-in functions for data integration,
and Microsoft Excel also has features for merging and joining datasets.
Example of data integration and the steps to follow for integrating data using Excel.
Let’s consider two hypothetical datasets of customer information from two different sources, such as
a customer database and a customer survey:
Dataset 1: Customer Database
Customer ID	First Name	Last Name	Email	Phone	Address
001	John	Smith	[email protected]	555-123-4567	123 Main St, Anytown, USA
002	Jane	Doe	[email protected]	—	456 High St, Somewhere, USA
Dataset 2: Customer Survey
Customer ID	Satisfaction Score	Likelihood to Recommend
001	8	7
002	9	9
003	6	4
004	7	8
To integrate these datasets using Excel, we can follow these steps:
1. Open Microsoft Excel and create a new workbook.
2. Import both datasets into the workbook as separate sheets.
3. Identify the common variable between the two datasets (in this case, “Customer ID”) and make
sure that the variable is formatted consistently across the two datasets.
4. Merge the two datasets by using the “VLOOKUP” function in Excel. To do this, we can add a new
column to the customer database sheet called “Satisfaction Score” and another new column called
“Likelihood to Recommend”. We can then use the “VLOOKUP” function to match the Customer ID
in the customer database sheet with the corresponding row in the customer survey sheet and bring
in the satisfaction score and likelihood to recommend.
5. Once we have integrated the data, we can use it for further analysis or reporting.
The steps to use the VLOOKUP function in Excel for data integration are as follows:
1. Add a new column in the customer database sheet to the right of the existing data.
2. In the first cell of the new column, enter the formula “=VLOOKUP(A2,'Customer Survey'!$A$2:$C$5,2,FALSE)”, where “A2” is the cell containing the first customer ID in the customer database sheet, “'Customer Survey'!$A$2:$C$5” is the range containing the customer ID, satisfaction score, and likelihood to recommend on the customer survey sheet, and “2” refers to the column containing the satisfaction score. (If the survey data lives in a separate workbook rather than a sheet, reference it as “[Customer Survey.xlsx]Sheet1!$A$2:$C$5” instead.)
3. Copy the formula to the remaining cells in the new column by selecting the first cell and dragging the fill handle down to the last cell.
4. Repeat the same process to add another new column for “Likelihood to Recommend”, replacing the “2” in the formula with “3”.
The integrated dataset is now ready for analysis.
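The VLOOKUP-style join can be sketched in Python: build a lookup keyed on Customer ID and pull the two survey columns into each database row. A minimal sketch using the example values above:

```python
# Sketch of VLOOKUP-style integration: match survey rows to database rows
# on Customer ID and bring in the two survey columns.
database = [
    {"Customer ID": "001", "First Name": "John", "Last Name": "Smith"},
    {"Customer ID": "002", "First Name": "Jane", "Last Name": "Doe"},
]
survey = {
    # Customer ID -> (Satisfaction Score, Likelihood to Recommend)
    "001": (8, 7),
    "002": (9, 9),
    "003": (6, 4),
    "004": (7, 8),
}

for row in database:
    score, recommend = survey.get(row["Customer ID"], (None, None))
    row["Satisfaction Score"] = score            # like VLOOKUP(..., 2, FALSE)
    row["Likelihood to Recommend"] = recommend   # like VLOOKUP(..., 3, FALSE)

print(database)
```

Like VLOOKUP with FALSE as the last argument, the dictionary lookup is an exact match; IDs present only in the survey are simply ignored.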
Data validation:
Data validation is the process of ensuring that data entered into a system or database is accurate,
complete, and consistent with certain predefined rules or criteria. It is a critical step in maintaining
data integrity and preventing errors, duplication, and inconsistencies.
Data validation involves setting up rules or checks that ensure the data is within acceptable ranges or
values, and that it meets certain conditions. These rules can include checks for data type, data format,
data range, data length, and other specific requirements that the data needs to meet.
For example, if you have a form for collecting customer information, you may want to ensure that the email addresses entered in the form are in a valid format, such as “[email protected]”. You may also want to ensure that the phone numbers are in a specific format, such as “(555) 123-4567”, and that the date of birth is within a certain range.
Data validation can be performed manually by reviewing and verifying the data, or it can be automated using software tools and scripts. Automated data validation is typically faster and more accurate, as it can perform checks on large datasets and quickly identify errors and inconsistencies.
In addition to ensuring data accuracy and consistency, data validation can also help improve data
quality, reduce data entry errors, and improve the overall efficiency and reliability of data-driven
processes.
In Excel, the Data Validation feature (on the Data tab) can be used to validate the data entered. If the user wishes to impose a condition on date entry (for example, a cell should accept only dates within a specified interval), it can be specified in the Data Validation dialog box.
Image showing Data validation window where dates both start and End are entered
Image showing the contents of the Tab Error alert window open and Error message that needs to be
displayed entered
Image showing Error message displayed
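The date-range rule configured in the Data Validation dialog can be sketched as a small check in Python. The start and end dates below are hypothetical bounds, not values from the book's screenshots.

```python
# Sketch of a date-range validation rule: accept only dates between a
# start and an end date, rejecting anything outside the interval.
from datetime import date

START, END = date(2023, 1, 1), date(2023, 12, 31)  # hypothetical bounds

def is_valid_entry(d: date) -> bool:
    """Return True if the date falls inside the allowed interval."""
    return START <= d <= END

print(is_valid_entry(date(2023, 6, 15)))   # True:  inside the interval
print(is_valid_entry(date(2024, 2, 1)))    # False: triggers the error alert
```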
27
Data Visualization
Data visualization is the process of representing data and information in a visual format, such as charts, graphs, and maps, to make it easier to understand, analyze, and communicate. It is a powerful tool for exploring and communicating complex data and can be used to identify patterns, trends, and relationships that may not be immediately apparent in raw data.
Data visualization can be used in a variety of fields, including business, science, engineering, and
social sciences, to help stakeholders make informed decisions based on data insights. Examples of data
visualization include bar charts, line graphs, scatter plots, heat maps, and geographic maps.
Effective data visualization requires careful consideration of the audience, data type, and purpose of
the visualization. The choice of visualization type and design can greatly impact the interpretation and
understanding of the data.
Some of the benefits of data visualization include:
1. Improved data comprehension: Visualizing data can help users quickly understand complex data
and identify patterns and trends.
2. Better decision-making: Data visualization can help stakeholders make informed decisions based
on data insights.
3. Enhanced communication: Visualizing data can help stakeholders communicate their findings and
insights more effectively.
4. Increased efficiency: Data visualization can help users quickly identify and address issues and opportunities.
There are many tools and software available for creating data visualizations, such as Excel, Tableau,
and Python libraries like Matplotlib and Seaborn. These tools provide a range of options for creating
different types of visualizations and allow for customization of design and layout.
There are many types of data visualization techniques that can be used to represent data and information in a visual format. Here are some of the most common types of data visualization:
1. Bar charts: Bar charts are used to compare different categories of data by showing bars of different
lengths. They are useful for displaying discrete or categorical data.
2. Line graphs: Line graphs are used to show trends over time or to compare different groups. They are
useful for displaying continuous data.
3. Scatter plots: Scatter plots are used to show the relationship between two variables. They are useful
for identifying patterns and trends in data.
4. Pie charts: Pie charts are used to show proportions of a whole. They are useful for displaying categorical data.
5. Heat maps: Heat maps are used to show the density or intensity of data in two-dimensional space.
They are useful for displaying large amounts of data.
6. Tree maps: Tree maps are used to show hierarchical data using nested rectangles. They are useful for
displaying data with multiple levels of detail.
7. Geographic maps: Geographic maps are used to display data geographically. They are useful for
showing patterns and trends across regions.
8. Bubble charts: Bubble charts are used to show the relationship between three variables by using
bubbles of different sizes. They are useful for displaying complex data.
These are just a few examples of the many types of data visualization techniques that can be used to
represent data and information. The choice of visualization type depends on the data type, the audience, and the purpose of the visualization.
Bar charts:
A bar chart is a type of graph used to compare different categories of data by displaying bars of different lengths. The length of each bar is proportional to the value or frequency of the data it represents.
Bar charts are often used to represent discrete or categorical data, such as the number of students in
each grade level or the sales figures of different products.
Bar charts can be displayed horizontally or vertically, and the bars can be arranged in different orders,
such as alphabetically or by size. They can also be grouped or stacked to display multiple sets of data.
One of the advantages of bar charts is their simplicity and ease of interpretation. They are easy to read
and understand, even for non-experts. Bar charts can also be customized to highlight specific data
points or to improve the visual appeal.
Bar charts can be created using various software tools, such as Excel, Tableau, and Python libraries like Matplotlib and Seaborn. These tools provide different options for customization, such as color schemes, labels, and annotations, to create effective and visually appealing bar charts.
schemes, labels, and annotations, to create effective and visually appealing bar charts.
Bar charts reveal the relationships and comparisons between different categories of data. The bars in
a bar chart represent the value or frequency of each category, and their lengths or heights are proportional to those values or frequencies. By looking at the bar chart, we can easily compare the sizes of
the bars and see which categories have the highest or lowest values.
Bar charts are particularly useful for showing patterns and trends in categorical data. They can be
used to answer questions such as:
1. Which category has the highest value or frequency?
2. Are there any significant differences between the categories?
3. Are there any trends or patterns in the data over time or across different groups?
4. Are there any outliers or unusual data points in the data?
Overall, bar charts are a simple and effective way to visualize categorical data and can be used to
communicate insights and findings to a wide range of audiences.
Here’s an example dataset:
Month	Sales
Jan	500
Feb	750
Mar	1000
Apr	600
May	900
Jun	1200
To create a bar chart in Excel, follow these steps:
1. Enter your data into an Excel spreadsheet, with each column representing a different category or
variable, and each row representing a different observation or data point.
2. Select the data you want to include in your chart.
3. Click the “Insert” tab in the top menu bar.
4. Click the “Column” button in the “Charts” section of the menu bar.
5. Select the type of column chart you want to use (e.g., 2D column, 3D column, stacked column,
clustered column, etc.)
6. Excel will automatically generate a chart based on the data you selected. You can customize the
chart by adding a title, axis labels, data labels, and other features.
7. Once you’re happy with your chart, you can save it as an image or embed it directly into your Excel
spreadsheet.
That’s it! With these steps, you can easily create a bar chart in Excel to visualize your data.
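Excel draws the chart for you, but the underlying idea, a bar whose length is proportional to each category's value, can be sketched in a few lines of Python using the sample data above. This text-only rendering is an illustration of the principle, not a substitute for Excel's charts.

```python
# Sketch of a bar chart: one text bar per month, length proportional to sales.
sales = {"Jan": 500, "Feb": 750, "Mar": 1000, "Apr": 600, "May": 900, "Jun": 1200}

scale = 100  # one '#' block per 100 units of sales
bars = {month: "#" * (value // scale) for month, value in sales.items()}

for month, bar in bars.items():
    print(f"{month:<4}{bar} ({sales[month]})")
```

Reading the output, the longest bar (Jun) and the shortest (Jan) are immediately apparent, which is exactly the comparison a bar chart is meant to support.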
Image showing the data that needs to be visualized as a bar chart being selected. Next, the Insert tab is selected, which exposes the Recommended Charts tab. On clicking it, the user is provided with a choice of the possible graph formats available.
Image showing Bar graph generated
Image showing the insert chart dialog box showing various types of graphs available
Image showing horizontal bar chart for the same data
Line Graphs:
Line graphs are a type of chart that displays data as a series of points, connected by straight lines.
They are commonly used to show trends or changes in data over time or across different categories.
The horizontal axis of a line graph represents the independent variable, while the vertical axis represents the dependent variable. Each point on the graph represents a specific data value, and the line
connecting the points helps to visualize the overall pattern of the data.
Line graphs can be useful for analyzing a wide range of data, such as stock prices, temperature
changes, sales figures, and population growth. They are also frequently used in scientific research to
display experimental results or to track changes in variables over time.
Line graphs are used in various fields, including science, finance, economics, social sciences, and
business, to represent data visually and identify trends and patterns. Here are some specific examples
of where line graphs are commonly used:
1. Stock market: Line graphs are used to track the performance of individual stocks, indices, or mutual funds over time.
2. Weather and climate: Line graphs are used to show temperature changes, precipitation levels, and
other weather patterns over time.
3. Sales and marketing: Line graphs are used to track sales figures, market share, and other marketing
metrics over time.
4. Scientific research: Line graphs are used to display experimental results, track changes in variables
over time, and visualize trends in data.
5. Population trends: Line graphs are used to track population growth, birth rates, death rates, and
other demographic trends over time.
6. Education: Line graphs are used to track student performance, analyze test scores, and monitor
academic progress over time.
Overall, line graphs are a versatile tool for visualizing data and identifying patterns that can be useful
in making decisions or gaining insights in various fields.
Here is a sample dataset for creating a line graph in Excel:
Month  Sales
Jan    100
Feb    150
Mar    200
Apr    300
May    250
Jun    400
Jul    450
Aug    500
Sep    550
Oct    600
Nov    700
Dec    800

Prof. Dr Balasubramanian Thiagarajan MS D.L.O.
To create a line graph in Excel using this data, follow these steps:
1. Open a new or existing Excel spreadsheet and enter the data into two columns, with the labels in
the first row.
2. Select the two columns of data.
3. Click on the “Insert” tab in the top menu and select the “Line” chart type from the charts group.
4. Choose a sub-type of line chart that best suits your needs, such as a basic line chart with markers
or a stacked line chart.
5. The line chart will appear on the Excel sheet. You can customize the chart by adding titles, adjusting the axis scales, and changing the colors or styles of the lines.
6. Save the chart by clicking on the “Save” button or by copying and pasting it into another document
or presentation.
That’s it! You have successfully created a line graph using Excel.
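For readers who want to cross-check the chart outside Excel, the rising and falling segments of the line correspond to the month-over-month changes in the data. Here is a minimal Python sketch (an illustrative alternative, not part of the Excel workflow) that computes those changes for the sample Month/Sales data:

```python
# Month-over-month changes for the sample Month/Sales data.
# A line graph's rising and falling segments correspond to the
# sign and size of these deltas.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
sales = [100, 150, 200, 300, 250, 400, 450, 500, 550, 600, 700, 800]

# Difference between each consecutive pair of months.
deltas = [b - a for a, b in zip(sales, sales[1:])]

for month, delta in zip(months[1:], deltas):
    direction = "up" if delta > 0 else "down"
    print(f"{month}: {direction} {abs(delta)}")
```

Note that every delta except May's is positive, which is why the sample line graph climbs steadily with a single dip.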
Image showing the result of clicking on the Insert tab
Mastering Statistical Analysis with Excel
Image showing Line graph created for the sample data set
ScatterPlots:
Scatterplots are a type of graph used in statistics to display the relationship between two variables. In
a scatterplot, each observation is represented by a point, with one variable plotted on the x-axis and
the other variable plotted on the y-axis.
The position of each point on the scatterplot represents the values of the two variables for that observation. The pattern of the points on the scatterplot can reveal whether there is a relationship between
the two variables and what type of relationship it is.
For example, if the points on the scatterplot are clustered closely together in a straight line, this
suggests a strong linear relationship between the two variables. If the points are scattered in a more
random pattern, with no clear trend, this suggests that there is no significant relationship between
the variables.
Scatterplots can be used to identify outliers, patterns in data, and to make predictions based on the
relationship between the variables. They are commonly used in data analysis, scientific research, and
business applications.
Scatterplots are commonly used in statistics to visualize the relationship between two quantitative
variables. They are a powerful tool for exploring patterns and trends in data and identifying potential
outliers or influential observations.
Some specific applications of scatterplots in statistics include:
1. Correlation analysis: Scatterplots are used to assess the strength and direction of the relationship
between two variables. A scatterplot can reveal whether the relationship is linear or nonlinear, and
can help to identify potential outliers that may affect the correlation coefficient.
2. Regression analysis: Scatterplots are often used to visualize the relationship between the predictor
variable(s) and the response variable in a regression analysis. A scatterplot can reveal whether the
relationship is roughly linear, and whether there are any non-linear trends or outliers that may affect
the regression model.
3. Cluster analysis: Scatterplots can be used to visualize clusters of data points that may indicate different groups or sub-populations in the data. This can help to identify potential variables or factors
that may be driving the clustering pattern.
4. Time series analysis: Scatterplots can be used to visualize the relationship between two time series
variables, such as stock prices or weather patterns over time. This can help to identify potential
trends or cycles in the data.
Overall, scatterplots are a useful tool for exploratory data analysis and hypothesis generation, and
can provide valuable insights into the underlying relationships between variables in a dataset.
Here’s an example dataset that can be used to create a scatterplot:
X  Y
1  3
2  5
3  7
4  8
5  11
6  13
7  15
To create a scatterplot in Excel, you can follow these steps:
1. Open a new or existing Excel workbook, and enter the X and Y data into separate columns.
2. Select the two columns of data by clicking and dragging over the cells, or by clicking on the column letters at the top of the screen.
3. Click on the “Insert” tab at the top of the screen, and then click on the “Scatter” chart icon in the
Charts group. You can choose any type of scatter chart, but for this example, we’ll use a basic scatter
chart with markers only.
4. Excel will create a new chart object on the current worksheet. You can customize the chart by adding a title, adjusting the axes, changing the chart style, etc.
5. To add a trendline to the scatterplot, right-click on one of the data points in the chart, and then
select “Add Trendline” from the context menu. In the “Format Trendline” pane, you can choose the
type of trendline (linear, exponential, polynomial, etc.) and customize the line style and color.
6. Once you’ve customized the scatterplot to your liking, you can save the chart as an image or copy
and paste it into another document or presentation.
That’s it! With these simple steps, you can create a scatterplot in Excel to visualize the relationship
between two variables in your data.
Image showing Scatterplot icon listed under Insert tab
Image showing the type of scatterplot chosen from the various available types
Image showing Scatterplot created
Image showing the right-click menu, containing the Add Trendline option, that appears when the pointer is placed over a data point and right-clicked
Image showing Trendline dialog box where Linear trendline is selected
Image showing scatterplot with trendline created for the sample data
Piecharts:
Pie charts are a type of data visualization tool that are commonly used in statistics to show how
different parts make up a whole. In a pie chart, the whole is represented by a circle, and the different
parts are represented by slices of the circle that are proportional in size to the values they represent.
Pie charts are particularly useful when you want to compare the relative sizes of different categories
or subgroups within a dataset. They can be used to show how much of a total value is contributed by
each category or subgroup, and can help to highlight the most important or dominant categories.
Pie charts are also useful for displaying data that is categorical or qualitative in nature, such as survey
responses, product sales by category, or demographic information. They can be easily created using
software programs like Excel, Google Sheets, or Tableau, and can be customized with different colors,
labels, and legends to make them more informative and visually appealing.
However, pie charts also have some limitations. Because they rely on the size of the pie slices to
represent the data, it can be difficult to accurately compare the sizes of individual slices or to read off
precise values from the chart. Pie charts also become less effective when there are too many categories or when the differences between categories are small, as the slices can become too small to
accurately represent the data. In these cases, other data visualization tools like bar charts or stacked
bar charts may be more effective.
Pie charts are commonly used in statistics to display proportions or percentages of categorical data.
They are a useful tool for presenting information about how a set of data is divided into different
categories, and for comparing the relative sizes of those categories.
Some specific applications of pie charts in statistics include:
1. Market share analysis: Pie charts can be used to show the market share of different companies or
brands in a particular industry or market. This can help to identify which companies or brands are
dominating the market, and how much of the market is being captured by each.
2. Survey data analysis: Pie charts are often used to present the results of survey questions that ask
respondents to choose from a set of predefined categories. For example, a pie chart might be used to
show the distribution of survey respondents’ age groups, or the percentage of respondents who chose
each option for a multiple-choice question.
3. Budget analysis: Pie charts can be used to show how a budget is divided up into different categories of spending. This can help to identify areas of high or low spending, and to visualize how the
budget is being allocated across different areas.
4. Demographic analysis: Pie charts can be used to display the distribution of demographic data, such
as age, gender, or ethnicity. This can help to identify patterns or trends in the data, and to compare
the relative sizes of different demographic groups.
Overall, pie charts are a useful tool for presenting data about categorical variables, and can be effective in communicating the relative sizes of different categories to a wide audience.
Here is a sample dataset that you can use to create a pie chart:
Category    Sales
Category A  500
Category B  300
Category C  200
To create a pie chart in Excel, follow these steps:
1. Enter your data into an Excel worksheet, making sure that the data is organized in columns or
rows.
2. Select the cells containing the data you want to use for the pie chart.
3. Click the “Insert” tab on the Ribbon at the top of the Excel window.
4. In the Charts section, click on the “Pie” chart icon. This will open a dropdown menu with various
types of pie charts to choose from.
5. Select the type of pie chart that you want to create.
6. Excel will create a default pie chart for you, but you can customize it to fit your needs. You can
change the colors, fonts, and other formatting options by using the chart tools that appear when you
click on the chart.
7. You can also add titles, axis labels, and other chart elements by using the options in the “Chart
Elements” section of the Ribbon.
When you’re satisfied with your chart, save your Excel worksheet so that you can access it later.
That’s it! With these simple steps, you can easily create a pie chart in Excel to visualize your data.
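Behind the chart, each pie slice is simply the category’s share of the total, expressed as a fraction of the 360° circle. The following Python sketch (an illustration of the arithmetic, not an Excel feature) computes the percentages and slice angles for the sample Category/Sales data:

```python
# Slice percentages and angles for the sample Category/Sales data.
# Each slice's angle is its share of the 360-degree circle.
categories = ["Category A", "Category B", "Category C"]
sales = [500, 300, 200]

total = sum(sales)
shares = [value / total for value in sales]   # fractions of the whole
angles = [share * 360 for share in shares]    # degrees per slice

for cat, share, angle in zip(categories, shares, angles):
    print(f"{cat}: {share:.0%} of sales, {angle:.0f} degree slice")
```

For this dataset the slices are 50%, 30%, and 20% of the pie, i.e. 180°, 108°, and 72°.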
Image showing Pie chart created for the sample data
Types of pie chart:
There are several types of pie charts that you can create in Excel or other charting software, including:
1. Basic pie chart: This is the most common type of pie chart, which shows the proportional relationship between different categories.
2. Exploded pie chart: This type of chart “explodes” one or more segments from the rest of the chart,
to emphasize the differences between them.
3. Doughnut chart: This type of chart is similar to a basic pie chart, but with a hole in the center. It is
useful when you have multiple sets of data to compare.
4. 3D pie chart: This type of chart adds a third dimension to the chart, making it appear more visually appealing. However, this can sometimes make it harder to accurately interpret the data.
5. Stacked pie chart: This type of chart stacks multiple pie charts on top of one another, to show how
each category is divided into subcategories.
These are some of the most common types of pie charts, but there may be other variations as well
depending on the charting software you are using.
Exploded piechart:
An exploded pie chart is a type of chart that displays data as a circular graph, with each slice representing a portion of the whole. In an exploded pie chart, one or more slices are separated from the
rest of the pie, giving the appearance that they have “exploded” outwards. This is done to emphasize
or highlight a particular segment of the data.
For example, if a pie chart is displaying the sales data for different products, an exploded slice could
represent the product with the highest sales or the one that the presenter wants to draw attention to.
The separation of the slice from the rest of the pie draws the viewer’s eye to that particular segment,
making it stand out more prominently than the others. However, some experts argue that exploded
pie charts can make it difficult to accurately compare the sizes of the different slices, and therefore,
they should be used with caution.
An exploded pie chart can be used to highlight or emphasize a particular segment of a data set, by visually separating it from the rest of the pie chart. This type of chart can be useful in situations where
there are multiple data sets that need to be compared or analyzed, and where one or more of these
data sets are significantly larger or smaller than the others.
Here are some common indications for using an exploded pie chart:
1. Emphasize a particular segment: An exploded pie chart can be used to draw attention to a particular segment of a data set, such as a particularly large or small portion of the whole.
2. Highlight differences: An exploded pie chart can help highlight the differences between data sets
by visually separating them.
3. Show proportions: A pie chart can be useful for showing the proportion of each data set, and an
exploded pie chart can make this even clearer by separating out each segment.
4. Improve readability: In some cases, an exploded pie chart can improve the readability of a data set
by making it easier to distinguish between the different segments.
However, it’s worth noting that some experts advise against using exploded pie charts, as they can
sometimes distort the data and make it harder to accurately interpret. As with any type of chart, it’s
important to carefully consider the purpose and audience before deciding whether an exploded pie
chart is the best option.
Here is a sample data set that we can use to create an exploded pie chart:
Category  Value
A         30
B         20
C         10
D         40
To create an exploded pie chart in Excel, follow these steps:
1. Enter the data into an Excel spreadsheet. In our example, the Category column should be in Column A, and the Value column should be in Column B.
2. Select the data range by clicking on the first cell in the Category column, and dragging down to the
last cell in the Value column.
3. Go to the Insert tab in the Excel ribbon, and click on the Pie Chart icon.
4. Select the “3-D Pie” chart type, and choose the style of your choice.
5. Once the chart has been created, click on the chart to select it.
6. Right-click on one of the slices in the chart and choose “Format Data Series” from the menu.
7. In the “Series Options” section, increase the “Explosion” value to separate the slice from the rest of
the chart.
8. Repeat the previous step for any other slices that you want to separate from the chart.
9. You can also format the chart further by adding titles, labels, or changing the colors of the slices.
10. Once you’re satisfied with the chart, you can save it or copy it to use in your presentation or report.
That’s it! By following these steps, you can easily create an exploded pie chart in Excel to showcase
your data.
Image showing the 3D pie chart menu, which can be used to insert a 3D pie chart
Image showing Pie chart created
Image showing the right-click submenu from which Format Data Series is selected
Image showing Pie explosion set at 21%.
Doughnut chart:
A doughnut (or donut) chart is similar to a pie chart, but with a hole in the center. Like a pie chart, it is used to show the proportion of different categories in a data set. However, the hole in the center allows additional information or context to be displayed, such as a total value or a percentage.
Doughnut charts are often used when multiple data sets need to be compared, as they can display multiple series of data within a single chart. They can also be used to show how different segments relate to the whole, by displaying the total value in the center of the chart.
To create a doughnut chart in Excel, you can follow these steps:
1. Enter your data into an Excel spreadsheet. Make sure that the data is organized into categories and
values.
2. Select the data range by clicking on the first cell in the Category column, and dragging down to the
last cell in the Value column.
3. Go to the Insert tab in the Excel ribbon, and click on the Pie Chart icon.
4. Select the “Doughnut” chart type, and choose the style of your choice.
5. Once the chart has been created, you can customize it further by adding titles, labels, or changing
the colors of the slices.
6. You can also format the chart to display additional information in the center, such as the total
value or a percentage.
7. Once you’re satisfied with the chart, you can save it or copy it to use in your presentation or report.
Overall, a doughnut chart can be a useful and visually appealing way to display data, especially when there are multiple data sets that need to be compared or when context needs to be displayed alongside the data.
Here is a sample data set that we can use to create a doughnut chart in Excel:

Category  Value
A         30
B         20
C         10
D         40
Image showing the Data columns selected and the submenu under Piechart containing Doughnut
chart selected
Image showing Doughnut chart created
3D Pie charts:
A 3D pie chart is a type of data visualization that represents data in a circular chart divided into slices to illustrate numerical proportions. The chart is called 3D because it appears to have three dimensions and provides a more realistic view of the data than a 2D pie chart.
In a 3D pie chart, each slice represents a portion of the whole, and the size of each slice is proportional to the value it represents. The added depth creates a sense of perspective, which can make the chart more visually appealing and easier to understand.
However, it’s important to note that 3D pie charts can sometimes be more challenging to read accurately than their 2D counterparts. Some users may find the added dimensionality of a 3D pie chart
visually distracting, and the extra depth can make it harder to accurately compare the sizes of different slices. Therefore, it’s important to use these charts judiciously and consider other chart types,
depending on the context and data being presented.
Here’s an example set of data that you can use to create a 3D pie chart:
Suppose you are tracking the sources of traffic to a website. You have collected the following data for
the past month:
Organic Search: 40%
Direct Traffic: 25%
Referral Traffic: 20%
Social Media: 10%
Paid Search: 5%
You can use this data to create a 3D pie chart that shows the proportion of each traffic source. The
chart will have five slices, one for each traffic source, with each slice’s size proportional to the percentage of traffic it represents. The 3D effect will create a sense of depth and perspective in the chart,
making it more visually appealing and easier to interpret.
Image showing data entered and 3D pie chart menu chosen from the list of Pie charts
Image showing the 3D pie chart created
Stacked pie chart:
A stacked pie chart, also known as a stacked sector chart, is a type of chart that represents data using
a circular chart divided into slices to illustrate numerical proportions. In a stacked pie chart, each
slice of the chart is composed of multiple smaller slices, each representing a part of the whole.
The slices are stacked on top of each other, with each stack representing a category or subcategory of
data. The size of each slice is proportional to the value it represents, and the entire chart represents
the total data set.
The stacked pie chart is useful when you want to show how parts of a category contribute to the
whole. It allows you to see both the overall distribution of a data set and the relative contribution of
each subcategory.
However, it’s important to note that stacked pie charts can be more challenging to read and interpret
than regular pie charts, especially if there are too many slices. It’s also important to use caution when
using stacked pie charts as they can distort the data or present a misleading picture if not properly
constructed.
Here is an example set of data that you can use to create a stacked pie chart:
Suppose you are tracking the sales of a company’s product line, and you want to see how different
products contribute to the total sales. You have collected the following data for the past quarter:
Product A: $50,000
Product B: $30,000
Product C: $20,000
Product D: $15,000
Product E: $10,000
You can use this data to create a stacked pie chart that shows the contribution of each product to the
total sales. In this chart, each product will be represented by a slice of the pie, with the size of each
slice proportional to the product’s sales. The slices will be stacked on top of each other, with the largest product (in this case, Product A) on the bottom and the smallest product (Product E) on the top.
The resulting chart will show the total sales, as well as the relative contribution of each product to the
overall total. It can be a useful way to visualize how a variety of factors contribute to a larger data set.
Here are the steps to create a stacked pie chart using Excel:
1. Enter your data: Open Microsoft Excel and enter your data in a spreadsheet. Each row should
represent a category or subcategory, and each column should represent a data point. For example, the
first row could be your product names, and the second row could be the sales data for each product.
2. Select your data: Click and drag to select the data that you want to include in your chart.
3. Insert the chart: Go to the “Insert” tab on the Excel ribbon and click the Pie Chart icon. Excel does not offer a dedicated stacked pie subtype; the closest equivalents are the “Doughnut” subtype, which stacks each data series as a concentric ring, and the “Pie of Pie” subtype, which breaks part of the data out into a secondary pie.
4. Customize your chart: After inserting the chart, you can customize it by right-clicking on the chart
and selecting “Format Chart Area” or “Format Data Series” from the dropdown menu. This will
allow you to change the chart’s appearance and behavior.
5. Add labels and titles: You can add titles and labels to your chart by clicking on the chart and going
to the “Chart Design” tab on the Excel ribbon. Here, you can add a chart title, data labels, and other
chart elements.
6. Save and share: Once you have customized your chart to your liking, save it and share it with others by either printing it or sharing it electronically.
Following these steps will allow you to create a stacked pie chart in Excel to visualize your data.
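Whatever subtype you choose, the quantities being displayed are each product’s share of total sales and the running (stacked) total, which is where each stacked segment ends. This Python sketch computes both for the sample product data:

```python
# Each product's share of total sales, plus the running (stacked)
# total, which is where each stacked segment ends.
products = ["Product A", "Product B", "Product C", "Product D", "Product E"]
sales = [50_000, 30_000, 20_000, 15_000, 10_000]

total = sum(sales)
shares = [s * 100 / total for s in sales]   # percent of the whole

# Running totals: the cumulative boundary of each stacked segment.
cumulative = []
running = 0.0
for share in shares:
    running += share
    cumulative.append(running)

for name, share, cum in zip(products, shares, cumulative):
    print(f"{name}: {share:.0f}% (stacked up to {cum:.0f}%)")
```

For this quarter, Product A alone contributes 40% of the $125,000 total, and the five segments stack to 100%.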
Heat Maps:
Heat maps are a data visualization technique that represents data using colors to show the relative
intensity or density of values in a matrix or table. Heat maps are commonly used in fields such as
statistics, data science, and business intelligence to visualize complex data sets and patterns.
A typical heat map displays a matrix of values, where each cell in the matrix is represented by a colored square. The color of each square corresponds to the magnitude or frequency of the value in that
cell, with darker colors indicating higher values or more frequent occurrences.
Heat maps are particularly useful for identifying patterns and trends in large data sets, especially
those with many variables or dimensions. They allow the viewer to quickly identify areas of high or
low activity and can reveal hidden relationships or correlations in the data.
Heat maps can also be interactive, allowing users to drill down into the data and explore specific
areas of interest. They can be created using a variety of software tools and programming languages,
including Excel, R, Python, and Tableau.
Here are some advantages of using a heat map as a data visualization technique:
1. Easy to interpret: Heat maps use color to represent data, making it easy to quickly identify patterns
and trends in large data sets.
2. Reveals hidden insights: Heat maps can reveal correlations and relationships that may be difficult
to see using other data visualization techniques.
3. Efficient use of space: Heat maps can represent large amounts of data in a relatively small space,
making it possible to see an overview of complex data sets at a glance.
4. Customizable: Heat maps can be customized to show different variables and dimensions, making it
possible to explore different aspects of the data and highlight specific areas of interest.
5. Interactive: Heat maps can be interactive, allowing users to drill down into the data and explore
specific areas of interest.
6. Applicable to a variety of data types: Heat maps can be used to visualize a wide variety of data
types, including numerical, categorical, and textual data.
Overall, heat maps are a powerful data visualization tool that can help to identify hidden patterns
and trends in complex data sets. They are easy to interpret, efficient in terms of space usage, and
highly customizable, making them a popular choice in many fields, including data science, business
intelligence, and finance.
Here’s a sample data set that can be used to generate a heat map:
Suppose you are analyzing the performance of a company’s sales team over the past year. You have
collected the following data for each salesperson:
Salesperson name
Number of calls made
Number of emails sent
Number of meetings attended
Number of deals closed
You can use this data to generate a heat map that shows the relative performance of each salesperson
across these different categories.
Here are the steps to create a heat map using Excel:
1. Enter your data: Open Microsoft Excel and enter your data in a spreadsheet. Each row should represent a salesperson, and each column should represent a category (e.g. calls made, emails sent, etc.).
The data in each cell should represent the number of items in that category for each salesperson.
2. Select your data: Click and drag to select the data that you want to include in your chart.
3. Apply the color scale: Excel has no built-in “Heat Map” chart type; heat maps are instead created with conditional formatting, as the example images below show. With the data still selected, go to the “Home” tab on the Excel ribbon, click “Conditional Formatting”, and choose a scheme from the “Color Scales” submenu.
4. Customize the formatting: To fine-tune the colors, open “Conditional Formatting” > “Manage Rules”, select the color scale rule, and click “Edit Rule”. Here you can change the colors assigned to the minimum, midpoint, and maximum values.
5. Add labels and titles: Because the heat map lives in the worksheet cells themselves, label it with ordinary row and column headings, and type a title in a cell above the table.
6. Save and share: Once you have customized your chart to your liking, save it and share it with others by either printing it or sharing it electronically.
Following these steps will allow you to create a heat map in Excel to visualize your data. The resulting
chart will show the relative performance of each salesperson across the different categories, allowing
you to quickly identify areas of strength and weakness.
Image showing heat map generated using conditional formatting
Image showing the heat map generated
Tree Maps:
Treemaps are a type of data visualization technique that display hierarchical data in a rectangular layout. The layout is composed of rectangles, where the size of each rectangle corresponds to the magnitude of a certain value or metric associated with the data. The rectangles are arranged in a way that
reflects the hierarchical structure of the data, with smaller rectangles nested within larger rectangles.
Treemaps can be used to represent a variety of data, such as file sizes on a hard drive, market share
of different companies, or the budget breakdown of a government agency. They provide a way to
quickly see the relative size of different categories within a hierarchy, and can also be used to identify
patterns or outliers within the data.
There are different algorithms and implementations of treemaps, such as the squarified algorithm,
the slice-and-dice algorithm, and the binary tree algorithm. Each algorithm has its own strengths
and weaknesses, and the choice of which one to use depends on the specific data and the goals of the
visualization.
Treemaps are used in a variety of scenarios where hierarchical data needs to be visualized and analyzed. Here are a few examples:
1. File systems: Treemaps can be used to visualize the file system of a computer, with the size of each
rectangle representing the size of a file or folder. This can help users identify large files or folders that
are taking up too much space on their hard drive.
2. Market share: Treemaps can be used to display the market share of different companies in a particular industry. The size of each rectangle represents the percentage of the market held by a particular
company, and the color of the rectangle can be used to distinguish between different companies.
3. Budget breakdown: Treemaps can be used to visualize the budget breakdown of a government
agency or a business. The size of each rectangle represents the amount of money allocated to a particular program or department, and the color can be used to indicate whether the program is over or
under budget.
4. Website traffic: Treemaps can be used to visualize website traffic, with the size of each rectangle
representing the number of visitors to a particular section of the website. This can help website owners identify which pages are most popular and which ones need improvement.
5. Organizational structure: Treemaps can be used to visualize the organizational structure of a
company or a government agency, with each rectangle representing a department or a team. This can
help managers identify areas of the organization that are over or understaffed, and make adjustments
accordingly.
Here’s a sample dataset you can use to create a treemap:
Category  Subcategory  Value
A         A1           100
A         A2           200
A         A3           300
B         B1           150
B         B2           250
C         C1           50
C         C2           75
C         C3           100
To create a treemap in Excel, follow these steps:
1. Open your Excel workbook and select the data you want to use for the treemap.
2. Click on the Insert tab in the top menu.
3. Click on the Treemap option in the Charts group.
4. Excel will automatically create a treemap chart based on your data.
5. Customize the chart as needed by adding titles, changing the color scheme, or adjusting the size
and position of the chart.
To add labels to the treemap:
1. Select the chart and click on the Chart Design tab in the top menu.
2. Click Add Chart Element, then Data Labels.
3. Choose whether you want to show the labels for the category, subcategory, or value.
4. Adjust the font size and position of the labels as needed.
That’s it! Your treemap chart is now complete and can be used to analyze and visualize your hierarchical data.
Image showing data entered and Treemap chosen from the menu
Image showing Treemap created
Geographic maps:
Geographic maps in statistics are visual representations of data that are associated with specific geographic locations. They use various symbols, colors, and shading techniques to represent statistical
data related to a particular region or location.
These maps can be used to display various types of data, such as population density, economic
activity, weather patterns, or crime rates, among others. They are a powerful tool for analyzing data
and identifying patterns, trends, and relationships between different variables across geographical
regions.
Geographic maps in statistics are often used in fields such as economics, public health, urban planning, and environmental studies. They can be created using a variety of software programs, including
Geographic Information Systems (GIS), which are specialized software designed for mapping and
analyzing spatial data.
There are many scenarios in which geographic maps are used. Here are a few examples:
1. Public Health: Health officials use geographic maps to track the spread of diseases and identify
areas where outbreaks are occurring. They can also use maps to monitor health trends in different
regions and identify areas where specific health interventions may be needed.
2. Marketing: Companies can use geographic maps to target specific areas for marketing campaigns.
By analyzing demographic data, they can identify areas with high concentrations of potential customers and create targeted advertising campaigns.
3. Urban Planning: City planners use geographic maps to analyze land use patterns, traffic flow, and
population density. This information can be used to plan new developments and transportation
infrastructure.
4. Environmental Studies: Environmental scientists use geographic maps to analyze patterns of air
and water pollution, habitat destruction, and climate change. They can also use maps to identify
areas with high concentrations of endangered species and plan conservation efforts.
5. Emergency Management: During natural disasters, emergency responders use geographic maps to
coordinate response efforts and identify areas where people may be in need of assistance. They can
also use maps to track the movement of the disaster and identify areas where it may cause the most
damage.
Overall, geographic maps are a powerful tool for analyzing data and understanding how different
variables are distributed across different regions. They can help us identify patterns and trends that
may not be visible in other types of data visualization.
Here’s some sample data that you can use to create a geographic map in Excel:
Country  Sales
USA      100
Canada   50
Mexico   75
Brazil   25
UK       150
France   75
Germany  100
Italy    50
Spain    75
To create a geographic map using Excel, follow these steps:
1. Select the data you want to use for your map, including any headers. In this example, select both
columns of data, including the headers.
2. Click on the “Insert” tab at the top of the Excel window.
3. In the “Charts” section, click on the “Maps” dropdown menu.
4. Select the type of map you want to create. In this example, select “Filled Map.”
5. Excel will automatically generate a map based on your data. You can customize the map by using
the formatting options in the “Chart Design” and “Format” tabs.
6. You can also add additional data to the map by using the “Map Labels” dropdown menu and selecting the data you want to display.
7. Once you are satisfied with your map, you can save it as an image or embed it into a document or
presentation.
That’s it! With these steps, you can easily create a geographic map in Excel using your data.
The user will have to wait a few seconds for the map to be generated, and an Internet connection is required for this to happen.
Image showing sales data for various countries entered and Map is chosen from the insert tab
Image showing Map generated
Bubble charts:
Bubble charts are a type of data visualization that displays data points as circles, or “bubbles,” on a
two-dimensional graph. The size of each bubble represents a third variable, such as the magnitude or
frequency of a particular data point.
Bubble charts are similar to scatter plots, which also use x and y coordinates to plot data points.
However, a scatter plot displays only those two variables; a bubble chart adds a third variable,
encoded as the size of each marker.
Bubble charts can be useful for identifying patterns or trends in data, especially when multiple
variables are involved. They are often used in finance, economics, and social sciences to display data
related to market trends, population statistics, and other types of data with multiple variables.
Bubble charts can be used in various scenarios to display data that involves multiple variables. Here
are some common scenarios where bubble charts can be used:
1. Market Analysis: Bubble charts are often used in financial analysis to display the relationship between stock prices, market capitalization, and trading volume. They can help investors and analysts
identify trends and opportunities in the stock market.
2. Population Data: Bubble charts can also be used to display population statistics, such as the relationship between a country’s population, GDP, and life expectancy. These charts can help policymakers and researchers identify trends and patterns in demographic data.
3. Science and Engineering: Bubble charts can be used in science and engineering to display data
related to experiments or simulations. For example, a bubble chart could display the relationship
between temperature, pressure, and reaction rate in a chemical reaction.
4. Marketing and Advertising: Bubble charts can be used to display data related to consumer behavior, such as the relationship between product price, customer satisfaction, and brand loyalty. These
charts can help marketers and advertisers make data-driven decisions about pricing, branding, and
advertising campaigns.
5. Sports Analytics: Bubble charts can be used in sports analytics to display data related to player
performance, such as the relationship between player statistics, playing time, and team success. These
charts can help coaches and analysts identify patterns and make decisions about player usage and
game strategy.
These are just a few examples of the many scenarios where bubble charts can be used. Essentially,
bubble charts can be used to display any type of data that involves multiple variables and can help
identify patterns and relationships.
Here’s some sample data that you can use to create a bubble chart in Excel:
Product  Price ($)  Sales (units)  Advertising Cost ($)
A        10         500            1000
B        20         250            2000
C        15         750            1500
D        25         1000           2500
E        5          2000           500
To create a bubble chart in Excel, follow these steps:
1. Select the data you want to use for your chart, including any headers. In this example, select all
columns of data, including the headers.
2. Click on the “Insert” tab at the top of the Excel window.
3. In the “Charts” section, click on the “Bubble” dropdown menu.
4. Select the type of bubble chart you want to create. In this example, select “Bubble with 3-D Effect.”
5. Excel will automatically generate a bubble chart based on your data. By default, the chart will use
the first two columns of data for the x and y axis, and the third column of data for the size of the bubbles.
6. You can customize the chart by using the formatting options in the “Chart Design” and “Format”
tabs. For example, you can change the color and shape of the bubbles, add a title or legend, and adjust the axis labels and gridlines.
7. You can also add additional data to the chart by using the “Add Chart Element” dropdown menu
and selecting the data you want to display.
8. Once you are satisfied with your chart, you can save it as an image or embed it into a document or
presentation.
That’s it! With these steps, you can easily create a bubble chart in Excel using your data.
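Excel sizes the bubbles for you, but one detail is worth understanding if you ever build a bubble chart programmatically: a bubble's area, not its radius, should be proportional to the third variable, otherwise large values look visually exaggerated. Here is a minimal Python sketch of that scaling, using the Sales column from the sample data above; the 30-point maximum radius is an arbitrary choice.

```python
import math

def bubble_radii(values, max_radius=30.0):
    """Scale bubble radii so that bubble AREA, not radius, is proportional
    to the value. Since area grows with radius squared, we take a square
    root: radius = max_radius * sqrt(value / max(values))."""
    peak = max(values)
    return [max_radius * math.sqrt(v / peak) for v in values]

sales = [500, 250, 750, 1000, 2000]   # units sold, from the sample data
radii = bubble_radii(sales)
```

With this scaling, a product with a quarter of the top seller's sales gets a bubble with a quarter of its area (half its radius), which matches how readers perceive circle sizes.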
Image showing data entered and Bubble chart is chosen from the list of chart under scatter
Image showing Bubble chart created
28
Data Mining

Data mining is the process of discovering patterns, trends, and insights in large datasets using statistical and computational methods. It involves using software tools and techniques to analyze and
extract meaningful information from large volumes of data.

Data mining can be used in various fields, such as marketing, finance, healthcare, and scientific research. It can help organizations identify hidden patterns and relationships in their data, which can be
used to make data-driven decisions and improve business outcomes.
Some common techniques used in data mining include clustering, classification, regression analysis,
association rule mining, and decision trees. These techniques involve using algorithms to analyze large
datasets and identify patterns and trends that can be used to make predictions or inform business
decisions.
Data mining can be used to address a wide range of business problems, such as customer segmentation, fraud detection, risk assessment, and product recommendation. It is an important tool in modern data analysis and has become increasingly popular as the amount of data generated by organizations continues to grow.
Data mining is important for several reasons:
1. Improved Decision Making: By using data mining techniques, organizations can analyze large datasets and uncover hidden patterns and relationships that can be used to make better decisions. This can
lead to improved efficiency, increased profitability, and better customer satisfaction.
2. Cost Savings: Data mining can help organizations identify areas where costs can be reduced or
eliminated. For example, by analyzing customer data, organizations can identify customer segments
that are not profitable and focus their marketing efforts on more profitable segments.
3. Competitive Advantage: Organizations that are able to use data mining effectively can gain a competitive advantage over their competitors. By identifying trends and patterns in their data, organizations can make strategic decisions that give them an edge in the marketplace.
4. Improved Customer Satisfaction: By analyzing customer data, organizations can gain insights into
customer behavior and preferences. This information can be used to develop more personalized marketing campaigns, improve product offerings, and provide better customer service.
5. Fraud Detection: Data mining can be used to identify fraudulent activities, such as credit card fraud
or insurance fraud. By analyzing large datasets, organizations can identify patterns and anomalies that
indicate fraudulent behavior.
Overall, data mining is an important tool for organizations looking to improve their decision-making,
reduce costs, and gain a competitive advantage in the marketplace. It allows organizations to extract
valuable insights from their data and make data-driven decisions that can lead to improved business
outcomes.
Excel can be used for some basic data mining tasks. Here are some steps to get started with data mining in Excel:
1. Import Data: The first step is to import your data into Excel. You can do this by going to the Data
tab and selecting the “From Text/CSV” or “From Excel” option. You can also copy and paste data
directly into Excel.
2. Clean and Transform Data: Before you can analyze your data, you may need to clean and transform
it. Excel has several tools that can help you do this, such as the Text to Columns tool and the Remove
Duplicates tool. You can also use formulas and functions to clean and transform your data.
3. Explore Data: Once your data is clean and transformed, you can start exploring it to identify patterns and trends. Excel has several tools for data exploration, such as PivotTables and PivotCharts.
These tools allow you to summarize and visualize your data in different ways.
4. Apply Data Mining Techniques: Out of the box, Excel supports techniques such as regression
analysis through the Analysis ToolPak, and PivotTables and charts can approximate simple grouping
and classification. A dedicated Data Mining tab with clustering and classification tools was provided
by Microsoft's SQL Server Data Mining Add-ins, which are no longer maintained; third-party add-ins
offer similar functionality.
5. Evaluate Results: After you have applied data mining techniques to your data, evaluate the results
to determine their accuracy and usefulness, for example by building a confusion matrix or plotting an
ROC curve with formulas and charts.
It’s important to note that while Excel can be useful for basic data mining tasks, it may not be sufficient for more complex tasks that require more powerful data mining tools and algorithms. In those
cases, you may need to use specialized data mining software or programming languages like Python
or R.
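As a point of comparison with the Excel workflow above, the import, clean, and explore steps can be sketched in a few lines of Python. The rows below are invented for illustration; the deduplication step mirrors Excel's Remove Duplicates tool, and the grouped average mirrors a PivotTable summary.

```python
from collections import defaultdict
from statistics import mean

# Illustrative customer records (made-up data), as imported from a sheet.
rows = [
    {"id": 1, "income": "Low",    "spend": 120},
    {"id": 2, "income": "Medium", "spend": 340},
    {"id": 2, "income": "Medium", "spend": 340},   # an exact duplicate row
    {"id": 3, "income": "High",   "spend": 560},
]

# Clean: drop exact duplicates (the Remove Duplicates step).
seen, clean = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        clean.append(r)

# Explore: average spend per income level (a PivotTable-style summary).
groups = defaultdict(list)
for r in clean:
    groups[r["income"]].append(r["spend"])
summary = {k: mean(v) for k, v in groups.items()}
```

The same clean-then-summarize pattern scales to the larger datasets where a scripting language becomes preferable to a spreadsheet.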
The data mining cycle is a process that describes the steps involved in performing data mining. It typically consists of the following stages:
1. Data Exploration: In this stage, the data is collected and pre-processed to prepare it for analysis.
This may involve cleaning the data, transforming it into a suitable format, and selecting the relevant
variables.
2. Data Preparation: In this stage, the data is prepared for analysis by selecting the appropriate techniques and algorithms, setting up the analysis environment, and performing any necessary pre-processing steps.
3. Model Building: In this stage, the data is analyzed using various data mining techniques to build a
model. The model is used to identify patterns and relationships in the data that can be used to make
predictions or inform business decisions.
4. Model Evaluation: In this stage, the model is evaluated to determine its accuracy and effectiveness.
This may involve testing the model on a subset of the data, comparing it to other models, or performing cross-validation to ensure that the model is robust and reliable.
5. Deployment: In this stage, the model is deployed and used to make predictions or inform business
decisions. This may involve integrating the model into a larger system or providing it to end-users in
a user-friendly format.
6. Monitoring and Maintenance: In this stage, the model is monitored to ensure that it continues to
perform effectively over time. This may involve updating the model or retraining it with new data to
improve its accuracy and effectiveness.
The data mining cycle is an iterative process, which means that the results of each stage may lead
to new insights or changes in the data or analysis techniques. As a result, the cycle may need to be
repeated multiple times until the desired results are achieved.
Types of data mining:
Data mining refers to the process of discovering patterns and insights from large datasets using various analytical techniques. There are several types of data mining, including:
1. Classification: It involves identifying patterns in data that can be used to categorize it into different
groups or classes.
2. Clustering: This type of data mining involves grouping similar data points together based on their
characteristics.
3. Association rule mining: This type of data mining involves identifying relationships between different variables in a dataset.
4. Regression analysis: It involves analyzing the relationship between a dependent variable and one or
more independent variables to predict future trends.
5. Anomaly detection: This type of data mining involves identifying outliers or unusual data points
that do not fit into the general pattern of the data.
6. Sequence mining: It involves analyzing sequences of events or transactions to identify patterns or
trends over time.
7. Text mining: This type of data mining involves analyzing large collections of text data, such as
emails, social media posts, and documents, to extract useful insights.
8. Web mining: It involves analyzing web data, such as web pages, links, and user behavior, to extract
useful insights.
These are some of the common types of data mining used in various fields such as business, healthcare, finance, and social media analysis.
Using Excel to classify data:
To use Excel for classification, you can follow these steps:
1. Import the data into Excel and organize it into a table with columns for each variable, including
the target variable (in this case, high-spending vs. low-spending).
2. Use the Data Analysis tool in Excel to perform the analysis. Go to the Data tab and select Data
Analysis. The built-in Analysis ToolPak offers Regression, which can serve as a simple linear-probability classifier; dedicated tools such as Logistic Regression, Naive Bayes, or Decision Trees require
third-party add-ins.
3. Configure the tool by selecting the input and output variables and adjusting any other relevant
settings, such as the confidence level.
4. Run the analysis and review the output. The results will include a prediction of whether each customer is high-spending or low-spending, as well as an assessment of the accuracy of the prediction.
5. Analyze the output and refine the model as needed. For example, you may want to adjust the input
variables or try a different classification tool to improve the accuracy of the model.
6. Use the model to make predictions on new data. Once you have a reliable classification model, you
can use it to predict the spending habits of new customers based on their age, gender, and income
level.
Overall, Excel can be a useful tool for performing classification analysis, particularly for small to medium-sized datasets. However, for larger or more complex datasets, you may want to consider using
more specialized data mining tools or programming languages, such as R or Python.
Here’s an example of sample data that you can use for classification analysis in Excel:
ID  Age  Gender  Income Level  Product 1  Product 2  Product 3  High-Spending
1   25   Male    Low           1          0          1          0
2   38   Female  Medium        0          1          1          1
3   42   Male    High          1          1          0          1
4   30   Female  Low           1          0          0          0
5   55   Male    Medium        0          1          1          1
6   47   Female  High          1          1          1          1
7   20   Male    Low           0          1          0          0
8   29   Female  Medium        1          0          1          1
9   35   Male    High          1          1          0          1
10  42   Female  Low           0          0          1          0
In this sample data, the first column represents the customer ID, and the remaining columns represent different variables, such as age, gender, income level, and product purchases. The last column,
“High-Spending,” is the target variable, which we want to classify customers as either “high-spending” (indicated by a value of 1) or “low-spending” (indicated by a value of 0).
You can import this data into Excel and organize it into a table with columns for each variable, as
described above. Then you can use Excel's Data Analysis tool to perform a classification analysis and
predict which customers are high-spending based on their age, gender, income level, and product
purchases.
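Since Excel has no built-in classifier, here is a hedged sketch of one simple alternative, a 1-nearest-neighbour rule in Python, trained on rows 1 to 8 of the sample data. The income encoding (Low=0, Medium=1, High=2) and the 10-year age scaling are illustrative choices, not part of the original example.

```python
import math

# 1-nearest-neighbour classifier: label a new customer with the class of
# the most similar training example. Features are (age, income code);
# labels are the High-Spending column (1 = high-spending, 0 = low).
train = [
    ((25, 0), 0), ((38, 1), 1), ((42, 2), 1), ((30, 0), 0),
    ((55, 1), 1), ((47, 2), 1), ((20, 0), 0), ((29, 1), 1),
]

def predict(age, income_code):
    def dist(features):
        # Scale age so one income step weighs about as much as ten years.
        return math.hypot((features[0] - age) / 10.0,
                          features[1] - income_code)
    best_features, best_label = min(train, key=lambda t: dist(t[0]))
    return best_label

print(predict(50, 2))   # an older, high-income customer
```

Nearest-neighbour methods need no training step at all, which makes them a convenient baseline before reaching for heavier tools.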
Image showing sample data entered
Image showing Quick analysis tool icon
Image showing quick analysis menu
Image showing data classification into High Low and Medium income groups using quick analysis
tool
Image showing High spenders color coded
Image showing High spending marked in Green and non high spending marked in pink
Mastering Statistical Analysis with Excel
436
Image showing Tables Menu under Quick analysis tab which can be used to generate pivot table
Image showing another way of categorizing data by clicking on the Analyze Data button. It generates
various categorization scenarios as shown, saving the user a great deal of time.
Clustering data:
Clustering in statistics refers to the process of grouping a set of data points in a way that maximizes the similarity between points in the same group and minimizes the similarity between points in
different groups.
The goal of clustering is to find patterns or structures in the data that may not be immediately apparent to the naked eye. By grouping similar data points together, clustering can help to identify natural
subgroups or clusters within a larger dataset.
There are several methods for clustering data, including hierarchical clustering, k-means clustering,
and density-based clustering. These methods use different algorithms to group data points based on
various criteria such as distance, density, or similarity.
Clustering is commonly used in various fields, such as marketing, biology, and computer science, to
analyze data, find patterns, and make predictions.
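The k-means method mentioned above is straightforward to sketch in plain Python (this is an illustration of the algorithm itself, not an Excel feature). The points are age and income pairs like those in the sample dataset later in this chapter, with income in thousands so the two axes have comparable scale.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on 2-D points: repeatedly assign each point to its
    nearest centroid, then move each centroid to its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # start from k data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]             # keep old centroid if empty
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# (age, income in thousands), mirroring the sample dataset in this chapter.
points = [(25, 35), (35, 50), (45, 65), (20, 25), (50, 80),
          (55, 90), (30, 40), (40, 60), (50, 75), (60, 100)]
centroids, clusters = kmeans(points, k=2)
```

On data this well separated, the two centroids settle near the low-age/low-income and high-age/high-income groups within a few iterations.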
Clustering is a widely used technique in various fields for analyzing data and identifying patterns.
Here are some scenarios in which clustering is commonly used:
1. Customer segmentation in marketing: Clustering is often used to group customers based on their
purchasing behavior, demographics, or other relevant characteristics. This helps companies to tailor
their marketing strategies and offers to specific customer segments.
2. Image and object recognition in computer vision: Clustering is used to group similar images or
objects together in computer vision applications, such as facial recognition or object detection.
3. Fraud detection in finance: Clustering can be used to identify unusual patterns in financial transactions, which can be an indication of fraud.
4. Gene expression analysis in biology: Clustering is used to group genes that have similar expression
patterns across different samples. This helps biologists to identify genes that may be involved in a
particular biological process.
5. Anomaly detection in network security: Clustering can be used to identify unusual network traffic
patterns, which may indicate a security threat.
6. Recommendation systems in e-commerce: Clustering can be used to group products or users
based on their characteristics or behavior, which can be used to make personalized recommendations.
Overall, clustering is a powerful tool for discovering patterns and relationships in complex datasets,
and it has many practical applications in various fields.
Here’s a sample dataset for clustering analysis:
ID  Age  Income
1   25   35000
2   35   50000
3   45   65000
4   20   25000
5   50   80000
6   55   90000
7   30   40000
8   40   60000
9   50   75000
10  60   100000
These data points represent individuals with their age and income information. Now, let’s see how we
can perform clustering analysis in Excel:
Excel's PivotTable feature can be used to group this data, using either age or income level as the
grouping criterion. After entering the data in the spreadsheet, click the Insert tab and choose
PivotTable; the PivotTable is then created from the selected table or range.
The Quick Analysis tool can also be used to group and chart the data. Note that the "Clustered" chart types it offers group bars by category for visualization; they do not run a statistical clustering algorithm such as k-means, which requires formulas, Solver, or an add-in.
Image showing quick analysis button (blue circle)
Quick Analysis icon is clicked to open up the submenu.
Image showing cluster listed under charts. Horizontal or vertical types can be chosen
Image showing cluster chart created
Association rule mining:
Association rule mining is a technique used in data mining that analyzes the relationships between
different variables or items in a dataset. It is a way of discovering interesting patterns or relationships
between different data elements in a large database.
The basic idea behind association rule mining is to identify frequent patterns, which are sets of items
that occur together frequently in a given dataset. These patterns are then used to derive rules that
describe the relationships between different items in the dataset.
For example, if a supermarket wants to understand the buying habits of its customers, it might use
association rule mining to identify patterns in customer purchases. The supermarket might discover that customers who buy bread are also likely to buy milk, and that customers who buy chips are
likely to buy soda.
Association rule mining can be used in a wide range of applications, including market basket analysis, customer segmentation, fraud detection, and recommendation systems. It is a powerful tool for
uncovering hidden relationships in large datasets, and can provide valuable insights into the behavior of individuals or groups.
Association rule mining is used in a wide range of scenarios where analysts want to understand the
relationships between different variables or items in a dataset. Here are some common examples:
1. Market Basket Analysis: Association rule mining is widely used in retail settings to analyze customer purchasing patterns. By analyzing transaction data, retailers can identify which items are
frequently purchased together and use this information to optimize store layouts, product placement,
and promotions.
2. Customer Segmentation: Association rule mining can be used to segment customers based on
their purchasing behavior. By identifying groups of customers who tend to buy similar products,
marketers can target these groups with personalized marketing campaigns.
3. Fraud Detection: Association rule mining can be used to detect fraudulent behavior in financial
transactions. By analyzing transaction data, fraud analysts can identify patterns that are associated
with fraudulent behavior, such as a high volume of transactions from a single location.
4. Healthcare Analytics: Association rule mining is used in healthcare settings to identify patterns in
patient data. For example, healthcare providers can use association rule mining to identify risk factors for certain diseases or to identify which treatments are most effective for particular conditions.
5. Recommendation Systems: Association rule mining is widely used in recommendation systems,
such as those used by online retailers and streaming services. By analyzing customer purchase or
viewing history, these systems can recommend products or content that are likely to be of interest to
the customer based on their past behavior.
Overall, association rule mining is a powerful tool for identifying patterns and relationships in large
datasets, and can be used in a wide range of applications across different industries.
Here’s an example dataset that can be used for association rule mining:
Transaction ID  Items Purchased
1               A, B, C
2               A, C, D
3               B, D
4               A, C, D
5               A, B, D
6               A, B, D
7               A, B, C
8               A, C
9               B, D
10              A, B, D
In this dataset, we have 10 transactions where customers have purchased items A, B, C, and D. We
can use this dataset to discover the association rules between the items. For example, we may discover that customers who purchase item A are also likely to purchase item B or C.
Enter your data into an Excel worksheet. Each row should represent a transaction, and each column
should represent an item. You may want to use binary values (0 or 1) to indicate whether an item was
present in a transaction or not.
Convert the data into a format suitable for association rule mining. You may want to use the “PivotTable” feature in Excel to create a table that shows the frequency of each item and the combinations
of items that appear together.
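The support and confidence measures that underlie association rules can also be computed directly. This Python sketch uses the sample transactions above; the helper names are illustrative. Support is the fraction of transactions containing an itemset, and the confidence of a rule "A implies B" is the support of {A, B} divided by the support of {A}.

```python
# The ten sample transactions from the table above, as sets of items.
transactions = [
    {"A", "B", "C"}, {"A", "C", "D"}, {"B", "D"}, {"A", "C", "D"},
    {"A", "B", "D"}, {"A", "B", "D"}, {"A", "B", "C"}, {"A", "C"},
    {"B", "D"}, {"A", "B", "D"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / n

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent): support of the combined
    itemset divided by the support of the antecedent alone."""
    return support(antecedent | consequent) / support(antecedent)

# Example rule: customers who purchase item A also purchase item B.
sup_ab = support({"A", "B"})
conf_a_to_b = confidence({"A"}, {"B"})
```

Here item A appears in 8 of 10 transactions and {A, B} in 5 of 10, so the rule "A implies B" has support 0.5 and confidence 0.625, the kind of result a PivotTable frequency count lets you assemble by hand.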
Regression analysis:
Regression analysis is a statistical method used to determine the relationship between one or more
independent variables and a dependent variable. It is used to model the relationship between the independent and dependent variables and to predict the value of the dependent variable based on the
values of the independent variables.
Regression analysis can be used for both linear and non-linear relationships. In linear regression, the
relationship between the independent and dependent variables is assumed to be linear, which means
that the change in the dependent variable is directly proportional to the change in the independent
variable. In non-linear regression, the relationship between the independent and dependent variables
is assumed to be non-linear.
There are several types of regression analysis, including simple linear regression, multiple linear
regression, polynomial regression, logistic regression, and more. Simple linear regression involves
modeling the relationship between two variables, while multiple linear regression involves modeling the relationship between more than two variables. Polynomial regression involves modeling a
non-linear relationship using a polynomial equation, and logistic regression is used to model the relationship between a dependent variable and one or more independent variables that are categorical.
Regression analysis is widely used in various fields, including economics, finance, marketing, and social sciences. It is used to make predictions, analyze relationships, and make decisions based on data.
Regression analysis can be used in various scenarios where there is a need to model the relationship
between one or more independent variables and a dependent variable. Some of the common scenarios where regression analysis is used include:
1.Predicting sales: Regression analysis can be used to model the relationship between sales and
various independent variables such as price, advertising, and promotion. This can help in predicting
future sales based on changes in these independent variables.
2. Forecasting demand: Regression analysis can be used to model the relationship between demand
for a product and various independent variables such as price, income, and demographic variables.
This can help in forecasting demand for a product and optimizing production and inventory management.
3. Analyzing marketing campaigns: Regression analysis can be used to model the relationship between marketing activities such as advertising, promotions, and social media engagement and the
resulting sales. This can help in analyzing the effectiveness of marketing campaigns and optimizing
marketing strategies.
4. Evaluating employee performance: Regression analysis can be used to model the relationship
between employee performance and various independent variables such as training, experience, and
job satisfaction. This can help in evaluating the effectiveness of training programs and identifying
factors that affect employee performance.
5. Predicting customer churn: Regression analysis can be used to model the relationship between
customer churn and various independent variables such as customer satisfaction, pricing, and service quality. This can help in predicting customer churn and identifying factors that affect customer
retention.
6. Analyzing financial data: Regression analysis can be used to model the relationship between financial variables such as stock prices, interest rates, and economic indicators. This can help in analyzing
financial trends and predicting future market movements.
Overall, regression analysis is a powerful tool that can be used in a wide range of scenarios where
there is a need to model the relationship between variables and make predictions based on data.
Here is a sample data set that can be used for regression analysis. The data set includes information
on the number of hours studied and the corresponding exam scores for a group of students.
Hours Studied  Exam Score
2              60
3              75
4              85
5              90
6              95
7              98
8              100
9              105
10             110
Steps to Perform Regression Analysis in Excel:
1. Open Microsoft Excel and import the data set into a new worksheet.
2. Once the data is imported, select the “Data” tab from the ribbon and click on the “Data Analysis”
button. If the “Data Analysis” option is not available, you may need to enable it by going to “File” >
“Options” > “Add-Ins” and selecting “Analysis Toolpak.”
3. In the “Data Analysis” dialog box, select “Regression” and click “OK.”
4. In the “Regression” dialog box, enter the input range for the independent variable (hours studied)
and the output range for the dependent variable (exam score).
5. Select the “Labels” option if the data set includes column headers.
6. Choose the desired output options such as confidence interval and residuals.
7. Click on “OK” to run the regression analysis.
8. Excel will output the results in a new worksheet, including the regression equation, coefficients,
standard error, t-statistic, and p-value.
9. To visualize the regression line and data points, you can create a scatter plot with the independent
variable (hours studied) on the x-axis and the dependent variable (exam score) on the y-axis. Add a
trendline to the scatter plot to display the regression line.
10. Interpret the results to draw conclusions about the relationship between the independent and dependent variables.
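The regression coefficients that Excel reports can be checked by hand. The sketch below fits the least-squares line to the sample data above in plain Python; it illustrates the calculation the Regression tool performs, not a replacement for the ToolPak output.

```python
# Least-squares fit of exam score on hours studied (sample data above).
hours = [2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [60, 75, 85, 90, 95, 98, 100, 105, 110]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# slope = Sxy / Sxx, intercept = mean_y - slope * mean_x
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"score = {intercept:.2f} + {slope:.2f} * hours")
print(f"predicted score for 7 hours: {intercept + slope * 7:.1f}")
```

The slope and intercept printed here should match the coefficients in Excel's regression output worksheet, and the fitted line is the same trendline the scatter plot displays.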
Mastering Statistical Analysis with Excel
Image showing Data entered and Regression chosen from the submenu under Data analysis
Image showing Regression window where the input details have been filled
Image showing the Result screen
Anomaly Detection:
Anomaly detection is the process of identifying patterns or data points that deviate significantly from
the expected behavior or norm in a given data set. In statistics, an anomaly or outlier is a data point
that is significantly different from the other data points in the same data set.
Anomaly detection is an important task in many fields such as finance, manufacturing, and cybersecurity, where it is necessary to identify unusual behavior or events that may indicate fraud, errors, or
security breaches.
In statistical anomaly detection, various techniques are used to identify anomalies in data sets. These
techniques include:
1. Statistical methods: These methods involve calculating statistical measures such as mean, standard
deviation, and z-score to identify data points that deviate significantly from the expected values.
2. Machine learning methods: These methods involve using machine learning algorithms such as
clustering, classification, and regression to identify patterns in the data and detect anomalies.
3. Time-series methods: These methods are used to detect anomalies in time-series data such as
stock prices or weather patterns. They involve modeling the expected behavior of the data over time
and identifying data points that deviate significantly from the expected values.
4. Domain-specific methods: These methods involve using domain-specific knowledge and rules to
identify anomalies in data sets. For example, in manufacturing, anomalies may be identified based
on quality control rules or process limits.
Overall, anomaly detection is an important task in statistics that involves identifying unusual patterns or data points in a given data set. By identifying anomalies, it is possible to take corrective
actions, improve quality control, and prevent fraud or security breaches.
Anomaly detection can be used in various scenarios where it is important to identify unusual behavior or events that may indicate fraud, errors, or security breaches. Here are some examples:
1. Credit card fraud detection: Anomaly detection is used in the banking industry to identify fraudulent transactions. By analyzing past transactions and identifying unusual spending patterns, banks
can detect potential fraud and take action to prevent financial losses.
2. Network intrusion detection: Anomaly detection is used in cybersecurity to identify unusual
network behavior that may indicate a security breach. By analyzing network traffic and identifying
unusual patterns, security teams can detect potential threats and take action to prevent or mitigate
the impact of a breach.
3. Manufacturing quality control: Anomaly detection is used in manufacturing to identify defects in
products. By analyzing production data and identifying unusual patterns, quality control teams can
detect potential defects and take action to prevent the production of defective products.
4. Predictive maintenance: Anomaly detection is used in maintenance to identify equipment failures
before they occur. By analyzing sensor data and identifying unusual patterns, maintenance teams can
detect potential equipment failures and take action to prevent costly downtime.
5. Medical diagnosis: Anomaly detection is used in medical diagnosis to identify unusual patient
behavior that may indicate a medical condition. By analyzing patient data and identifying unusual
patterns, doctors can detect potential health issues and take action to prevent or treat the condition.
Overall, anomaly detection is a useful tool in many different fields where it is important to identify
unusual behavior or events. By detecting anomalies early, it is possible to take corrective actions and
prevent potential losses or damages.
Excel can be used for anomaly detection by following these general steps:
1. Import data: First, import the data into Excel from a CSV or Excel file.
2. Identify variables: Identify the variables that are important for the anomaly detection task.
3. Clean and preprocess data: Clean and preprocess the data to remove any missing values, duplicates, or outliers that may affect the analysis.
4. Calculate descriptive statistics: Calculate descriptive statistics such as mean, median, standard
deviation, and quartiles for the variables.
5. Calculate z-score: Calculate the z-score for each data point based on the mean and standard deviation of the variable. A z-score measures the distance between a data point and the mean in terms of
standard deviations.
6. Identify anomalies: Identify data points that have a z-score that is greater than a certain threshold.
A high z-score indicates that the data point is significantly different from the mean and may be an
anomaly.
7. Visualize anomalies: Visualize the anomalies using charts or graphs to better understand the patterns and identify any trends.
Here’s an example of how to perform anomaly detection in Excel using the z-score method:
1. Open the Excel file and import the data into a worksheet.
2. Identify the variable that is important for the anomaly detection task, such as revenue or website
traffic.
3. Clean and preprocess the data by removing any missing values or outliers.
4. Calculate the mean and standard deviation for the variable using the AVERAGE and STDEV functions.
5. Calculate the z-score for each data point using the formula: (data point - mean) / standard deviation.
6. Set a threshold for the z-score. Any data point with a z-score greater than the threshold is considered an anomaly.
7. Visualize the anomalies using a scatter plot or other visualization tool to better understand the
patterns.
Keep in mind that Excel is a limited tool for anomaly detection and is not suitable for large or complex data sets. More advanced tools such as machine learning algorithms may be needed for more
sophisticated anomaly detection tasks.
Here is a sample dataset for anomaly detection in Excel:
Time Stamp    Value
1             10
2             12
3             15
4             9
5             8
6             12
7             10
8             11
9             10
10            13
11            35
12            9
13            11
14            14
15            10
To perform anomaly detection on this dataset in Excel, you can follow these steps:
1. Open Excel and import the data into a new worksheet.
2. Calculate the mean and standard deviation of the “Value” column using the AVERAGE and STDEV (or STDEV.S) functions in Excel. In this case, the mean is 12.6 and the sample standard deviation is approximately 6.49.
3. Calculate the z-score for each data point using the formula: (data point - mean) / standard deviation. This gives a measure of how many standard deviations each data point lies from the
mean. For example, the z-score for the first data point (10) is about -0.4, and the z-score for the 11th data
point (35) is about 3.5.
4. Set a threshold for the z-score. Any data point with a z-score greater than the threshold is considered an anomaly. In this case, let’s set the threshold to 3, meaning that any data point with a z-score
greater than 3 will be considered an anomaly.
5. Identify the anomalies by highlighting any data points that have a z-score greater than the threshold. In this case, the 11th data point (35) has a z-score of about 3.5, which is greater than the threshold of 3,
so it is considered an anomaly.
6. Visualize the anomalies using a line chart or other visualization tool to better understand the patterns. In this case, you can create a line chart with the “Time Stamp” on the x-axis and the “Value” on
the y-axis. You can then highlight the anomalous data point (11th data point) on the chart to see how
it deviates from the rest of the data.
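The worked example above can be reproduced outside Excel with a few lines of code. This sketch computes the same mean, sample standard deviation (what Excel's STDEV function returns), and z-scores for the sample time series, and flags any point whose z-score exceeds the threshold of 3.

```python
# Z-score anomaly detection for the sample time series above.
values = [10, 12, 15, 9, 8, 12, 10, 11, 10, 13, 35, 9, 11, 14, 10]

n = len(values)
mean = sum(values) / n  # same as Excel's AVERAGE
# Sample standard deviation (same as Excel's STDEV / STDEV.S)
sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5

# Flag points more than 3 standard deviations from the mean
# (abs() catches anomalies in both directions).
threshold = 3
anomalies = [(i + 1, v) for i, v in enumerate(values)
             if abs(v - mean) / sd > threshold]

print(f"mean={mean:.1f}, sd={sd:.2f}, anomalies={anomalies}")
```

With this data the only flagged point is the 11th (value 35), matching the highlighted cell in the Excel walkthrough.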
Note that this is a simple example of anomaly detection in Excel and may not be suitable for more
complex or larger datasets. More advanced techniques such as machine learning algorithms may be
needed for more sophisticated anomaly detection tasks.
Image showing Data entered and average of the value column calculated using the inbuilt AVERAGE function in Excel
Image showing Average of value column calculated and displayed on pressing Enter key
Image showing formula for calculating standard deviation of value column entered
Image showing standard deviation of value column calculated and entered on pressing Enter key
Image showing the formula for calculating the z-score of the first data point entered. On pressing the Enter key the calculated value is displayed inside the cell.
Image showing Z value for value column calculated
Image showing Anomalous value cell colored brown
Sequence mining:
Sequence mining is a data mining technique used to identify frequent patterns or sequences of
events in data sets that are ordered in time or space. It is a type of pattern mining that is commonly
used in fields such as marketing, finance, and healthcare.
Sequence mining is used to identify patterns or trends that can help in decision making or predictive
modeling. It involves finding the most frequent subsequences or patterns in a sequence database,
which can then be used to make predictions or identify anomalies in the data.
Sequence mining can be used to answer questions such as:
What are the most frequent patterns of events that occur in a particular sequence (e.g. website clicks,
purchase history, medical treatments)?
What are the most frequent sequences of events that lead to a particular outcome (e.g. successful
sales, patient recovery, or customer churn)?
How can we predict future events based on past patterns or sequences?
Sequence mining is commonly used in fields such as market basket analysis, clickstream analysis,
fraud detection, and customer behavior analysis. It is typically performed using specialized software
or programming languages such as R or Python.
Scenarios where sequence mining is used:
Sequence mining can be used in a variety of scenarios where the order of events or transactions is
important. Here are some examples:
1. Market Basket Analysis: Sequence mining is used to identify frequently occurring sequences of
items that are purchased together. This can be used to suggest complementary products, optimize
product placement in stores, and make personalized recommendations to customers.
2. Clickstream Analysis: Sequence mining can be used to analyze the order in which users navigate
through a website or application, and identify patterns of behavior that lead to conversion or drop-off. This can be used to optimize the user experience and improve conversion rates.
3. Healthcare: Sequence mining can be used to analyze medical records and identify patterns of
treatment that lead to positive outcomes or negative side effects. This can be used to improve patient
care and optimize treatment plans.
4. Fraud Detection: Sequence mining can be used to analyze financial transactions and identify
patterns of behavior that are indicative of fraudulent activity. This can be used to detect and prevent
fraud in industries such as banking and insurance.
5. Customer Behavior Analysis: Sequence mining can be used to analyze customer behavior, such
as the order in which they interact with a website or purchase products. This can be used to identify
patterns that are associated with customer churn, and develop targeted retention strategies.
Overall, sequence mining is a powerful tool for identifying patterns and trends in data sets that are
ordered in time or space, and can be applied in a wide range of industries and scenarios.
Sequence mining is typically performed using specialized software or programming languages such
as R or Python. While it is possible to perform basic sequence mining in Excel, it may not be the
most efficient or effective tool for the task. That being said, here is an example of how to perform
basic sequence mining in Excel:
Sample Data:
Let’s assume that we have a dataset of customer purchase history in a grocery store, consisting of the
following columns:
Customer ID: unique identifier for each customer
Purchase Date: date of the purchase
Product Name: name of the product purchased
Here is a sample of what the data might look like:
Customer ID    Purchase Date    Product Name
1001           01/01/2022       Apples
1001           01/01/2022       Bananas
1001           02/01/2022       Apples
1002           01/01/2022       Bananas
1002           02/01/2022       Apples
1002           02/01/2022       Oranges
1003           01/01/2022       Apples
1003           02/01/2022       Bananas
1003           02/01/2022       Oranges
Steps to perform sequence mining in Excel:
1. Transform the data: The first step is to transform the data into a format that can be used for sequence mining. This typically involves converting the data into a transactional format, where each
row represents a transaction (e.g. a customer’s purchase history), and the products purchased are
listed in columns.
2. Create a frequency table: Next, create a frequency table that shows the frequency of each product
and each product combination. This can be done using Excel’s PivotTable feature. In the PivotTable,
drag the “Product Name” column into the Rows field, and drag it again into the Values field. This will
create a table that shows the frequency of each product.
3. Identify frequent sequences: Using Excel’s conditional formatting feature, highlight the cells in
the frequency table that represent frequent sequences (e.g. sequences that occur more than a certain
number of times). This will allow you to easily identify the most frequent product sequences.
4. Analyze the results: Finally, analyze the results to identify patterns and trends in the data. This may
involve visualizing the data using Excel’s charting features, or performing additional analysis using
other tools or programming languages.
Note that this is a very basic example of sequence mining in Excel, and more complex scenarios may
require specialized software or programming languages. However, this should give you an idea of
how to perform basic sequence mining using Excel.
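The frequency counting that the PivotTable step performs can be sketched in code. The example below groups the sample purchase data above into one ordered sequence per customer and counts consecutive two-item subsequences, which is a minimal form of sequence mining.

```python
from collections import Counter

# Purchase history from the sample dataset above, already ordered by
# customer and purchase date.
purchases = [
    (1001, "01/01/2022", "Apples"), (1001, "01/01/2022", "Bananas"),
    (1001, "02/01/2022", "Apples"), (1002, "01/01/2022", "Bananas"),
    (1002, "02/01/2022", "Apples"), (1002, "02/01/2022", "Oranges"),
    (1003, "01/01/2022", "Apples"), (1003, "02/01/2022", "Bananas"),
    (1003, "02/01/2022", "Oranges"),
]

# Group products into one ordered sequence per customer.
sequences = {}
for customer, _date, product in purchases:
    sequences.setdefault(customer, []).append(product)

# Count consecutive 2-item subsequences across all customers.
pair_counts = Counter()
for seq in sequences.values():
    for a, b in zip(seq, seq[1:]):
        pair_counts[(a, b)] += 1

print(pair_counts.most_common(2))
```

Here the pairs (Apples, Bananas) and (Bananas, Apples) each occur twice, which is what a frequency PivotTable with conditional formatting would surface. Real sequence-mining algorithms (e.g. in R or Python libraries) also handle non-consecutive subsequences and support thresholds.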
Text mining:
Text mining, also known as text analytics, is the process of analyzing large amounts of unstructured
textual data to extract relevant information and insights. This data can be in various forms such as
emails, social media posts, customer feedback, news articles, and other forms of text-based information.
Text mining involves the use of natural language processing (NLP) techniques and machine learning
algorithms to identify patterns and relationships in the text data. Some common text mining techniques include text categorization, sentiment analysis, topic modeling, named entity recognition, and
text clustering.
The insights gained from text mining can be used for a variety of purposes, such as market research,
customer feedback analysis, fraud detection, risk management, and more. Text mining has become
an important tool for businesses and organizations to gain valuable insights from the vast amounts of
text-based data available today.
Text mining can be used in a variety of scenarios to extract insights and information from large volumes of text-based data. Here are a few examples:
1. Social Media Analysis: Companies can use text mining techniques to analyze social media conversations about their products or services. This analysis can help them understand customer sentiment,
identify key trends, and develop targeted marketing campaigns.
2. Customer Feedback Analysis: Text mining can also be used to analyze customer feedback and
reviews, such as those found on e-commerce websites. This analysis can help companies identify
common customer complaints, areas for improvement, and new product ideas.
3. Fraud Detection: Text mining can be used to identify patterns of fraudulent behavior in large
volumes of financial transaction data. This analysis can help financial institutions detect fraudulent
activity and prevent losses.
4. Medical Research: Text mining can be used to analyze large volumes of medical research papers to
identify trends, relationships, and patterns. This analysis can help researchers identify new treatment
options, develop new drugs, and improve patient outcomes.
5. Legal Analysis: Text mining can be used in legal analysis to identify relevant case law and precedents. This analysis can help lawyers develop legal arguments, identify potential issues, and make
more informed decisions.
Overall, text mining can be used in any scenario where large volumes of text-based data need to be
analyzed to extract insights and information.
Here’s an example of sample data for text mining:
Suppose you have a set of customer reviews of a restaurant. Each review is a text document, and you
want to analyze these reviews to gain insights into customer sentiment, the most common topics
mentioned, and any issues that customers may have experienced.
Here are the steps you can follow to use Excel for text mining:
1. Import the data: You can import the customer review data into Excel by opening a new workbook
and selecting the “Data” tab. From there, select “From Text/CSV” and browse to the location of your
data file. Follow the prompts to import the data into Excel.
2. Clean the data: Before you can analyze the text data, you need to clean it by removing any unnecessary characters, punctuation, and stop words. You can use Excel’s text functions and filters to clean
the data.
3. Tokenize the text: Next, you need to tokenize the text data by splitting it into individual words or
phrases. You can use Excel’s text functions and the “Text to Columns” feature to tokenize the data.
4. Perform text analysis: Once the text data is tokenized, you can perform text analysis using Excel’s built-in functions or add-ins. For example, you can use the “COUNTIF” function to count the
number of times each word or phrase appears in the data, or you can use the “Word Cloud” add-in
to visualize the most common words.
5. Extract insights: Finally, you can extract insights from the text data by analyzing the results of your
text analysis. For example, you may find that customers frequently mention a particular menu item,
indicating that it is popular. Alternatively, you may find that customers frequently mention a particular issue, indicating that it needs to be addressed.
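The clean-tokenize-count pipeline in the steps above can be sketched as follows. The three reviews and the stop-word list here are hypothetical stand-ins for the imported restaurant data; the word counting mirrors what COUNTIF over a tokenized column produces.

```python
from collections import Counter
import re

# Hypothetical restaurant reviews standing in for the imported data.
reviews = [
    "The pasta was great but the service was slow.",
    "Great pasta and friendly service!",
    "Slow service, but the dessert was great.",
]

# A tiny stop-word list for illustration; real analyses use larger lists.
stop_words = {"the", "was", "but", "and", "a"}

# Tokenize: lowercase, split on non-letter characters, drop stop words.
tokens = []
for review in reviews:
    for word in re.findall(r"[a-z]+", review.lower()):
        if word not in stop_words:
            tokens.append(word)

counts = Counter(tokens)
print(counts.most_common(3))
```

In this toy data "great" and "service" dominate the counts, the kind of signal a word cloud or frequency table would surface for the most-mentioned topics.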
Overall, text mining in Excel involves importing the data, cleaning it, tokenizing it, performing text
analysis, and extracting insights. Excel provides a range of built-in functions and add-ins that can be
used to perform these tasks. However, for more advanced text mining tasks, it may be necessary to
use specialized text mining software or programming languages such as Python or R.
Web mining:
Web mining, also known as web data mining, is the process of extracting useful information and
insights from the vast amount of data available on the World Wide Web. Web mining is a broad field
that encompasses several different techniques and methodologies for analyzing web data, including
web content mining, web structure mining, and web usage mining.
Web content mining involves extracting information and knowledge from the textual content of web
pages. This can include analyzing the content of web pages to identify patterns, trends, and relationships between different pieces of information. Web content mining can also involve extracting specific pieces of information, such as product prices or contact information, from web pages.
Web structure mining involves analyzing the links between web pages to identify patterns and relationships between them. This can include analyzing the structure of web pages to identify patterns
in the way that different pages are linked together, or analyzing the structure of entire websites to
identify patterns in the way that different sections of the website are organized.
Web usage mining involves analyzing user behavior on websites to identify patterns and trends in the
way that users interact with web pages. This can include analyzing web server logs to identify which
pages are most frequently accessed, which pages have the highest bounce rate, or which pages are
most commonly visited by users from a particular geographic region.
Overall, web mining is an important tool for businesses and organizations to gain insights into the
behavior of users on the web, as well as to extract valuable information and insights from the vast
amount of data available on the web.
Web mining using Excel can be challenging, as Excel is not designed for web data analysis. However,
there are some ways in which you can use Excel to perform basic web mining tasks. Here are some
steps you can follow:
1. Import the web data into Excel: You can use the “From Web” option in Excel to import data from
a web page. To do this, go to the “Data” tab in Excel, select “From Web”, and enter the URL of the
web page you want to import data from. You can then select the data you want to import, and click
“Import” to bring the data into Excel.
2. Clean and preprocess the data: Once you have imported the web data into Excel, you will likely
need to clean and preprocess it to prepare it for analysis. This can include removing any unwanted
characters or symbols, converting data types, and normalizing data formats.
3. Perform basic analysis: With the web data cleaned and preprocessed, you can then perform basic
analysis using Excel’s built-in functions and tools. For example, you can use the “COUNTIF” function to count the number of times a specific keyword appears in the web data, or you can use the
“Filter” function to extract specific rows of data based on certain criteria.
4. Visualize the data: Once you have analyzed the web data, you can use Excel’s charting and visualization tools to create charts and graphs that help you better understand the data. For example, you
can use the “PivotTable” feature to summarize the data, and then create a chart based on the summary.
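The keyword counting in step 3 can be sketched in code. The page text below is a hypothetical stand-in for content already imported with Data > From Web; the case-insensitive count is analogous to applying COUNTIF across the imported rows.

```python
import re

# Hypothetical text extracted from an imported web page.
page_text = """
Widget Pro now in stock. Widget accessories on sale.
Contact sales for widget bulk pricing.
"""

# Case-insensitive keyword count over the imported text.
keyword = "widget"
count = len(re.findall(keyword, page_text, flags=re.IGNORECASE))
print(count)
```

This counts every occurrence of the keyword regardless of capitalization; note that Excel's COUNTIF counts matching cells rather than matching words, so an exact replica would first split the text into one word per cell.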
Overall, while Excel can be used to perform some basic web mining tasks, it is not the best tool for
this purpose. For more advanced web mining tasks, you may need to use specialized web mining
software or programming languages such as Python or R.
29

Importing data into Excel

Excel can import various types of data formats including:
1. CSV (Comma Separated Values)
2. TXT (Plain Text)
3. XLSX (Excel Workbook) and XLSM (macro-enabled Excel Workbook)
4. XML (Extensible Markup Language)
5. JSON (JavaScript Object Notation)
6. Access Database
7. Web Pages (HTML)
Excel can also import data from other sources such as SQL databases, SharePoint lists, and ODBC
data sources. However, the process for importing these types of data may be different than importing a
file directly into Excel.
To import data into Excel, you can follow these steps:
1. Open a new or existing Excel workbook.
2. Click on the “Data” tab in the ribbon at the top of the screen.
3. Select “From Text/CSV” if you are importing a CSV file, or “From File” if you are importing another
type of file.
4. Navigate to the location of the file you want to import, and select it.
5. Follow the prompts to choose the delimiter used in your file (e.g., comma, tab, semicolon), specify
any data format options (e.g., date/time format), and choose where you want to import the data (e.g., a
new worksheet or an existing one).
6. Click “Finish” to complete the import process.
Note that the exact steps may vary slightly depending on the version of Excel you are using, but the
general process should be similar.
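The parsing that the From Text/CSV dialog performs interactively can be sketched with Python's csv module. The file contents below are a made-up example; the delimiter handling and the follow-up calculation stand in for the import and analysis steps above.

```python
import csv
import io

# A small comma-delimited file as a string (hypothetical data).
raw = "Name,Score\nAlice,85\nBob,90\nCarol,78\n"

# DictReader uses the header row as column names, like Excel's
# "My data has headers" option.
rows = list(csv.DictReader(io.StringIO(raw)))

# CSV fields arrive as text; convert before calculating, just as
# Excel must interpret a column's data type during import.
scores = [int(r["Score"]) for r in rows]
print(len(rows), sum(scores) / len(scores))
```

A different delimiter (tab, semicolon) is handled by passing `delimiter="\t"` or `delimiter=";"` to DictReader, mirroring the delimiter choice in Excel's import prompts.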
While importing data into Excel, there can be a few common problems that one may face. Here are
some of them:
1. Incorrect Data Format: Sometimes, the data in the imported file may not be in the correct format.
For example, dates may be in a different format than what Excel expects, causing Excel to interpret
them as text instead of dates. This can be resolved by adjusting the data format options during the
import process.
2. Special Characters: If the imported file contains special characters that Excel does not recognize,
such as non-English characters or symbols, then the characters may not be displayed correctly in Excel. In such cases, one can try to change the encoding type or use a third-party tool to convert the file
to a more compatible format.
3. Large Data Sets: If the imported file is too large, it may take a long time for Excel to import it, or Excel may even crash due to insufficient memory. To solve this issue, one can try importing only a subset
of the data, or breaking up the data into smaller files.
4. Data Quality Issues: The imported data may have data quality issues such as missing values, duplicates, or inconsistencies. To address this, one can use Excel’s built-in data cleaning tools or perform
data cleaning operations outside of Excel before importing the data.
By being aware of these common issues, one can take appropriate measures to avoid or address them
while importing data into Excel.
Importance of importing data into Excel:
Importing data into Excel can be important for a number of reasons, including:
1. Data Analysis: Excel is a powerful tool for data analysis, and importing data into Excel allows you to
take advantage of its data analysis features. You can use Excel to sort, filter, and analyze large data sets,
and to create charts and graphs to visualize the data.
2. Data Management: Excel can be used to manage and organize data, making it easier to access and
work with. By importing data into Excel, you can store it in a structured format, making it easier to
search and retrieve specific data points as needed.
3. Collaboration: Excel allows multiple users to work on the same file simultaneously, making it a
great tool for collaboration. By importing data into Excel, you can share the data with others and work
together to analyze and manage it.
4. Data Entry: Importing data into Excel can be faster and more accurate than manually entering data.
This is particularly true for large data sets, where manually entering data can be time-consuming and
error-prone.
5. Integration with other tools: Excel can be integrated with other tools such as Power BI, which can
provide advanced data visualization and analytics capabilities. By importing data into Excel, you can
take advantage of these integrations to create more powerful data analysis and management solutions.
Overall, importing data into Excel can be a valuable step in the data analysis and management process, providing a structured and organized way to work with large data sets.
Image showing data tab clicked
Image showing get data submenu
Image showing data imported into Excel
Image showing imported data displayed as a table
Once you have imported data into Excel, there are a number of processing tasks that you can perform. Here are some common data processing tasks in Excel:
1. Filtering and Sorting: You can use the filter and sort functions to quickly and easily find specific
pieces of data or organize your data in a specific way. To sort, select the range of cells you want to
sort, click on the Sort button on the Home tab, and choose the options that you want. To filter, select
the range of cells you want to filter, click on the Filter button on the Home tab, and choose the options that you want.
2. Data Cleaning: Excel has a number of tools for cleaning and formatting your data. For example,
you can use the Find and Replace function to find specific characters or strings of text and replace
them with something else. You can also use the Text to Columns function to split text into separate
cells based on a delimiter.
3. Data Analysis: Excel has a number of built-in tools for data analysis, including PivotTables, PivotCharts, and various statistical functions. These tools can help you to summarize and analyze your
data in a variety of ways.
4. Calculations: Excel allows you to perform calculations on your data using formulas and functions.
For example, you can use the SUM function to add up a range of cells, the AVERAGE function to
find the average value of a range of cells, or the COUNT function to count the number of cells that
contain a certain value.
5. Charting: Excel has a number of charting tools that allow you to create charts and graphs based on
your data. To create a chart, select the range of cells that you want to include in the chart, click on the
Insert tab, and choose the type of chart that you want to create.
These are just a few of the many ways that you can process data in Excel. With a little practice, you’ll
be able to use Excel to analyze and visualize your data in a variety of ways.
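The filtering, sorting, and calculation tasks above can be sketched in code. The rows below are hypothetical stand-ins for a worksheet range; each operation is labeled with the Excel feature it mirrors.

```python
# Sample rows standing in for a worksheet range (hypothetical data).
rows = [
    {"product": "Widget", "qty": 5, "price": 10.0},
    {"product": "Gadget", "qty": 2, "price": 15.0},
    {"product": "Widget", "qty": 3, "price": 10.0},
    {"product": "Gizmo",  "qty": 1, "price": 20.0},
]

# Filtering (like the Filter button): keep only Widget rows.
widgets = [r for r in rows if r["product"] == "Widget"]

# Sorting (like the Sort button): order rows by quantity, descending.
by_qty = sorted(rows, key=lambda r: r["qty"], reverse=True)

# Calculations (like SUM, AVERAGE, COUNT):
total_qty = sum(r["qty"] for r in rows)
avg_price = sum(r["price"] for r in rows) / len(rows)

print(len(widgets), by_qty[0]["product"], total_qty, avg_price)
```

Each line corresponds to a single ribbon action or worksheet formula; in Excel the same results would come from AutoFilter, Sort, =SUM(...), and =AVERAGE(...) over the equivalent range.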
Sorting options can be accessed by clicking on the down-arrow button in the header cell of a column.
Image showing the result of clicking on the down-arrow icon next to petal width. From the submenu, data can be sorted as needed.
30

Data Transformation

Data transformation is the process of converting raw data into a more usable format for analysis or
processing. The goal of data transformation is to ensure that the data is in a format that is easier to
work with and analyze. Data transformation involves a variety of tasks, including:
1. Data Cleaning: This involves removing or correcting errors in the data, such as missing values, duplicate records, or formatting inconsistencies.
2. Data Integration: This involves combining data from multiple sources into a single dataset. This may
involve merging data from different databases, files, or spreadsheets.
3. Data Aggregation: This involves summarizing data at a higher level of granularity. For example, you
might summarize sales data by product category or by region.
4. Data Normalization: This involves organizing data in a consistent manner to reduce redundancy
and improve data integrity. For example, you might normalize data by ensuring that each field contains only one type of data.
5. Data Encoding: This involves converting data from one format to another. For example, you might
encode text data as numeric values so that it can be used in a machine learning algorithm.
Data transformation is an important step in the data processing pipeline. By transforming data into a
more usable format, analysts can gain insights that might not have been apparent from the raw data.
Excel provides a variety of tools to transform data, depending on your needs. Here are some common
ways to transform data in Excel:
1. PivotTables: PivotTables allow you to summarize and group data based on specific fields. You can
use PivotTables to summarize data by category, calculate totals, and create calculated fields. To create a
PivotTable, select the data range you want to summarize, go to the Insert tab, and click on PivotTable.
2. Text to Columns: If you have data in a single column that needs to be separated into multiple columns, you can use the Text to Columns tool. This feature allows you to split data based on a delimiter,
such as a comma or space. To use Text to Columns, select the column you want to split, go to the Data
tab, and click on Text to Columns.
Mastering Statistical Analysis with Excel
466
3. Conditional Formatting: Conditional formatting allows you to highlight cells based on specific conditions. For example, you can use conditional formatting to highlight cells that contain specific text,
values above or below a certain threshold, or duplicate values. To use conditional formatting, select
the cells you want to format, go to the Home tab, and click on Conditional Formatting.
4. Formulas and Functions: Excel includes a wide range of built-in formulas and functions that allow
you to manipulate and transform data. You can use formulas and functions to perform calculations,
create conditional statements, and manipulate text. Some of the most commonly used functions include SUM, AVERAGE, IF, and CONCATENATE.
5. Transpose: If you have data arranged in rows that needs to be in columns, or vice versa, you can use
the Transpose feature. This feature allows you to switch the orientation of your data. To use Transpose,
select the data you want to transpose, copy it, then right-click the cell where you want to paste the
transposed data and select “Transpose” under the Paste Options.
These are just a few of the many ways to transform data in Excel. By using the appropriate tools and
techniques, you can quickly and easily manipulate and analyze your data in a variety of ways.
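To make the Transpose idea concrete, here is a short Python sketch of switching rows and columns; the data and layout are hypothetical.

```python
# A small table stored as a list of rows (header row first):
rows = [
    ["Name",  "Q1", "Q2"],
    ["Alice",  10,   12 ],
    ["Bob",     7,    9 ],
]

# zip(*rows) pairs up the i-th element of every row -- i.e. the columns --
# which is exactly what Excel's Paste > Transpose does.
transposed = [list(col) for col in zip(*rows)]

print(transposed)
# [['Name', 'Alice', 'Bob'], ['Q1', 10, 7], ['Q2', 12, 9]]
```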
Here’s a sample dataset that we can use to demonstrate data cleaning in Excel:
Order ID   Customer Name    Product   Quantity   Price per Unit   Total Price
001        John Smith       Widget    5          $10.00           $50.00
002        Jane Doe         Widget    3          $8.50            $25.50
003        John Smith       Gadget    2          $15.00           $30.00
004        Bob Johnson      Gadget    4          $13.50           $54.00
005        Sarah Williams   Widget    6          $9.00            $54.00
006        Jane Doe         Widget    2          $8.50            $17.00
007        John Smith       Gizmo     1          $20.00           $20.00
008        Bob Johnson      Gadget    2          $12.00           $24.00
009        Sarah Williams   Widget    3          $9.00            $27.00
010        Jane Doe         Gizmo     1          $18.00           $18.00
Now let’s go through some examples of data cleaning tasks that can be performed in Excel:
1. Remove Currency Symbols: In the dataset, the price per unit and total price columns include a
dollar sign. To remove the dollar sign, select the entire column, press Ctrl + H (or click “Find &
Select” on the “Home” tab and choose “Replace”), enter “$” in “Find what”, leave “Replace with”
empty, and click “Replace All”.
2. Convert Text to Numbers: In the price per unit and total price columns, the data is formatted as
text. To convert the text to numbers, select the entire column, click on the “Data” tab, click on “Text to
Columns”, select “Delimited”, click “Next”, clear all of the delimiter checkboxes, click “Next” again,
choose “General” as the column data format, and click “Finish”.
3. Remove Duplicates: If the dataset contains duplicate records, select the data, click on the “Data”
tab, click on “Remove Duplicates”, and choose the columns to check for duplicates. Be careful about
checking a single column such as customer name: repeat customers are not errors, so it is usually
safer to compare entire rows.
4. Fill in Missing Values: In the product column, there is a missing value in row 7. To fill in the missing value, select the cell, click on the “Data” tab, click on “Flash Fill”, and Excel will automatically fill
in the missing value based on the pattern of the adjacent cells.
5. Remove Extra Spaces: In the product column, there are extra spaces before and after some of the
values. Excel does not have a “Trim” button, but you can use the TRIM function: in a helper column,
enter =TRIM(A2) and fill down, then paste the results back as values. TRIM removes leading, trailing,
and repeated internal spaces.
These are just a few examples of data cleaning tasks that can be performed in Excel. By performing
these tasks and others like them, you can ensure that your data is clean, accurate, and ready for analysis.
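For readers who also script their cleaning, the first, second, and fifth tasks above can be sketched in a few lines of Python. The records below reuse values from the sample orders table earlier in this chapter.

```python
# Two of the sample orders, as raw text the way they might be imported:
orders = [
    {"id": "001", "customer": "John Smith", "product": " Widget ",
     "unit_price": "$10.00", "total": "$50.00"},
    {"id": "002", "customer": "Jane Doe", "product": "Widget",
     "unit_price": "$8.50", "total": "$25.50"},
]

for o in orders:
    # Remove currency symbols (Excel: Find & Replace "$" with nothing),
    # then convert the remaining text to numbers.
    o["unit_price"] = float(o["unit_price"].replace("$", ""))
    o["total"] = float(o["total"].replace("$", ""))
    # Remove extra spaces (Excel: the TRIM function).
    o["product"] = o["product"].strip()

print(orders[0]["unit_price"], orders[0]["product"])  # 10.0 Widget
```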
Image showing sample data imported into Excel
Image showing an Excel data range converted into a table by pressing the Ctrl and T keys together.
Select the data that should be inside the table first; if the data contains headers, check the box in
front of “My table has headers”.
Image showing the Find and Replace icon, which can be used to find “$” and remove it.
Prof. Dr Balasubramanian Thiagarajan MS D.L.O.
Image showing “$” entered into the “Find what” field and the “Replace with” field left empty
Image showing Excel confirmation showing that the desired changes have been made
Image showing the Table with the desired data changes ($ sign has been removed)
Image showing Text to column icon clicked
Image showing Delimited chosen
31. Analyze Data Tab in Excel
In addition to inputting data into a spreadsheet, one of the most frequent activities that individuals engage in is data analysis. Did you know that Microsoft Excel has a built-in feature that caters
specifically to this task? This feature, formerly known as Ideas, is now called Analyze Data, and it can aid
in identifying patterns, trends, rankings, and other insights. Analyze Data is accessible to Microsoft 365
subscribers on Windows, Mac, and the web.
With Analyze Data in Excel, you have the ability to comprehend your data using natural language
queries, enabling you to inquire about your data without the need to create complex formulas. Furthermore, Analyze Data offers comprehensive visual overviews of trends, patterns, and summaries.
How to perform data analysis using the Analyze Data feature in Excel
To analyze data in Excel, begin by choosing a cell within a data range, then click on the Analyze Data
button located on the Home tab. Excel’s Analyze Data feature will then generate insightful visuals related to your data in a task pane. For more specific information, simply type a question into the query
box at the top of the pane and press Enter, and Analyze Data will return answers complete with tables,
charts, or PivotTables that can be added to the workbook. Additionally, Analyze Data offers personalized suggested questions to further explore the data, accessible by selecting the query box.
Image showing the location of Analyze Data
Image showing the dataset to be analyzed selected and the Analyze Data tab clicked. It demonstrates
the results of the various analyses of the selected data that Excel has performed automatically.
Image showing analysis of petal length and sepal width performed and the result displayed automatically
Image showing Scatter plot produced
Image showing Analyze Data answering the question posed to it
Below are some possible reasons why Analyze Data may not function on your data, along with suggested workarounds:
1. Analyze Data cannot currently process datasets containing more than 1.5 million cells, and there
is currently no solution for this issue. In the meantime, you may filter your data and then copy it to a
different location to use Analyze Data.
2. If you use string dates such as “2017-01-01,” they will be interpreted as text strings by Analyze
Data. To remedy this, you can create a new column that employs the DATE or DATEVALUE functions and then format it as a date.
3. Analyze Data cannot operate when Excel is in compatibility mode (e.g., when the file is in .xls
format). As an alternative, save the file in .xlsx, .xlsm, or .xlsb format.
4. Merged cells may also be difficult to work with. If you want to center data, such as a report header,
you can remove all merged cells and then use Center Across Selection to format the cells. To do this,
press Ctrl+1, then go to Alignment > Horizontal > Center Across Selection.
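The string-date issue in point 2 above can be sketched in code. Excel's DATEVALUE turns date text into a real date value; Python's standard library offers the same conversion, shown here with hypothetical dates.

```python
from datetime import date

# Date strings that a tool would otherwise treat as plain text:
text_dates = ["2017-01-01", "2017-06-15"]

# Parse them into real date values (the stdlib analogue of DATEVALUE).
parsed = [date.fromisoformat(s) for s in text_dates]

print(parsed[0].year, parsed[0].month, parsed[0].day)  # 2017 1 1
```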
Image showing data ranked according to the width of the petal
32. Grouping Function in Excel
The Group function in Excel allows you to group a selected range of cells or rows based on a common
value in a specific column. This can be particularly useful when you are working with large data sets
and need to analyze or summarize data based on a certain criterion.
Here are some common use cases for group function in Excel:
1. Summarize data: If you have a large data set and want to summarize it based on a common value in
a specific column, you can use the group function to group the data by that value. For example, you
could group sales data by month or by product category to see the total sales for each group.
2. Hide details: Grouping can also be used to hide details and focus on summary data. When you
group a set of rows or columns, you can collapse them to show only the summary data. This can be
particularly useful when you have a large data set and want to focus on the key information.
3. Subtotal function: The Subtotal function in Excel can be used to automatically calculate subtotals for
groups of data. By grouping data first, you can easily apply the Subtotal function to calculate subtotals
for each group.
4. Filtering data: Grouping can also be useful when filtering data. If you want to filter data based on
a specific value in a column, you can group the data by that column and then apply a filter to the
grouped data. This will allow you to filter the data for each group separately.
Overall, the group function in Excel can be a powerful tool for summarizing and analyzing large data
sets. It can help you to quickly understand patterns and trends in your data, as well as make it easier to
work with and manipulate.
Data grouping is an important tool for organizing and summarizing data in a logical and easy-to-understand format. Here are some of the main reasons why data grouping is important:
1. Improves data analysis: Grouping data can help to simplify complex data sets, making it easier to
analyze and identify trends or patterns. By grouping data based on common characteristics or categories, you can identify similarities and differences between groups and gain insights that might not be
immediately apparent when looking at the data as a whole.
2. Enhances data visualization: Grouping data can also make it easier to visualize the data using charts
or graphs. By grouping data into categories, you can create bar charts, pie charts, or other visualizations that provide a clear picture of the data and its relationships.
3. Increases efficiency: Grouping data can also increase efficiency by allowing you to focus on key data
points and reducing the amount of time and effort needed to analyze the data. By grouping data based
on specific criteria, you can quickly identify trends and patterns without having to sift through large
amounts of data.
4. Simplifies reporting: Grouping data can also simplify the reporting process by allowing you to summarize data in a clear and concise manner. By grouping data based on common characteristics or categories, you can create summary reports that provide a quick overview of the data and its key findings.
Overall, data grouping is an important tool for organizing, summarizing, and analyzing data. It can
help to simplify complex data sets, improve data analysis, enhance data visualization, increase efficiency, and simplify reporting. By using data grouping effectively, you can gain valuable insights and make
informed decisions based on your data.
You can group data in Excel using the following steps:
1. Select the rows or columns that you want to group.
2. On the “Data” tab, in the Outline group, click “Group”. You can also use the keyboard shortcut
Shift + Alt + Right Arrow.
3. Excel adds an outline bar next to the grouped rows or columns. Repeat the process for each block
of related rows or columns you want to group.
4. Inside a PivotTable, grouping works slightly differently: right-click a field value, choose “Group”,
and use the “Grouping” dialog box to set the starting value, ending value, and interval. For example,
you can group a date field by month by setting the starting and ending dates and choosing “Months”.
Once the data is grouped, you can collapse or expand the groups by clicking the plus or minus sign
next to the group headings. You can also apply functions such as SUM or AVERAGE to calculate summary data for each group.
To ungroup the data, select the grouped rows or columns and click “Ungroup” on the “Data” tab, or
use the keyboard shortcut Shift + Alt + Left Arrow.
There are several types of data grouping that you can use in Excel, including:
1. Grouping by dates: This involves grouping data by date ranges, such as by month, quarter, or year.
This is particularly useful when working with time-series data.
2. Grouping by text or numbers: This involves grouping data based on common text or numeric values
in a column, such as grouping sales data by product category or grouping employee data by department.
3. Grouping by custom lists: This involves grouping data based on a custom list of values that you
define. For example, you could create a custom list of product names and group sales data by product
using that list.
4. Grouping by hierarchy: This involves grouping data based on a hierarchical structure, such as by
country, region, and city. This is particularly useful when working with geographic data.
5. Grouping by intervals: This involves grouping data based on specific intervals or ranges, such as
grouping age data into age ranges or grouping price data into price ranges.
Each type of grouping has its own advantages and can be used to analyze data in different ways. The
choice of grouping method will depend on the nature of the data being analyzed and the questions
being asked.
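Grouping by intervals, the fifth type above, is easy to sketch in code: each value is assigned to the bin whose range contains it. The ages and bin edges below are hypothetical.

```python
# Hypothetical ages to group into age ranges (inclusive bounds):
ages = [23, 35, 41, 18, 52, 29, 64]
bins = [(0, 29), (30, 44), (45, 120)]

# Assign each age to the first bin whose range contains it.
grouped = {b: [] for b in bins}
for age in ages:
    for lo, hi in bins:
        if lo <= age <= hi:
            grouped[(lo, hi)].append(age)
            break

print(grouped[(0, 29)])   # [23, 18, 29]
print(grouped[(30, 44)])  # [35, 41]
```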
Here’s an example of how you could use grouping to analyze sales data by product category:
Product Category   Sales Amount
Electronics        $2,500
Clothing           $1,200
Home Decor         $1,800
Electronics        $3,000
Home Decor         $2,400
Clothing           $1,500
Home Decor         $1,200
Electronics        $4,500
To group this data by product category:
1. Sort the data by the Product Category column so that rows in the same category sit together
(Data tab > Sort).
2. Select the range of cells containing the data, including the headers (in this case, A1:B9).
3. On the “Data” tab, click “Subtotal”.
4. In the “Subtotal” dialog box, choose “Product Category” under “At each change in”, choose “Sum”
under “Use function”, and check “Sales Amount” under “Add subtotal to”.
5. Click “OK” to group the data.
After the subtotals are applied, Excel outlines the data so that you can collapse or expand each group
to view the summarized data for each category:
Product Category   Sales Amount
Electronics        $10,000
Clothing           $2,700
Home Decor         $5,400
In this example, you can see that the sales amount has been summarized for each product category,
making it easier to analyze the data and identify trends. By grouping the data in this way, you can
quickly see that Electronics is the top-selling category, followed by Home Decor and Clothing.
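The same group-and-sum operation can be sketched in a few lines of Python; the figures below are taken from the sample sales table above.

```python
# The sample sales data as (category, amount) pairs:
sales = [
    ("Electronics", 2500), ("Clothing", 1200), ("Home Decor", 1800),
    ("Electronics", 3000), ("Home Decor", 2400), ("Clothing", 1500),
    ("Home Decor", 1200), ("Electronics", 4500),
]

# Group by category and sum the amounts within each group.
totals = {}
for category, amount in sales:
    totals[category] = totals.get(category, 0) + amount

print(totals)
# {'Electronics': 10000, 'Clothing': 2700, 'Home Decor': 5400}
```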