Lecture 1 EDA 2023
Learning Objectives
At the end of the lecture the student is expected to understand the following:
Statistics
Distinguish between the two types of data
Identify the different types of qualitative data: nominal and ordinal
Identify the different types of quantitative data: discrete and continuous
The Difference Between Population and a Sample of the Population
The Difference Between a Parameter and a Statistic
The Difference Between Descriptive Statistics and Inferential Statistics
Statistics is concerned with scientific methods for collecting, organizing, summarizing, presenting, and
analyzing data as well as with drawing valid conclusions and making reasonable decisions on the basis of such
analysis.
Data is a collection of facts, such as numbers, words, measurements, observations or even just descriptions of
things. The kind of statistical analysis undertaken depends on the type of data at hand. The raw material of
statistics is data. This raw material (data) is needed to meet a wide spectrum of objectives of a typical
engineering experiment/research.
Data simply refers to the actual measurement (either with or without the aid of an instrument) taken from
specified entities (systems, units) according to certain standard procedures.
Data collection is usually undertaken in response to an identified need, such as gathering information for
decision making. The data collection process can be simple or complex and may involve a substantial number of
personnel or devices, usually from different disciplines (e.g. statisticians, engineers, project managers)
and addressing numerous levels of detail.
1.4 Qualitative Data
Qualitative data, also known as categorical data, are the simplest form of data. Qualitative data
describe a particular characteristic of a sample item. The values are non-numerical and most often have no
units. Examples of qualitative data are variables such as gender (male or female) or the grade in a course
(A, B, C, D or F).
Data that are created by assigning numbers to different categories when the numbers have no real meaning
are called nominal data. For example, in a survey that asks for gender, a 0 is often used to denote “male”
and a 1 is used to denote “female”. The numbers are used simply to represent different categories and have
no real meaning as numbers.
The order in which numbers are assigned to qualitative data may have some meaning. Data that are created by
assigning numbers to categories where the order of assignment has meaning are called ordinal data. For
example, the income of a sample may be classified as 1 = low, 2 = medium, and 3 = high. Here the numbers
have an ordering, but the differences between them have no numerical meaning: a code of 3 does not mean
three times the income of a code of 1.
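The distinction between nominal and ordinal codes can be sketched in a few lines of Python. This is only an illustration: the survey responses and the variable names are invented here, and the codes follow the examples in the text (0/1 for gender, 1/2/3 for income level).

```python
from collections import Counter

# Codes taken from the examples in the text; the responses below are hypothetical.
GENDER_CODES = {"male": 0, "female": 1}            # nominal: codes are only labels
INCOME_CODES = {"low": 1, "medium": 2, "high": 3}  # ordinal: codes carry an order

gender = ["female", "male", "female", "female"]
income = ["medium", "low", "high", "medium"]

gender_coded = [GENDER_CODES[g] for g in gender]
income_coded = [INCOME_CODES[i] for i in income]

# Counting how often each category occurs is valid for nominal and ordinal data.
gender_counts = Counter(gender)

# Ranking is meaningful only for ordinal data: sort responses by income level.
income_sorted = sorted(income, key=INCOME_CODES.get)

print(gender_coded)
print(gender_counts)
print(income_sorted)
```

Averaging the gender codes, by contrast, would produce a number with no interpretation as a gender, which is why nominal codes should be counted rather than used in arithmetic.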
Data that are inherently numerical are called quantitative data. These types of data usually have units.
Quantitative data can be discrete or continuous:
Discrete data are countable and can only take certain values, such as whole numbers (e.g. number of products,
number of workers, number of shifts in a day etc.).
Continuous data can take any one of an infinite number of possible values over an interval on the number
line. Continuous data usually consist of real numbers (e.g. temperature, concentration, pH, pressure,
viscosity etc.).
Put simply: discrete data are counted, continuous data are measured.
The population is everyone or everything you wish to study in order to make the necessary decision. Often,
when we study a population, we are interested in knowing about the different characteristics of each member of
the population. These characteristics are known as variables. A variable is a characteristic of each member of
the population.
A sample is a piece of the population. If you take a sample, you take a small piece of the population and look at
it or test it.
Why should we bother to examine a sample when what we really want to know about is the population? Most
of the time we cannot study the entire population, so we must use a sample as a guide. The main reasons why
we cannot study entire populations are:
For example, a company wants to study the productivity of its employees before and after casual dress is
permitted in the workplace. If the company has 5000 employees all doing different jobs, it would be virtually
impossible to measure the productivity of all of them.
By understanding and studying a sample, we gain insight and knowledge about the population. A numerical
descriptor of a population is called a parameter.
For example, a company that is thinking of adopting casual dress in the workplace would like to know the
percentage of employees whose productivity increased as a result of this policy change. The parameter here is a
percentage or a proportion.
Since we most likely will not know the value of the parameters needed to describe the population, we must
resort to using the information contained in the sample. A numerical descriptor of a sample, called a
statistic, is used to estimate the corresponding parameter of the population. So numbers to describe the
sample are needed for two reasons:
1. To paint a picture of the sample
2. To help estimate the corresponding population parameters.
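The relationship between a population parameter and a sample statistic can be illustrated with a small simulation. The sketch below is hypothetical: the 5000 employees come from the example in the text, but the 60% chance of a productivity increase, the sample size of 200, and all variable names are assumptions made only for illustration.

```python
import random

random.seed(2023)  # fixed seed so the sketch gives the same numbers on each run

POPULATION_SIZE = 5000  # employees in the company, as in the example above
SAMPLE_SIZE = 200       # assumed sample size, chosen only for illustration

# Hypothetical population: True means the employee's productivity increased.
# The 0.6 probability is invented; in practice this is exactly what is unknown.
population = [random.random() < 0.6 for _ in range(POPULATION_SIZE)]

# Parameter: a numerical descriptor of the whole population.
population_proportion = sum(population) / POPULATION_SIZE

# Draw a simple random sample without replacement and describe it.
sample = random.sample(population, SAMPLE_SIZE)

# Statistic: the same descriptor computed from the sample only.
sample_proportion = sum(sample) / SAMPLE_SIZE

print(f"population proportion (parameter): {population_proportion:.3f}")
print(f"sample proportion (statistic):     {sample_proportion:.3f}")
```

In a real study only the sample proportion would be available; the population proportion is computed here only because the population was simulated.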
What do we do with data that is collected on a sample from a population? What we can do depends on the
type of data we have and the questions we want answered about the behavior of the population. Tools that
can be used to answer questions about the population fall into two main categories: the tools of descriptive
statistics and the tools of inferential statistics.
Descriptive Statistics
The tools of descriptive statistics are usually the first ones used in any data analysis. They allow you to
describe the sample and to summarize the data.
Graphical and visual descriptive tools include bar charts, pie charts, and histograms. Graphical tools help
you to see how the data behave and to summarize the data visually.
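As a rough illustration of a graphical summary, the following Python sketch prints a text-only bar chart of course grades. The ten grades are invented for the example, and a plotting library would normally be used instead of printed characters.

```python
from collections import Counter

# Hypothetical course grades for a sample of ten students (illustrative only).
grades = ["B", "A", "C", "B", "D", "A", "B", "C", "F", "C"]

# Frequency of each category.
counts = Counter(grades)

# One row per category; the length of each bar equals the frequency.
for grade in ["A", "B", "C", "D", "F"]:
    print(f"{grade} | {'#' * counts[grade]} ({counts[grade]})")
```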
Numerical descriptive tools allow you to summarize the data numerically. Typically a numerical summary
provides such statistics as the average, the median, the mode and the largest and smallest data values.
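A numerical summary of this kind can be computed with Python's standard `statistics` module. The data below are hypothetical (defect counts on seven shifts) and serve only to show the calculation.

```python
import statistics

# Hypothetical number of defective products found on seven shifts.
defects = [4, 7, 7, 9, 10, 12, 7]

summary = {
    "mean": statistics.mean(defects),      # the average
    "median": statistics.median(defects),  # middle value of the sorted data
    "mode": statistics.mode(defects),      # most frequent value
    "min": min(defects),                   # smallest data value
    "max": max(defects),                   # largest data value
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For these seven values the mean is 8, the median and the mode are both 7, and the data range from 4 to 12.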
Descriptive tools are adequate if all we want to do is describe the sample. However, it is the population,
not the sample, that we care about. The tools of inferential statistics are needed to leap from the
information contained in a sample to the behavior of the population.
Inferential Statistics