
Lesson 1.
Biostatistics: principles and terminology. Data types. Frequency distribution. Graphical presentation of data
Penina Olga, dr. in medicine
USMF ”Nicolae Testemițanu”
Materials for the course
• Library:
1) Raevschi Elena, Tintiuc Dumitru: Biostatistics and Research Methodology;
Chisinau 2011

• SIMU (Biostatistics / theory):


1) .pdf presentations of my lectures (student’s version)
2) Reporting of findings of medical research : Term Project Presentation Guide,
Chisinau 2016
3) Questions for the intermediary test 1 and test 2
Outline
• 1. Basic terminology in biostatistics
• 2. Types of variables. Scales of measurement
• 3. Frequency distribution. Graphical presentation of data. Shape of a
distribution
1. Basic terminology
Statistics, population, sample, observational unit, variable, score …
Statistics

Statistics is the science of collecting, organizing, analyzing, and interpreting data.

The concepts of statistics may be applied to a number of fields, including business, psychology, agriculture, etc.

When the focus is on the biological and health sciences, we use the term biostatistics.
Population and Samples
• Population: the universe about which an investigator wishes to draw conclusions, i.e. the entire set of cases with the specified characteristics.
Example: all living adult males in a country

• Sample: a subset drawn from a larger population; the part that is actually being observed or studied.
Example: a sample of 1000 adult males living in a country
A population and a sample drawn from the population
An observational unit, or element, is the person or thing on which measurements are taken. It is also called a case or, when a person, a subject.

The number of observational units in a population is denoted by N, and the number of observational units in a sample by n.
Sample
The sample MUST represent the population as a whole (the sample must be REPRESENTATIVE).
A sample is representative if it closely resembles the population from which it is drawn.
• the elements of a sample should be selected randomly (at random)
• the sample must have an adequate size
Non-representative samples can cause serious problems

Different types of samples exist. Some examples:
Simple random sample: cases are selected from a population in a manner that ensures that each element of the population has an equal chance of being selected into the sample.
Stratified random sample: the population is first divided into relatively homogeneous groups, or strata, from which random samples are then drawn.
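The two sampling schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 1000 units, the "residence" stratum label, and the choice to sample the same fraction from every stratum are all hypothetical and not part of the lecture.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1000 observational units, 600 urban and 400 rural
population = [{"id": i, "residence": "urban" if i < 600 else "rural"}
              for i in range(1000)]

def simple_random_sample(units, n):
    """Every unit has an equal chance of selection (sampling without replacement)."""
    return random.sample(units, n)

def stratified_random_sample(units, key, fraction):
    """Split units into strata by `key`, then draw a random sample within each stratum.

    Sampling the same fraction from each stratum is an assumption of this sketch.
    """
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(random.sample(members, k))
    return sample

srs = simple_random_sample(population, 100)
strat = stratified_random_sample(population, "residence", 0.10)
urban_in_strat = sum(u["residence"] == "urban" for u in strat)

print(len(srs), len(strat), urban_in_strat)  # 100 100 60
```

In the stratified sample the urban share is fixed at 60 of 100 by construction, whereas in the simple random sample it varies from draw to draw.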
Variable
When gathering data we collect data from observational units
(individual cases) on particular variables

A variable is a measured characteristic of an observational unit


Examples: income, gender, age, height, weight, blood pressure etc.

Observed value or score : the data. The actual value of a variable for
one of the observational units.
Parameter and statistic
• Parameter: a value or values derived from population data
Ex.: the average height of all adult males in a country

• Statistic: a value or values derived from sample data
Ex.: the average height of a sample of 1000 adult males living in a country
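The distinction between a parameter and a statistic can be illustrated with simulated data. The heights below are generated values, not real measurements, and the population size, mean, and spread are assumptions chosen only for the sketch.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of N = 10,000 adult male heights in cm (simulated)
population = [random.gauss(175, 7) for _ in range(10_000)]
N = len(population)

# Parameter: computed from every observational unit in the population
mu = statistics.fmean(population)

# Statistic: computed from a simple random sample of n = 1000 units
sample = random.sample(population, 1000)
n = len(sample)
x_bar = statistics.fmean(sample)

print(f"N = {N}, parameter (population mean) = {mu:.1f} cm")
print(f"n = {n}, statistic (sample mean)     = {x_bar:.1f} cm")
```

The sample mean is close to, but generally not equal to, the population mean; in practice only the statistic is observed.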
[Figure: a population, a sample drawn from it, and an observational unit (an element), with example variables: age, weight, pulse, smoking status]
Descriptive and inferential statistics
• Descriptive statistics merely describe, organize, or summarize data
Example: the mean blood pressure of a group of patients

• Inferential statistics: statistics, derived from sample data, that are used to make inferences (conclusions) about the population from which that sample was drawn.
[Figure: Descriptive and inferential statistics. Labels: population, parameter (µ: population mean, σ: standard deviation, etc.), sampling, sample, measure data, descriptive statistics, inferential statistics, probability]

Probability plays a key role in inferential statistics.


2. Types of variable (types of data). Scales of measurement
Variable
When gathering data we collect data from observational units
(individual cases) on particular variables

A variable is a measured characteristic of an observational unit.

A variable takes on a number of values/scores (more than one).

Examples: income, gender, age, height, weight, blood pressure, etc.

Observed value or score: the data. The actual value of a variable for one of the observational units.
Observational unit: a student. A sample of 10 students.
VARIABLES
ID          Sex  Colour of eyes  Smoking status  Mark on anatomy  Height (cm)
Student 1   M    Black           <5              8                160
Student 2   M    Green           No              8                165
Student 3   M    Green           No              7                170
Student 4   M    Blue            No              6                175
Student 5   F    Blue            5-10            5                150
Student 6   F    Brown           5-10            5                145
Student 7   F    Black           >10             4                170
Student 8   M    Blue            No              10               165
Student 9   F    Green           No              9                160
Student 10  F    Black           <5              6                185

A statistic: the average height or mark on anatomy of the 10 students, or the % of non-smokers
A score (a value): a single entry in the table
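The statistics named above can be computed directly from the table of 10 students. The sketch below copies the rows from the table; the tuple layout and variable names are illustrative choices.

```python
import statistics

# Rows copied from the table of 10 students:
# (sex, colour of eyes, smoking status, mark on anatomy, height in cm)
students = [
    ("M", "Black", "<5",   8,  160),
    ("M", "Green", "No",   8,  165),
    ("M", "Green", "No",   7,  170),
    ("M", "Blue",  "No",   6,  175),
    ("F", "Blue",  "5-10", 5,  150),
    ("F", "Brown", "5-10", 5,  145),
    ("F", "Black", ">10",  4,  170),
    ("M", "Blue",  "No",   10, 165),
    ("F", "Green", "No",   9,  160),
    ("F", "Black", "<5",   6,  185),
]

n = len(students)                                   # sample size
mean_height = statistics.mean(s[4] for s in students)
mean_mark = statistics.mean(s[3] for s in students)
pct_non_smokers = 100 * sum(s[2] == "No" for s in students) / n

print(n, mean_height, mean_mark, pct_non_smokers)   # 10 164.5 6.8 50.0
```

Each of the three results is a statistic, because it is derived from the sample of 10 students rather than from the whole student population.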
Types of variables (1)
1. Qualitative variable: describes a quality of the person or thing studied. It is also called a categorical variable because the data fall into categories or classes. Numbers are often used to represent these categories, but these numbers serve just as labels, not as numerical values.
Example: gender: 1 = “male” and 2 = “female” (1 is not more or less than 2); colour of eyes; nationality; type of blood; diagnosis, etc.

2. Quantitative or numerical variable: measures the quantity of something. The assigned values are ordered and meaningful.
Examples: “height” (higher values on this variable indicate greater height); pulse, number of cigarettes, blood pressure, number of children, serum cholesterol, etc.
Types of variables (2)
1. Qualitative or Categorical
a) Alternative (dichotomous, binary, binomial) : variables that have only
two categories
Examples: Smoking status: “Yes and No”
Gender: “Male and Female” ;
Residence: “Urban and Rural”

b) Non-alternative: variables that have more than two categories
Examples: colour of eyes: green, blue, ….
level of education: primary, secondary, higher school, …
stages of injury: I (minor), II (moderate), III (severe), IV (fatal)
etc.
Types of variables (3)
2. Quantitative or Numerical
a) Discrete: variables that take on only whole numbers; no intermediate values are possible
Examples: patient’s pulse, number of risk factors a patient has, number of prior hospital admissions a patient had, number of cigarettes a patient smokes per day, etc.

b) Continuous: variables that take on any value on a continuum. Continuous data often include decimals or fractions of numbers
Examples: patient’s height, weight, blood pressure, serum cholesterol, age, temperature, length of time of survival after operation, etc.
Types of Variables (4). Scales of measurement
Qualitative or Categorical
• Nominal
• Ordinal
Quantitative or Numerical
• Interval
• Ratio
Mnemonic: “NOIR”
Why is the type of variable important?
The methods used to display, summarize, and analyze data depend on whether the
variables are categorical or numerical.
Nominal scale data
• Nominal scale data are qualitative data
• The data are classified into different qualitative categories or groups
• The order between the categories is NOT meaningful. In other words, the categories can be listed in any order without affecting the relationship between them
• When only two possible categories exist, the variable is sometimes called dichotomous or binary
Examples of nominal scale data
Examples of nominal variables might include:
• Gender (male, female)
• Eye color (blue, brown, green, hazel)
• Surgical outcome (dead, alive)
• Blood type (A, B, AB, O)
• Outcomes of medical treatment or surgical procedure
• Presence of possible risk factors
Ordinal scale data
• Ordinal scale data, like nominal data, are qualitative data and classified into different categories
• The order among the categories becomes meaningful. In other words, the categories can be ranked above or below each other
• However, there is no information about the quantitative distance between categories, i.e. the distance (the interval) between the categories is not equal and cannot be calculated
Examples of ordinal scale data
• Severity of injury (1 = fatal; 2 = severe; 3 = moderate; 4 = minor)
The difference between a fatal injury (score = 1) and a severe injury (score = 2) is not necessarily the same as the difference between a moderate injury (score = 3) and a minor one (score = 4).
• Stage of cancer (stage I, II, III, IV)
• Education level (elementary, secondary, college, university, post-university)
• Apgar score to describe the condition of newborns (from 0 to 10)
• Satisfaction level (very dissatisfied, dissatisfied, neutral, satisfied, very satisfied)
• Students’ grades
Interval scale data
• Interval scale data are quantitative (numerical) data and, like ordinal data, can be placed in a meaningful order
• In addition, they have constant, equal intervals between the items
• However, because interval scales do not have an absolute zero (i.e. zero does not mean a true absence of something), ratios of scores are NOT meaningful
Examples of interval scale data
• Temperature on the Celsius scale. The difference between 100° and 90° is the same as the difference between 50° and 40°. However, 100°C is not twice as hot as 50°C because 0°C does not indicate a complete absence of heat.
Ratio scale data
Ratio scale data have the same properties as interval scale data; however, because there is an absolute zero, meaningful ratios DO exist.
Examples of ratio variables:
• Weight (50 kilos, 100 kilos, 150 kilos, etc.)
• Pulse rate (beats per minute)
• Blood pressure (millimetres of mercury)
• Age (years, months, days, hours)
Because there is an absolute zero, it is correct to say that a pulse rate of 120 beats/min is twice as fast as a pulse rate of 60 beats/min.
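A short numerical check may help show why ratios fail on an interval scale but work on a ratio scale. The sketch converts the Celsius values from the example above to kelvins, a temperature scale that does have an absolute zero; the conversion step is an addition for illustration and does not come from the lecture.

```python
def celsius_to_kelvin(celsius):
    """Kelvin has an absolute zero, so it behaves as a ratio scale."""
    return celsius + 273.15

# Interval scale: differences are meaningful and comparable
diff_high = 100 - 90
diff_low = 50 - 40

# Interval scale: the naive ratio of Celsius readings is misleading
naive_ratio = 100 / 50
true_ratio = celsius_to_kelvin(100) / celsius_to_kelvin(50)

# Ratio scale: pulse rate has an absolute zero, so the ratio is meaningful
pulse_ratio = 120 / 60

print(diff_high == diff_low)   # True: equal intervals
print(naive_ratio)             # 2.0, but 100°C is not "twice as hot" as 50°C
print(f"{true_ratio:.2f}")     # 1.15 on the absolute (kelvin) scale
print(pulse_ratio)             # 2.0: 120 beats/min is twice as fast as 60
```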
Hierarchical data order
These levels of measurement can be placed in hierarchical order.

• Ratio: ratio becomes meaningful because there is an absolute zero
• Interval: distance becomes meaningful; ratio is not meaningful because there is no absolute zero
• Ordinal: order becomes meaningful; distance is not meaningful
• Nominal: order is not meaningful
Types of variables and scales of measurement (overview):
• Quantitative (Numerical) / Qualitative (Categorical)
• If Qualitative (Categorical):
  • Alternative / non-alternative
  • Nominal / Ordinal
• If Quantitative (Numerical):
  • Discrete / Continuous
  • Interval / Ratio
3. Frequency distribution. Graphical presentation of data. Shape of a distribution
A set of unorganized data is difficult to understand.
A simple first way of organizing the data is to list all the possible values (scores) on a variable between the highest and the lowest in order, recording the frequency (ƒ) with which each score occurs.

This forms a frequency distribution.
• Distribution : a collection of scores from a sample on a
single variable. Often these scores are arranged in order
from smallest to largest.

• Frequency (f): how often a score occurs in a distribution
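As a small illustration, an ungrouped frequency distribution can be built for the anatomy marks of the 10 students from the table in section 2. The use of `collections.Counter` is one possible approach, not a prescribed method.

```python
from collections import Counter

# Marks on anatomy of the 10 students (from the table in section 2)
marks = [8, 8, 7, 6, 5, 5, 4, 10, 9, 6]

counts = Counter(marks)

# List every possible score between the lowest and the highest, in order,
# together with the frequency f with which it occurs
frequency_distribution = [
    (score, counts[score]) for score in range(min(marks), max(marks) + 1)
]

print("Score  f")
for score, f in frequency_distribution:
    print(f"{score:>5}  {f}")
```

The frequencies sum to the sample size n = 10, which is a quick check that no score was dropped.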


The data can be made more manageable by creating a grouped frequency distribution.

Individual scores are grouped (between 7 and 20 groups are usually appropriate).

Each group of scores has an equal class interval.

In this example, there are 10 groups with a class interval of 10 (161 to 170, 171 to 180, and so on).

[Table: grouped frequency distributions of serum cholesterol levels in 200 men]
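Grouping can be sketched with the heights of the 10 students from the table in section 2, since the serum cholesterol values themselves are not listed here. The class width of 10 cm and the starting limit of 141 cm are assumptions chosen to mirror the "161 to 170" style of interval; with only 10 scores the result has fewer than the 7 to 20 groups recommended for real data.

```python
# Heights (cm) of the 10 students from the table in section 2
heights = [160, 165, 170, 175, 150, 145, 170, 165, 160, 185]

CLASS_WIDTH = 10    # equal class interval (assumed for this sketch)
FIRST_LOWER = 141   # lower limit of the first class (assumed)

def class_interval(value):
    """Return the (lower, upper) limits of the class that contains `value`."""
    index = (value - FIRST_LOWER) // CLASS_WIDTH
    lower = FIRST_LOWER + index * CLASS_WIDTH
    return (lower, lower + CLASS_WIDTH - 1)

grouped = {}
for h in heights:
    interval = class_interval(h)
    grouped[interval] = grouped.get(interval, 0) + 1

print("Class interval  f")
for (lower, upper), f in sorted(grouped.items()):
    print(f"{lower} to {upper}      {f}")
```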
[Tables: examples of frequency distribution tables for categorical and numerical data]
Choice of graph dependent on data type
Nominal: pie chart or bar chart
Ordinal: pie chart or bar chart
Numerical (interval or ratio): line graph, histogram, frequency polygon, scatterplot
Mnemonic: “NOIR”

Histogram
• A histogram depicts a frequency distribution for numerical data (discrete or continuous)
• The X axis shows the class intervals, and the Y axis shows the frequencies
• There are no gaps between the bars
• It gives an idea of the shape of the frequency distribution
[Figure: histogram of grouped frequency distribution of serum cholesterol levels in 200 men. X axis: serum cholesterol, mg/dL; Y axis: frequency, f]
Frequency polygon
• A frequency polygon is also used to display a frequency distribution for numerical data
• The same two axes are used as for a histogram
• The midpoints of the class intervals are joined by straight lines
[Figure: frequency polygon of distribution of serum cholesterol levels in 200 men. X axis: serum cholesterol, mg/dL; Y axis: frequency, f]
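The points joined in a frequency polygon can be computed from a grouped frequency distribution. The sketch below uses an assumed grouping of the 10 students' heights from section 2 into classes of width 10 cm; the class limits are illustrative, not taken from the lecture.

```python
# Grouped frequency distribution (assumed for illustration):
# ((lower limit, upper limit), frequency) for the 10 students' heights in cm
grouped = [
    ((141, 150), 2),
    ((151, 160), 2),
    ((161, 170), 4),
    ((171, 180), 1),
    ((181, 190), 1),
]

def midpoint(interval):
    """Midpoint of a class interval given as (lower, upper)."""
    lower, upper = interval
    return (lower + upper) / 2

# Each polygon vertex is (class midpoint, frequency); joining consecutive
# vertices with straight lines gives the frequency polygon
polygon_points = [(midpoint(interval), f) for interval, f in grouped]

print(polygon_points)
# [(145.5, 2), (155.5, 2), (165.5, 4), (175.5, 1), (185.5, 1)]
```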
Histogram and frequency polygon
Histograms and frequency polygons can be easily superimposed for comparison
[Figure: histogram and frequency polygon of distribution of serum cholesterol levels in 200 men. X axis: serum cholesterol, mg/dL; Y axis: frequency, f]
Scatterplots
Scatterplots illustrate the relationship between two numerical variables (interval or ratio scale variables)
Bar charts
• Bar charts are used to display categorical data (nominal or ordinal scale data)
• Each rectangle on the graph is clearly separated from the others by a space
[Figure: bar graph of mean serum cholesterol levels in 100 men and 100 women]
Pie charts
• Pie charts are used to display categorical data (nominal or ordinal scale data)
[Figure: viral hepatitis morbidity in a country X]
Normal distribution

Many naturally occurring phenomena are approximately distributed according to the symmetrical, bell-shaped, normal or Gaussian distribution.

Examples: body temperature, shoe sizes, diameters of trees, weight, height, etc.

[Figure: normal curve. X axis: score]

Carl Friedrich Gauss, the German mathematician (1777-1855)

[Figure: positively (right) skewed and negatively (left) skewed curves, each with the tail of the curve marked. X axis: from low scores to high scores]

• Asymmetric frequency distributions are called skewed distributions
• Positively (or right) skewed distributions and negatively (or left) skewed distributions can be identified by the location of the tail (!) of the curve: the tail points toward high scores when the skew is positive and toward low scores when it is negative
• Positively skewed distributions have a relatively large number of low scores and a small number of very high scores
• Negatively skewed distributions have a relatively large number of high scores and a small number of low scores
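The direction of skew can also be checked numerically. The sketch below uses the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5), a measure that is not introduced in this lesson and is used here only as an assumption for illustration; the three data sets are hypothetical.

```python
import statistics

def moment_skewness(scores):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(scores)
    mean = statistics.fmean(scores)
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5

# Hypothetical scores: many low values and one very high value (tail to the right)
right_skewed = [1, 1, 1, 2, 2, 3, 4, 9]
# Mirror image: many high values and one very low value (tail to the left)
left_skewed = [10 - x for x in right_skewed]
# Symmetric around 3
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

print(moment_skewness(right_skewed) > 0)   # True: positive (right) skew
print(moment_skewness(left_skewed) < 0)    # True: negative (left) skew
print(moment_skewness(symmetric))          # 0.0: no skew
```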
[Figure: shapes of frequency distributions: normal / symmetric / bell-shaped, bimodal, uniform, positively (right) skewed, negatively (left) skewed]

Summary
• Biostatistics
• Population/sample/observational unit
• Parameter and statistic
• Descriptive and inferential statistics
• Types of variable (data)
• Scales of measurement: NOIR
• Frequency distribution (ungrouped, grouped)
• Types of frequency distribution (normal, skewed to right/left, bimodal, uniform)
• Graphical presentation of categorical data (bar/column charts, pie charts)
• Graphical presentation of numerical data (histogram, frequency polygon, scatterplot, line chart)
