Chapter - I 1. Introduction: - 1.1 Definition and Classification of Statistics


Introduction to statistics

1. Introduction: -
1.1 Definition and Classification of Statistics
Defn: - Statistics: is the science which deals with the collection, organization, presentation, analysis and interpretation of numerical data.
Data: are the measurements or observations (values) of a variable (factor).
■ A collection of data values forms a data set.
■ Each value in the data set is called a data value or datum.
o Why we need statistics?

Classification of Statistics
There are two branches of statistics:-
1. Descriptive statistics: - A statistical method that deals with describing or summarizing a given set of data.
o Here there is no generalization or conclusion about the population.
o Consists of collection, organization and presentation of data.
o E.g. frequency distribution, measures of central tendency (such as mean, median), measures of dispersion (such as range, variance).
2. Inferential statistics: - Is the process of drawing conclusions (inferences) about a population based on the information obtained from a sample.
o Involves performing and testing hypotheses, determining relationships among variables, and making predictions.
o Used to describe, infer, estimate or approximate the characteristics of the target population.

o Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger
o These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics
(estimation), forecasting of future observations, descriptions of association (correlation), or modeling of relationships (regression).
If the sample is representative of the population, then inferences and conclusions made from the sample can be extended to the
population as a whole.
o Statistics offers methods to estimate and correct for randomness in the sample and in the data collection procedure.
The fundamental mathematical concept employed in understanding such randomness is probability.

Need for Statistics

1.2 Stages in Statistical Investigation

 Involves the following stages:-
Stage 1. Data collection
o Is the process of gathering information or data about the variable of interest.
o Data are obtained from either a primary or a secondary source.
Stage 2. Organization of data
o Is the process of editing, classifying and tabulating data.
o Editing: is the process of checking and correcting data for omissions, inconsistencies, irrelevant answers and wrong computations in the
collected data.
o Classification: is the task of grouping the collected and edited data into similar categories based on some criteria.
o Tabulation: is putting the classified data in the form of a table.
Stage 3. Presentation of data
o Is to display what is contained in our data in the form of pictures. E.g. Diagrams and graphs.
Stage 4. Analysis of data
o Is performing mathematical operations on the collected and organized data. E.g. calculating the mean, variance, etc.
Stage 5. Interpretation
o Is giving meaning to the result obtained in the analysis stage.
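
As a small illustration of the analysis stage (Stage 4), the sketch below computes a mean and a variance in Python. The weights are hypothetical values chosen only for illustration, and the population variance (dividing by n) is an assumption, since the notes do not say which version of the variance is meant.

```python
from statistics import mean, pvariance

# Hypothetical weights (kg) of five students; not taken from the notes.
weights = [52, 60, 58, 65, 55]

avg = mean(weights)        # arithmetic mean
var = pvariance(weights)   # population variance (divides by n)

print("mean =", avg)
print("variance =", round(var, 2))
```

Interpretation (Stage 5) would then be the step of saying what these two numbers mean for the group being studied.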
1.3 Definitions of Some Basic Terms
A variable:-
o Is a factor or characteristic that can take on different possible values or outcomes.
o A variable can be qualitative or quantitative.
o A qualitative (categorical) variable: - is the variable that can be expressed in categorical ways. I.e. it cannot be expressed in terms of
o E.g. Sex, marital status, religion, region, etc.
o A quantitative variable: - is a variable that can be measured in numerical ways.
o E.g. Height, income, weight, age, etc.

 Quantitative random variable is further sub-divided in to two, these are:

I. Discrete random variable: - usually obtained by counting.
o Here there is a jump between values. E.g. number of children per family.
II. Continuous random variable: - usually obtained by measurement.
o No jump between values. E.g. age, weight, height.
Elementary unit:
o Is the specific person, business, product, and so on, with some characteristics to be measured or categorized.
o E.g. The weight of particular person in this class.
Population:
o Is the totality of cases (items) under consideration in a given investigation or research. E.g. all the students in JUCAVM.

Population can be finite or infinite.

o Finite population: is a population that is limited in size. E.g. the students of JUCAVM.
o Infinite population: is a population that is unrestricted in size. E.g. the production of bacteria (the observation is cannot be even in
Sample:
o Is any non-empty subset of a population. E.g. 25 staff members of JUCAVM out of 500 staff members.
Parameter:
o It is a measurable characteristic of the population.
o It is a numerical result obtained by measuring the population.
Statistic: (not to be confused with Statistics)
o It is a measurable characteristic of the sample.
o It is used to estimate a parameter.
Sampling:
o Is the method of obtaining a sample from the population.
Survey experiment:
o It is the device for obtaining the desired data. E.g. collecting observations on the weight of students in the Agro-Economics department.
Statistical design:
o It is the process that involves a decision problem and choosing an approach to solve the problem.
o It is a guide that indicates how an investigation is going to be channeled.
Sampling frame:
o It is the listing of all elementary units in the population under consideration (study).

1.4. Uses and limitation of statistics

Uses of statistics:
o It presents facts in a definite and precise form.
o It simplifies masses of data (data reduction/condensation).
o It furnishes techniques of comparison.
o It estimates unknown population parameters.
o It is used for formulating and testing hypotheses.
o It is used for forecasting future events.
Limitations of statistics:
o It deals only with aggregates of facts and not with individual data.
o It deals only with quantitative information (data).
o Statistical results are usually not exact but approximate.
o Statistics is liable to be misused.

2. Introduction to Method of Data Collection & Presentation:

Types and Method of Data Collection

o Collection of data implies a systematic and meaningful assembly of information for the accomplishment of the objective of a statistical
o It refers to the methods used to gather the required information from the units under investigation. The quality of data greatly affects the
final output of an investigation. Hence, utmost care should be given to the data collection process and every possible precaution should
be taken to ensure accuracy while collecting data. Otherwise, with inaccurate and inadequate data, the whole analysis is likely to be faulty
and the decisions taken will be misleading.

2.1. Source of Data

o Statistical data may be obtained from either a primary or a secondary source.
o A primary source is a source from which first-hand information is gathered. A secondary source, on the other hand, is one that makes available data
that were collected by some other agency.
Data obtained from a primary source are called primary data. Likewise, data gathered from a secondary source are known as secondary data. For
example, assume that a simple study is to be conducted to see the age distribution of HIV/AIDS victims. Clearly the variable of study
is age. Data about the age of HIV/AIDS victims may be obtained by directly interviewing the victims. Note that in this specific case
the victims are the primary source, and the data to be collected from them are primary data. Alternatively, one may use the records of
hospitals and other related agencies to obtain the age of the victims without the need to trace the victims personally. Therefore, the records
of the hospitals, in our case, are a secondary source and the data copied from such records are secondary data.
2.3 Method of Data Collection

A) Method of Primary Data Collection

After discussing the two sources of data, primary and secondary, it is logical to say a few words about the methods employed in collecting data
from its original or primary source. Many authors commonly state three methods of collecting primary data. These are the personal enquiry method
(interview method), direct observation and the questionnaire method.

i) Personal Enquiry Method (Interview Method)-

In the personal enquiry method, a question sheet is prepared, which is called a schedule. The schedule contains all the questions that would
extract a complete report from respondents. Usually, schedules are pre-tested so as to remove discrepancies such as ambiguous
and irrelevant questions. This pre-testing process is called a pilot survey. It is worth mentioning that the schedule is not given directly
to the respondent; rather, it is the interviewer who asks the questions on the schedule and jots down the interviewee's (respondent's)
responses. Depending on the nature of the interview, the personal enquiry method is further classified into two types.
Direct Personal Interview: It is a type of personal enquiry in which there is face-to-face contact with the persons from whom the
information is to be obtained. In other words, the investigator contacts each respondent personally, without the interference of a third party,
asks the questions given in the schedule one by one and notes down the respondent's replies on the schedule.
Indirect Personal Enquiry (Interview): It is the second type of personal enquiry, in which the investigator contacts third parties, called
witnesses, who are capable of supplying the necessary information. Here, the information is not collected directly from the respondent but
from a third person who knows the respondent well. Such an approach is useful in cases where the respondent is expected to conceal
information about himself/herself.

ii) Direct Observation

In this approach, an investigator stays at the place of survey and notes down the observations himself/herself. There are no enquiries in the case of direct
observation. For example, an investigator making a study on the nutritional status of children may directly (physically) measure the weight, height
and other required parameters himself/herself. Direct observation is more experimental and is usually applied in scientific studies. It is time
consuming and also costly.

iii) Questionnaire Method

Under this method, a list of questions related to the survey is prepared and sent to the various respondents by post, website, email,
etc. However, this method cannot be used if the respondent is illiterate. It is a method that is often used in many statistical investigations.

The following are the major points that we need to take into account while preparing the questionnaire.
o The number of questions should be small. Naturally, respondents are not comfortable with lengthy questionnaires, which usually bore them. Hence, fifteen
to twenty-five questions in a questionnaire are optimal. If a lengthy questionnaire is unavoidable, it should preferably be divided into two or
more parts.
o The questions should be short, clear, simple and unambiguous. Moreover, the questions must be arranged in a logical order so that a
natural and spontaneous reply to each is induced. For instance, it is not appropriate to ask a person how many packets of cigarettes he/she
smokes before asking whether he/she smokes or not.
o Questions of a sensitive nature should be avoided. Sensitive questions are those that are too personal or pecuniary, like
source of income, drinking habits, etc. The logic here is that respondents do not willingly answer sensitive questions. Such information,
if necessary, may be gathered through interviews or through other indirect questions.
o Questions should be capable of objective answers. As much as possible, avoid subjective questions and keep to questions of fact. To
this end, multiple answer questions can be used.
o Mail questionnaires should be accompanied by a covering letter, which should state the purpose of the questionnaire, promise
confidentiality of responses, etc.
B) Method of Secondary Data Collection
o In most cases secondary data are obtained from such sources as census and survey reports, books, official records, reported experimental results, previous
research papers, bulletins, magazines, newspapers, websites and other publications. Different organizations and government agencies publish
information (data) in the form of reports, periodicals, journals, etc. In the case of Ethiopia, the Central Statistical Authority (CSA) is the first to be
mentioned in publishing such relevant information (secondary data).
Advantage of Primary Data
o Primary data give more reliable, accurate and adequate information, which suits the objective and purpose of an investigation.
o A primary source usually shows data in greater detail.
o Primary data are free from errors that may arise from copying figures from publications, which is the case with secondary data.

Disadvantage of Primary data

o The process of collecting primary data is time consuming and costly.
o Often, primary data give misleading information due to lack of integrity of investigators and non-cooperation of respondents in providing
answers to certain delicate questions.
■ Advantage of Secondary Data
o It is readily available and hence convenient and much quicker to obtain than primary data.
o It reduces time, cost and effort as compared to primary data.
o Secondary data may be available on subjects (cases) where it is impossible to collect primary data, such as regions where there is war.
■Some Disadvantage of Secondary Data:
o Data obtained may not be sufficiently accurate,
o Data that exactly suit our purpose may not be found,

o Error may be made while copying figures.
2.4. Data Presentation
o There are two methods of data presentation.
Tabular presentation & Diagrammatic presentation
2.4.1. Tabular Presentation (Frequency Distribution)
 Frequency Distribution
o Is a table that shows data classified into a number of classes, with the corresponding number of observations falling in each class (frequency).
Frequency: - is the number of times a certain value of the variable is repeated in a given class.
 Types of Frequency Distribution
a) Categorical Frequency Distribution: -
o Here the classification criterion is qualitative, i.e. a qualitative random variable is used.
E.g. in JUCAVM, 25 students who applied for a scholarship were classified according to their class year.
1st, 2nd, 2nd, 4th, 3rd, 2nd, 3rd, 3rd, 4th, 1st, 2nd, 3rd, 1st, 2nd, 3rd, 3rd,
3rd, 4th, 1st, 3rd, 2nd, 2nd, 3rd, 4th, 1st
 Categorical frequency
 Relative frequency distribution
 Percentage frequency.
Class year   Frequency   Rf = fi/ft   Pf = Rf x 100%
1st          5           0.20         20
2nd          7           0.28         28
3rd          9           0.36         36
4th          4           0.16         16
Total        25          1.00         100
Note: - Σ Rf =1 and Σ Pf=100.
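
The categorical frequency table above can be reproduced with a short Python sketch. It tallies the 25 class-year values listed in the example and derives Rf and Pf from the counts; the variable names are illustrative only.

```python
from collections import Counter

# Class years of the 25 scholarship applicants listed in the example.
data = ("1st 2nd 2nd 4th 3rd 2nd 3rd 3rd 4th 1st 2nd 3rd 1st 2nd 3rd 3rd "
        "3rd 4th 1st 3rd 2nd 2nd 3rd 4th 1st").split()

freq = Counter(data)        # frequency of each class year
total = sum(freq.values())  # ft, the total frequency

table = {}
for year in ["1st", "2nd", "3rd", "4th"]:
    rf = freq[year] / total             # relative frequency
    table[year] = (freq[year], rf, rf * 100)
    print(year, freq[year], round(rf, 2), round(rf * 100))
```

Summing the second and third entries of each row is a quick way to check the note that Σ Rf = 1 and Σ Pf = 100.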

b) Numerical Frequency Distribution: -

o Here the classification criterion is quantitative. It is grouped into two. These are: - Simple (Ungrouped) frequency distribution &
Grouped frequency distribution.
I. Simple (Ungrouped) Frequency Distribution: - is a distribution that uses individual data values along with their frequencies.
o Usually used when the data range is small.
E.g. Raw data on the number of children per family.
Required: Construct the ungrouped frequency distribution, Rf and Pf.

No of children/family   Frequency   Rf            Pf
0                       3           3/20 = 0.15   15%
1                       4           4/20 = 0.20   20%
2                       6           6/20 = 0.30   30%
3                       4           4/20 = 0.20   20%
4                       3           3/20 = 0.15   15%

o Cumulative Frequency Distribution: - is a frequency distribution that displays the sum of the frequencies of consecutive classes above or
below a given class.
There are two types of cumulative frequency: -
a) Less than cumulative frequency (Lcf): it is used when our interest focuses on the total number of observations lying below a specified value.
b) More than cumulative frequency (Mcf): it is used when our interest focuses on the total number of observations lying above a specified value.

Class   Frequency   Lcf   Mcf
0       3           3     20
1       4           7     17
2       6           13    13
3       4           17    7
4       3           20    3
Total   20
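
The Lcf and Mcf columns are running totals taken from the top and from the bottom of the frequency column. A minimal Python sketch, using the frequencies from the table above, is:

```python
from itertools import accumulate

classes = [0, 1, 2, 3, 4]   # number of children per family
freqs = [3, 4, 6, 4, 3]     # frequency of each class

# Less than cumulative frequency: running total from the first class down.
lcf = list(accumulate(freqs))

# More than cumulative frequency: running total from the last class up.
mcf = list(accumulate(reversed(freqs)))[::-1]

for c, f, l, m in zip(classes, freqs, lcf, mcf):
    print(c, f, l, m)
```

The last Lcf value and the first Mcf value both equal the total number of observations, which is a useful check on the table.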

II. Grouped Frequency Distribution:

o Is a frequency distribution having several values grouped in to one class.
Usually used when the range of the data is large.
Terms (Defn): -
1. Class Limit: - separates one class from another, with a certain gap between classes. There are two types of class limit:-
■ Upper class limit (UCL)
■ Lower class limit (LCL)
There is a gap between UCLi and LCL(i+1).
o Unit of measurement (U):- is the smallest possible distance between two consecutive measures.
o U is usually taken as 1.
2. Class Boundary:- has two parts. These are
■ Upper class boundary (Ucb)
■ Lower class boundary (Lcb).
o Separates one class from another.
o There is no gap between Ucbi and Lcb(i+1).
Lcbi = LCLi - U/2
Ucbi = UCLi + U/2

Class Limit   Class Boundary
1-5           0.5-5.5
6-10          5.5-10.5
11-15         10.5-15.5
16-20         15.5-20.5
3. Class Width (w): - is the difference between the lower and upper class boundaries of any class, i.e. W = Ucbi - Lcbi.
Alternatively, it is possible to find the class width in any of the following ways:
W = Lcb(i+1) - Lcbi
Or W = Ucb(i+1) - Ucbi
Or W = UCL(i+1) - UCLi
Or W = LCL(i+1) - LCLi
4. Class Mark (Mid Point =Mi)
Mi is the midpoint of a class interval, i.e. the average of the lower and upper class limits (or, equivalently, of the class boundaries).
i.e. Mi = (LCLi + UCLi)/2 or Mi = (Lcbi + Ucbi)/2
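
The four terms above can be checked numerically. The following Python sketch applies the formulas to the class limits 1-5, 6-10, 11-15 and 16-20 from the class boundary table, assuming U = 1; the function name class_boundaries is illustrative only.

```python
def class_boundaries(lcl, ucl, u=1):
    """Return (Lcb, Ucb) for one class, given its limits and the unit U."""
    return (lcl - u / 2, ucl + u / 2)

# Class limits (LCL, UCL) from the class boundary table.
limits = [(1, 5), (6, 10), (11, 15), (16, 20)]

boundaries = [class_boundaries(lcl, ucl) for lcl, ucl in limits]
width = limits[1][0] - limits[0][0]               # W = LCL(i+1) - LCLi
marks = [(lcl + ucl) / 2 for lcl, ucl in limits]  # Mi = (LCLi + UCLi)/2

print(boundaries)
print(width, marks)
```

The same width is obtained from the boundaries of any single class (for example 5.5 - 0.5), which confirms that the alternative formulas agree.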
Steps Needed to Construct Grouped Frequency Distribution
1. Calculate the range (R)
R = Xmax - Xmin
2. Calculate the number of classes using Sturges' formula
k = 1 + 3.322 log n (logarithm to base 10), where k = number of classes,
n = number of observations and
n = Σfi
Always round k up. E.g. k = 4.5 ~ 5
3. Calculate the class width
W = R/k; like k, W must be rounded up to the next whole number.
4. Identify the starting point: - LCL1 = Xmin
LCL2 = Xmin + W
E.g. Construct a grouped frequency distribution for the following raw data.

11, 29, 6, 33, 14, 31, 22, 27, 19, 20, 21, 18, 17, 22, 38, 23, 26, 34, 39, 27
1. R = Xmax - Xmin = 39 - 6 = 33
2. k = 1 + 3.322 log 20 = 5.32 ~ 6
3. W = R/k = 33/6 = 5.5 ~ 6
4. LCL1 = Xmin = 6
Class limit   fi   Mi     Class boundary   Lcf   Mcf
6-11          2    8.5    5.5-11.5         2     20
12-17         2    14.5   11.5-17.5        4     18
18-23         7    20.5   17.5-23.5        11    16
24-29         4    26.5   23.5-29.5        15    9
30-35         3    32.5   29.5-35.5        18    5
36-41         2    38.5   35.5-41.5        20    2
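
The worked example can be retraced step by step in Python. The sketch below follows the four steps given earlier (range, Sturges' formula, class width, starting point) on the same 20 raw values. It assumes whole-number data with unit of measurement U = 1, so each upper class limit is LCL + W - 1.

```python
import math

data = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
        21, 18, 17, 22, 38, 23, 26, 34, 39, 27]

n = len(data)
r = max(data) - min(data)                  # step 1: range
k = math.ceil(1 + 3.322 * math.log10(n))   # step 2: Sturges' formula, rounded up
w = math.ceil(r / k)                       # step 3: class width, rounded up

rows = []
lcl = min(data)                            # step 4: LCL1 = Xmin
for _ in range(k):
    ucl = lcl + w - 1                      # assumes U = 1
    f = sum(lcl <= x <= ucl for x in data) # class frequency fi
    rows.append((lcl, ucl, f, (lcl + ucl) / 2))
    lcl += w

for row in rows:
    print(row)
```

Each printed row gives the class limits, the frequency and the class mark, and can be compared line by line with the table above.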

2.4.2. Diagrammatic & Graphic Presentation of Data

►Presentation of data diagrammatically is simple & easy to understand.
i. Bar chart (bar diagram): a series of equally spaced bars of equal width (base), where the height of each bar represents the frequency
(amount) associated with each class.
Usually applied for categorical random variables.
A bar chart can be either vertical or horizontal.
E.g. construct a bar chart for the previously used scholarship data.

Class year   Frequency
1st          5
2nd          7
3rd          9
4th          4
Total        25

There are various types of bar chart. These are:-

a) Simple bar chart:
b) Multiple bar chart: - various information in one bar.
c) Component (sub-divided) bar chart

ii) Pie Chart: - is a circle divided into sectors, one for each category of the distribution, where the angle of each sector is the category's
share of 360°, in proportion to the frequency (percentage) associated with that category.
Class   Frequency   Rf     Pf     Angle = 360° x Rf
1st     5           5/25   20%    72°
2nd     7           7/25   28%    100.8°
3rd     9           9/25   36%    129.6°
4th     4           4/25   16%    57.6°
Total   25          1      100%   360°
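
The sector angles in the pie chart table follow directly from the frequencies. A minimal Python sketch of the calculation, using the scholarship data, is given here; it only computes the angles and does not draw the chart.

```python
# Frequencies of the scholarship applicants by class year.
freqs = {"1st": 5, "2nd": 7, "3rd": 9, "4th": 4}
total = sum(freqs.values())

# Angle of each sector: the category's share of the full 360 degrees.
angles = {year: f * 360 / total for year, f in freqs.items()}

for year, angle in angles.items():
    print(year, round(angle, 1))
```

The angles should add up to 360°, which is an easy check before drawing the sectors.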

iii) Pictogram: it represents the magnitude of certain things by their pictures.

o It is not frequently used.
Graphic presentation of data:
1. Histogram: usually used to present quantitative data.
o Is a graph that consists of a series of rectangles whose bases are equal to the class widths of the corresponding classes and whose heights are proportional to the class frequencies.
o It is constructed from a grouped frequency distribution.
o In a histogram we use class boundaries on the X-axis.

E.g. construct a histogram for the following data.

Class limit   Class boundary (CB)   Frequency
6-10          5.5-10.5              1
11-15         10.5-15.5             2
16-20         15.5-20.5             3
21-25         20.5-25.5             5
26-30         25.5-30.5             4
31-35         30.5-35.5             3
36-40         35.5-40.5             2
Total                               20
Frequency Polygon:
o Is a line graph that displays the data using a line that connects points plotted for the frequencies at the class marks.
o i.e. the frequency of each class gives the height of the point above its class mark.
o A frequency polygon can also be superimposed on a histogram.

Cumulative Frequency Polygon (Ogive):

o This is a line graph obtained by plotting the cumulative frequency distribution (y-axis) against the class boundaries (x-axis).
Class boundary   fi   Lcf   Mcf
5.5-10.5         1    1     20
10.5-15.5        2    3     19
15.5-20.5        3    6     17
20.5-25.5        5    11    14
25.5-30.5        4    15    9
30.5-35.5        3    18    5
35.5-40.5        2    20    2
Total            20
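
The points to be plotted for the two ogives can be listed with a short Python sketch. It assumes the common convention, not stated explicitly in these notes, that the less than ogive pairs each Lcf with the upper class boundary and the more than ogive pairs each Mcf with the lower class boundary. Only the coordinates are produced; no plotting library is used.

```python
from itertools import accumulate

# Class boundaries (Lcb, Ucb) and frequencies from the ogive table.
boundaries = [(5.5, 10.5), (10.5, 15.5), (15.5, 20.5), (20.5, 25.5),
              (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
freqs = [1, 2, 3, 5, 4, 3, 2]

lcf = list(accumulate(freqs))               # less than cumulative frequency
mcf = list(accumulate(freqs[::-1]))[::-1]   # more than cumulative frequency

# (x, y) points: x is a class boundary, y is a cumulative frequency.
less_than_points = [(ucb, c) for (_, ucb), c in zip(boundaries, lcf)]
more_than_points = [(lcb, c) for (lcb, _), c in zip(boundaries, mcf)]

print(less_than_points)
print(more_than_points)
```

Joining each list of points with straight lines gives the corresponding ogive.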

Exercise. The following table is a grouped frequency distribution of the money spent per visit by a random sample of 100 customers at a department store.
Amount spent   No of customers
3-7            10
8-12           30
13-17          35
18-22          20
23-27          5
Total          100

i) For each of the above classes state: -

a) the class limits
b) the class boundaries
c) the class width
d) the class mark
ii) Construct the relative frequency distribution.
iii) Construct a histogram and superimpose the frequency polygon on it.
iv) Construct both the less than and the more than type of ogive.
