Data Presentation and Analysing: Orcid ADK-8747-2022
Abstract
Data is the basis of information, reasoning, or calculation, and it is analysed to obtain
information. Data analysis is a process of inspecting, cleansing, transforming, and
modelling data with the main aim of discovering useful information, informing conclusions,
or supporting theories for empirical decision-making. Most business decisions are taken in
a dynamic and uncertain business environment, so data analysis is necessary to assist
management in taking informed decisions that will enhance performance. Data also enable a
business to plan again without repeating past mistakes.
Data presentation forms an integral part of all academic and business research as
well as professional practice. Data presentation requires skills and an
understanding of data. It is necessary to make use of collected data, which are
considered to be the raw facts that need to be processed to provide empirical
information.
Analysis starts with the collection of data, followed by processing (which can be
done by various data processing methods). Processing helps in obtaining
information from the data, as the raw form is not comprehensible. Data can be
presented in the form of a frequency table (displayed as frequencies, percentages,
or both) or as a diagrammatic display (graphs, charts, maps and other methods).
These methods add a visual aspect which makes the data much more comfortable and
easy to understand. This visual representation is also called data visualization.
Data analysis is used in business to understand the problems facing an organisation and to
explore data in meaningful ways. This means finding out more about your customers: knowing
what they buy, how much they spend, and their preferences and purchasing habits. It also
involves keeping tabs on your competitors to find out who their customers are and what
their spending habits, preferences and other details are, in order to gain a competitive
edge in the market.
Section I: Introduction
Data is defined as things known or assumed as facts, which form the basis of information,
reasoning or calculation; in other words, data is analysed to obtain information. Data
analysis is therefore a process of inspecting, cleansing, transforming and modelling data
with the main aim of discovering useful information, informing conclusions, and supporting
empirical decision-making. These processes involve a number of closely related operations
which are performed with the purpose of summarizing the collected data, then organizing
and manipulating them to bring out information that proffers solutions to the research
questions raised. Xia and Gong (2015) opined that in today's business world, data analysis
plays a role in making decisions more scientific and helping businesses operate more
effectively.
Most business decisions are taken in a dynamic and uncertain business environment, so data
analysis is necessary to assist management and other stakeholders in taking informed
decisions that will enhance performance. Data enable a business to plan again without
repeating past mistakes.
Data presentation forms an integral part of all academic and business research as
well as professional practice. Data presentation requires skills and an understanding
of data. It is necessary to make use of collected data, which are considered to be the
raw facts that need to be processed to provide empirical information. Data analysis
helps in the interpretation of data and helps to take a decision or answer the research
questions raised in the process of research. This is done by using various data
processing tools and software. Analysis starts with the collection of data, followed
by processing (which can be done by various data processing methods). Processing
helps in obtaining information from the data, as the raw form is not comprehensible.
Data can be presented in the form of a frequency table (displayed as frequencies,
percentages, or both) or as a diagrammatic display (graphs, charts, maps and other
methods). These methods add a visual aspect which makes the data much more
comfortable and easy to understand. This visual representation is also called data
visualization.
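As a minimal illustration of the frequency tables and charts described above, the following Python sketch tabulates a small set of categorical responses as frequencies and percentages and plots them as a bar chart. The choice of pandas and matplotlib, and the survey responses themselves, are assumptions made here for illustration; they are not taken from the paper.

# A minimal sketch of tabulating and visualising collected data with pandas
# and matplotlib; the survey responses below are invented for illustration.
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical raw responses (qualitative/categorical data)
responses = pd.Series(
    ["Agree", "Agree", "Neutral", "Disagree", "Agree", "Neutral", "Agree"],
    name="response",
)

# Frequency table showing counts and percentages side by side
freq_table = pd.DataFrame({
    "Frequency": responses.value_counts(),
    "Percentage": responses.value_counts(normalize=True).mul(100).round(1),
})
print(freq_table)

# Diagrammatic display (bar chart) of the same table: data visualization
freq_table["Frequency"].plot(kind="bar", title="Survey responses")
plt.tight_layout()
plt.show()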
Data analysis tools make it easier for users to process and manipulate data, analyze
the relationships and correlations between data sets, and identify patterns and
trends for interpretation. This paper is organized in five sections: Section I is the
introduction, Section II reviews types of data, Section III deals with methods of
analyzing data, Section IV reviews various methods of data cleaning, and Section V
presents the summary and conclusion of the paper.
Section II: Types of Data
There are basically two types of data in statistics, which can be further
sub-divided into four (4) types. Data can be quantitative or qualitative. Quantitative
data are data about quantities of things, things that we measure, and so we describe
them in terms of numbers; as such, quantitative data are also called numerical data.
Qualitative data (also known as categorical data) give us information about the
qualities of things; they are observed phenomena, not measured, and so we generally
label them with names. Therefore all data that are collected are either measured
(quantitative data) or observed features of interest (qualitative data).
Quantitative data can be either discrete or continuous. Discrete data are data that
can only take certain values and cannot be made more precise. These might only be
whole numbers, like the number of students in a class at a certain time (any number
from 1 to 20 or even 40), or could be another type of fixed number scheme, such as
shoe sizes (2, 2.5, 3, 3.5, and so on). They are called discrete data because they
have fixed points and measures in between do not exist (you cannot have 2.5
students, nor can you have a shoe size of 3.49). Counted data are also discrete data,
so the number of patients in a hospital and the number of faculty in a university are
both examples of discrete data.
Continuous data are data that can take any value, usually within certain limits, and
could be divided into finer and finer parts. A person's height is continuous data as it
can be measured in metres and fractions of metres (centimetres, millimetres,
nanometres). The time of an event is also continuous data and can be measured in years
and divided into smaller fractions, depending on how accurately you wish to record
it (months, days, hours, minutes, seconds, etc.).
While quantitative data are measured, qualitative data are observed and placed into
categories. An exception to this is when categories have been numbered for practical
purposes, such as types of animals (1, 2, 3, ...) instead of (Pig, Sheep, Cow, ...). In
this case, the numbers must be treated as the names of the categories; you are not
allowed to do any calculations with them.
Stanley Stevens (1946), an American psychologist, subdivided quantitative and
qualitative data into four (4) types: Ratio, Interval, Ordinal and Nominal data.
a. Ratio data is defined as quantitative data, having the same properties as
interval data, with an equal and definitive ratio between each data point and
absolute "zero" treated as a point of origin. In other words, there can be no
negative numerical value in ratio data.
b. Interval data, also called an integer, is defined as a data type which is
measured along a scale, in which each point is placed at equal distance from
one another. Interval data always appears in the form of numbers or
numerical values where the distance between the two points is standardized
and equal.
c. Ordinal data is a categorical, statistical data type where the variables have
natural, ordered categories and the distances between the categories are not
known. These data exist on an ordinal scale, one of four levels of
measurement.
d. Nominal data (also known as nominal scale) is a type of data that is used to
label variables without providing any quantitative value. It is the simplest
form of a scale of measure. One of the most notable features of nominal data
is that, unlike ordinal data, it cannot be ordered and cannot be measured.
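The following illustrative Python sketch shows one way the four measurement levels can be represented so that only meaningful operations are allowed. The use of pandas, and all the variables in it, are assumptions introduced here for illustration, not taken from the paper.

# Illustrative sketch (invented data): the four measurement levels in pandas.
import pandas as pd

df = pd.DataFrame({
    # Nominal: labels only, no order, no arithmetic
    "blood_group": pd.Categorical(["A", "O", "B", "AB"], ordered=False),
    # Ordinal: ordered categories, but distances between them are unknown
    "satisfaction": pd.Categorical(
        ["low", "high", "medium", "high"],
        categories=["low", "medium", "high"], ordered=True),
    # Interval: equal spacing but no true zero (e.g. temperature in Celsius)
    "temp_celsius": [21.5, 19.0, 25.3, 22.1],
    # Ratio: equal spacing and an absolute zero (e.g. income), so ratios are meaningful
    "income": [1200.0, 0.0, 950.5, 3100.0],
})

print(df.dtypes)
# Ordered comparisons are valid for ordinal data...
print(df["satisfaction"] > "low")
# ...but arithmetic on nominal labels would be meaningless, so pandas refuses it.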
Section III: Methods of Analyzing Data
Methods of analyzing data include, among others:
Diagnostic Analysis
Predictive Analysis
Prescriptive Analysis
a. Inferential Analysis
Statistical inference is the process of using data analysis to infer properties of an
underlying distribution of probability. Inferential statistical analysis infers
properties of a population, for example by testing hypotheses and deriving
estimates. It arises from the fact that sampling naturally incurs sampling error,
and thus a sample is not expected to perfectly represent the population.
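As a hedged sketch of inferential analysis, the snippet below tests a hypothesis about a population mean from a sample, which is exactly the situation where sampling error makes inference necessary. The choice of scipy and the simulated sample are assumptions made for illustration.

# Minimal sketch of inferential analysis: testing a hypothesis about a population
# mean from an invented sample using scipy. Because of sampling error the sample
# mean differs from the hypothesised population mean; the test indicates whether
# that difference is statistically significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=5.2, scale=1.0, size=30)   # sample drawn from the population

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)  # H0: population mean = 5.0
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject H0: the sample suggests the population mean differs from 5.0")
else:
    print("Fail to reject H0: no evidence the population mean differs from 5.0")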
b. Descriptive Analysis
Typical descriptive measures include skewness (a distortion or asymmetry that
deviates from the symmetrical bell curve, or normal distribution), kurtosis (a
statistical measure that defines how heavily the tails of a distribution differ from
the tails of a normal distribution) and the standard deviation.
When we use descriptive statistics it is useful to summarize our group of data
using a combination of tabulated description (i.e., tables), graphical description
(i.e., graphs and charts) and statistical commentary (that is, a discussion of the
results).
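A minimal sketch of such a descriptive summary, assuming pandas/NumPy and an invented numeric sample, is shown below; it reports a tabulated summary together with the skewness, kurtosis and standard deviation just mentioned.

# Minimal sketch of descriptive analysis on an invented numeric sample.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.Series(rng.normal(loc=50, scale=8, size=200), name="score")

print(data.describe())                       # count, mean, std, min, quartiles, max
print("skewness:", round(data.skew(), 3))    # asymmetry relative to the normal curve
print("kurtosis:", round(data.kurt(), 3))    # heaviness of the tails vs the normal
print("std dev :", round(data.std(), 3))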
One of the strengths of predictive analytics is its ability to input multiple
parameters. For this reason, forecast models are one of the most widely used
predictive analytics models. They are used across different industries and for
different business purposes, and they are popular because they are incredibly
versatile.
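For illustration only, the sketch below fits one common kind of forecast model, Holt-Winters exponential smoothing from statsmodels, to an invented monthly series; the paper does not prescribe a specific implementation, and ARIMA or other forecast models follow the same fit-then-forecast pattern.

# A minimal forecast-model sketch on an invented monthly sales series.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(1)
idx = pd.date_range("2018-01-01", periods=60, freq="MS")
trend = np.linspace(100, 160, 60)
season = 10 * np.sin(2 * np.pi * np.arange(60) / 12)
sales = pd.Series(trend + season + rng.normal(0, 3, 60), index=idx, name="sales")

# Additive trend and seasonality, 12-month seasonal cycle
model = ExponentialSmoothing(sales, trend="add", seasonal="add", seasonal_periods=12)
fit = model.fit()
print(fit.forecast(12))   # forecast the next 12 months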
v. Clustering Model
The clustering model takes data and sorts it into different groups based on
common attributes. The ability to divide data into different datasets based on
specific attributes is particularly useful in certain applications, like
marketing. For example, marketers can divide a potential customer base
based on common attributes. It works using two types of clustering: hard and soft
clustering. Hard clustering categorises each data point as either belonging to a
cluster or not, while soft clustering assigns each data point a probability of
belonging to a cluster, as illustrated in the sketch below.
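The sketch below illustrates the hard versus soft clustering distinction using scikit-learn, an assumed toolkit, on invented customer attributes: KMeans assigns each point to exactly one cluster, while a Gaussian mixture model returns membership probabilities.

# Minimal sketch of hard vs soft clustering on invented customer data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# Two invented attributes per customer, e.g. annual spend and visits per month
X = np.vstack([
    rng.normal([20, 2], 1.5, size=(50, 2)),
    rng.normal([60, 8], 1.5, size=(50, 2)),
    rng.normal([40, 5], 1.5, size=(50, 2)),
])

hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
soft_probs = GaussianMixture(n_components=3, random_state=0).fit(X).predict_proba(X)

print("hard label of first customer :", hard_labels[0])
print("membership probabilities     :", soft_probs[0].round(3))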
Predictive analytics models have their strengths and weaknesses, and each is best
suited to specific uses. One of the biggest benefits applicable to all models is
that they are reusable and can be adjusted to incorporate common business rules; a
model can be reused and trained using algorithms.
a. Primary data
These are data that capture personal perspectives and interpretations; in this
case the source of data collection is a primary source.
Primary data source is an original data source, one in which the data are
collected firsthand by the researcher for a specific research purpose or
project. Primary data can be collected in a number of ways. Primary data
collection is quite expensive and time consuming compared to secondary
data collection. It is done by capturing phenomena and events through
observation, pictures, surveys, interviews or the use of questionnaires as
instruments of data collection.
b. Secondary data
Secondary data are data that have already been collected and processed
(manipulated) for some other purpose; they are obtained at regular intervals,
which may be weekly, monthly or yearly, but the interval must be regular.
Sources of such data are the annual reports of organizations, national data
sources (Central Banks-www.cenbank.org; Bureau of Statistics-
www.beureuofstatistic.org.ng), international data sources (World Bank-
www.wb.org; IMF-www.imf.org and so on) and other international development
agencies (International Labour Organization-www.ilo.org; United Nations
Development Programme-www.undp). These data are normally collected as time
series, that is, at equal intervals. If the data are collected for a particular
organization over time they are called Time Series Data (TSD); if they cut
across sectors (e.g., the banking sector and the manufacturing sector) they are
called Cross-Sectional Data (CSD), data gathered from different sectors or
groups at a single point in time. Panel data, also referred to as longitudinal
data, contain observations on different cross sections across time intervals.
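The pandas sketch below, an assumed representation with invented figures, illustrates the three structures just described: a time series for one unit, a cross-section of several sectors at one point in time, and a panel that combines both dimensions.

# Illustrative sketch (invented numbers) of time series, cross-sectional and panel data.
import pandas as pd

# Time series data (TSD): one unit observed at regular intervals
tsd = pd.Series([2.1, 2.4, 2.2, 2.8],
                index=pd.period_range("2019Q1", periods=4, freq="Q"),
                name="bank_X_profit")

# Cross-sectional data: several units observed at a single point in time
cross_section = pd.DataFrame(
    {"profit_2020": [2.8, 1.9, 3.4]},
    index=["Banking", "Manufacturing", "Agriculture"])

# Panel (longitudinal) data: several units observed across time
panel = pd.DataFrame(
    {"profit": [2.1, 2.4, 1.7, 1.9, 3.0, 3.4]},
    index=pd.MultiIndex.from_product(
        [["Banking", "Manufacturing", "Agriculture"], [2019, 2020]],
        names=["sector", "year"]))

print(tsd, cross_section, panel, sep="\n\n")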
A stochastic model represents a situation where uncertainty is present. In other words,
it is a model for a process that has some kind of randomness; the opposite is a
deterministic model, which predicts outcomes with 100% certainty. If a time series has a
unit root, it shows a systematic pattern that is unpredictable. Most data employed in
financial time series display trending behaviour or non-stationarity in the mean;
examples are asset prices, exchange rates and real GDP. It is important to determine the
most appropriate form of the trend in the data, and unit root tests are therefore tests
for stationarity in a time series. A time series is stationary if a shift in time does
not result in a change in the shape of the distribution, that is, if basic properties of
the distribution such as the mean, variance and covariance are constant over time.
Therefore stationarity means that the statistical properties of the process generating a
time series do not change over time. It does not mean that the series does not change
over time, just that the way it changes does not itself change over time. The algebraic
analogue is a linear function rather than a constant one: the value of a linear function
changes as x grows, but the way it changes remains constant; it has a constant slope, one
value that captures the rate of change.
Non-stationary time series variables are said to have a unit root because of the
mathematics behind the process: the process can be written as a series of monomials
(expressions with a single term), and each monomial corresponds to a root. If one of
these roots is equal to 1, then that is a unit root. The unit-root problem is concerned
with the existence of characteristic roots of a time series model on the unit circle. A
random walk model is

z_t = z_{t-1} + a_t                (1)

where a_t is a white noise process.
A white noise process is a random process X(t) whose power spectral density S_X(f) is
constant for all frequencies; this constant is usually denoted by N_0/2, so X(t) is
called a white noise process if S_X(f) = N_0/2 for all f.
The innovations {a_t} can also be a sequence of martingale differences, that is,
E(a_t | F_{t-1}) = 0, Var(a_t | F_{t-1}) is finite, and E(|a_t|^{2+δ} | F_{t-1}) < ∞ for
some δ > 0, where F_{t-1} is the σ-field generated by {a_{t-1}, a_{t-2}, ...}. It is
assumed that z_0 = 0. It will be seen later that this assumption has no effect on the
limiting distributions of unit-root test statistics.
Chan and Wei (1988) opined that a martingale is a sequence of random variables (a
stochastic process) for which, at a particular time, the conditional expectation of the
next value in the sequence is equal to the present value, regardless of all prior
values.
Figure 1B: Time series generated by a non-stationary process.
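A short simulation of the random walk in equation (1), with z_0 = 0 and Gaussian white-noise innovations, makes the non-stationary behaviour of such a series easy to reproduce. NumPy and all parameter values here are assumptions for illustration.

# Minimal simulation of equation (1): z_t = z_{t-1} + a_t with z_0 = 0,
# where a_t is Gaussian white noise. The cumulative sum shows the
# non-stationary, unpredictable trending behaviour.
import numpy as np

rng = np.random.default_rng(123)
T = 500
a = rng.normal(0.0, 1.0, size=T)   # white noise innovations a_t
z = np.cumsum(a)                   # z_t = z_{t-1} + a_t, starting from z_0 = 0

# The white noise itself is stationary; the random walk's variance grows with t.
print("variance of a_t (whole sample):", a.var().round(3))
print("variance of z_t (first half)  :", z[:250].var().round(3))
print("variance of z_t (second half) :", z[250:].var().round(3))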
If non-stationary variables are used in a regression, the standard assumptions for
asymptotic analysis are not valid. In other words, the usual "t-ratios" will not follow
a t-distribution, so we cannot validly undertake hypothesis tests about the regression
parameters.
Due to these properties, stationarity has become a common assumption for many
practices and tools in time series analysis. These include trend estimation,
forecasting and causal inference, among others.
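As a hedged illustration of testing for a unit root in practice, the sketch below applies statsmodels' augmented Dickey-Fuller (ADF) test, discussed further below, to a simulated random walk and to its first difference. The simulated data and the choice of statsmodels are assumptions, not part of the paper.

# Minimal sketch of a unit root (stationarity) test with the augmented
# Dickey-Fuller test, applied to an invented random walk and its first difference.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(5)
random_walk = np.cumsum(rng.normal(size=300))   # non-stationary: has a unit root
diffed = np.diff(random_walk)                   # differencing removes the unit root

for name, series in [("level", random_walk), ("first difference", diffed)]:
    stat, pvalue, *_ = adfuller(series, autolag="AIC")
    print(f"{name:16s}  ADF stat = {stat:6.2f}  p-value = {pvalue:.3f}")
# A large p-value means we cannot reject the null of a unit root (non-stationarity).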
Co-integration tests identify scenarios where two or more non-stationary time series
are integrated together in a way that they cannot deviate from equilibrium in the
long term. The tests are used to identify the degree of sensitivity of two variables to
the same average price over a specified period of time.
Before the introduction of co-integration tests, economists relied on linear
regressions to find the relationship between several time series processes. Granger
and Newbold (1974) argued that linear regression was an incorrect approach for
analyzing time series due to the possibility of producing spurious correlation. A
spurious correlation occurs when two or more associated variables are deemed
causally related due to either a coincidence or an unknown third factor; the possible
result is a misleading statistical relationship between several time series
variables. Engle and Granger (1987) formalized the co-integrating vector approach.
Their concept established that two or more non-stationary times series data are
integrated together in a way that they cannot move away from some equilibrium in
the long term.
The two economists argued against the use of linear regression to analyze the
relationship between several time series variables because detrending would not
solve the issue of spurious correlation. Instead, they recommended checking for co-
integration of the non-stationary time series. They argued that two or more time
series variables with I(1) trends can be co-integrated if it can be proved that there is
a relationship between the variables.
a. Engle-Granger Test
The Engle-Granger method applies the Augmented Dickey-Fuller Test (ADF), or other
unit root tests, to test for stationarity in a time series. If the time series are
co-integrated, the Engle-Granger method will show the stationarity of the residuals.
The limitation of the Engle-Granger method is that if there are more than two
variables, there may be more than one co-integrating relationship, which the method
cannot identify. Another limitation is that it is a single-equation model. However,
some of these drawbacks have been addressed in more recent co-integration tests such
as the Johansen and Phillips-Ouliaris tests. The Engle-Granger test can be carried
out using Stata or MATLAB software.
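Alongside Stata or MATLAB, an Engle-Granger style test can also be run in Python; the sketch below uses statsmodels' coint function on two invented I(1) series that share a common stochastic trend. The data and the library choice are assumptions made for illustration.

# Minimal sketch of an Engle-Granger style co-integration test on invented data.
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(11)
common_trend = np.cumsum(rng.normal(size=400))            # shared random walk
x = common_trend + rng.normal(scale=0.5, size=400)        # both series are I(1)...
y = 2.0 * common_trend + rng.normal(scale=0.5, size=400)  # ...but co-integrated

t_stat, p_value, crit_values = coint(y, x)
print(f"EG t-statistic = {t_stat:.2f}, p-value = {p_value:.4f}")
# A small p-value rejects "no co-integration": the residuals of y on x are stationary.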
b. Johansen Test
The Johansen test is used to test for co-integrating relationships between several non-
stationary time series. Compared to the Engle-Granger test, the Johansen test allows
for more than one co-integrating relationship. However, it relies on asymptotic
properties (large sample theory), a framework for assessing the properties of
estimators and statistical tests. Within this framework, it is often assumed that the
sample size n may grow indefinitely; the properties of estimators and tests are then
evaluated in the limit as n → ∞. In practice, a limit evaluation is considered to be
approximately valid for large finite sample sizes too, so a small sample size would
produce unreliable results. Using the test to find co-integration of several time
series avoids the issues created when errors are carried forward to the next step.
Johansen's test comes in two main forms, i.e., the trace test and the maximum
eigenvalue test.
c. Trace tests
Trace tests evaluate the number of linear combinations (i.e., co-integrating
relationships) in the time series data, testing the hypothesis that this number K
equals a value K0 against the alternative that K is greater than K0. It is
illustrated as follows:
H0: K = K0
H1: K > K0
When using the trace test to test for co-integration in a sample, we set K0 to zero
and test whether the null hypothesis will be rejected. If it is rejected, we can
deduce that there exists a co-integrating relationship in the sample. Therefore, the
null hypothesis should be rejected to confirm the existence of a co-integrating
relationship in the sample.
d. Maximum Eigenvalue test
The maximum eigenvalue test is similar, except that it tests K = K0 against the
alternative K = K0 + 1. A finding that K equals the number of variables M would imply
that there are M possible linear combinations. Such a scenario is impossible unless
the variables in the time series are stationary.
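The sketch below runs the Johansen procedure on an invented three-variable system driven by one common stochastic trend and prints both the trace and the maximum eigenvalue statistics alongside their 95% critical values; the data and the use of statsmodels are assumptions for illustration.

# Minimal sketch of the Johansen test (trace and maximum eigenvalue statistics).
import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen

rng = np.random.default_rng(21)
n = 400
trend = np.cumsum(rng.normal(size=n))                   # one shared unit-root trend
data = np.column_stack([
    trend + rng.normal(scale=0.4, size=n),
    0.5 * trend + rng.normal(scale=0.4, size=n),
    -1.0 * trend + rng.normal(scale=0.4, size=n),
])

# det_order=0: constant term; k_ar_diff=1: one lagged difference in the VECM
result = coint_johansen(data, det_order=0, k_ar_diff=1)
print("trace statistics      :", result.lr1.round(2))   # H0: K = K0 vs K > K0
print("trace 95% crit values :", result.cvt[:, 1].round(2))
print("max-eig statistics    :", result.lr2.round(2))   # H0: K = K0 vs K = K0 + 1
print("max-eig 95% crit vals :", result.cvm[:, 1].round(2))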
Section V: Summary and Conclusion
Data analysis helps expose grey areas in your business and effectively counter them by
deploying different strategies. It will clearly indicate the areas where an online or
offline business lags, and you can then devise ways and means to counter that.
REFERENCES
Caner, M. and L. Kilian (2001). Size distortions of tests of the null hypothesis of
stationarity: Evidence and implications for the PPP debate. Journal of International
Money and Finance, Vol. 20 (2). Pg. 639-657.
Dickey, D. and W. Fuller (1979). Distribution of the estimators for autoregressive time
series with a unit root. Journal of the American Statistical Association, Vol. 74, (9)
pg. 427-431.
Dickey, D. and W. Fuller (1981). Likelihood ratio statistics for autoregressive time series
with a unit root. Econometrica Vol. 49 (3). Pg.1057-1072.
Elliot, G., T.J. Rothenberg, and J.H. Stock (1996). Efficient tests for an autoregressive unit
root. Econometrica. Vol. 64 (8). Pg.813-836.
Everitt, B. S. and Skrondal, A. (2010). The Cambridge dictionary of statistics. Cambridge:
Cambridge University Press.
Fuller, W. (1996). Introduction to statistical time series, Second Edition. New York; John
Wiley.
Gottschalk, L. A. (1995). Content analysis of verbal behavior: New findings and clinical
applications. New Jersey: Lawrence Erlbaum Associates, Inc
Hamilton, J. (1994). Time series analysis. New Jersey: Princeton University Press.
Hatanaka, T. (1995). Time-series-based econometrics: unit roots and co-integration.
Oxford: Oxford University Press.
Kimball, R., Ross, M., Thornthwaite, W., Mundy, J. and Becker, B. (2008). The data
warehouse lifecycle toolkit. New Jersey: Wiley Publishing. ISBN 978-0-470-14977-5.
Kwiatkowski, D., P.C.B. Phillips, P. Schmidt and Y. Shin (1992). Testing the null
hypothesis of stationarity against the alternative of a unit root. Journal of
Econometrics; Vol.54 (5). Pg 159-178.
MacKinnon, J. (1996). Numerical distribution functions for unit root and cointegration
tests. Journal of Applied Econometrics; Vol. 11 (8). Pg. 601-618.
Maddala, G.S. and Kim, I.-M. (1998). Unit roots, cointegration and structural change.
Oxford: Oxford University Press.
Ng, S., and P. Perron (1995). Unit root tests in ARMA models with data-dependent
methods for the selection of the truncation Lag. Journal of the American Statistical
Association; Vol. 90 (6). Pg. 268-281.
Ng, S., and P. Perron (2001). Lag length selection and the construction of unit root tests
with good size and power. Econometrica; Vol. 69 (7). Pg. 1519-1554.
Perron, P. and S. Ng. (1996). Useful modifications to some unit root tests with dependent
errors and their local asymptotic properties. Review of Economic Studies; Vol. 63
(23). Pg. 435-463.
Phillips, P.C.B. (1987). Time series regression with a unit root. Econometrica, Vol. 55
(21). Pg. 227-301.
Phillips, P.C.B. and P. Perron (1988). Testing for unit roots in time series regression.
Biometrika; Vol. 75 (6). Pg. 335-346.
Phillips, P.C.B. and Z. Xiao (1998). A primer on unit root testing. Journal of Economic
Surveys; Vol. 12 (6). Pg. 423-470.
Schwert, W. (1989). Test for Unit Roots: A Monte Carlo investigation. Journal of
Business and Economic Statistics; Vol. 7 (1). Pg. 147-159.
Said, S.E. and D. Dickey (1984). Testing for unit roots in autoregressive moving-average
models with unknown order. Biometrika; Vol. 71 (1). Pg. 599-607.
Stock, J.H. (1994). Unit roots, structural breaks and trends. In R.F. Engle and D.L.
McFadden (eds.), Handbook of Econometrics, Vol. IV. (2).
Jeans, M. E. (1992). Clinical significance of research: A growing concern. Canadian
Journal of Nursing Research; Vol. 24 (1). Pg. 1-4.
Resnik, D. (2000). Statistics, ethics, and research: an agenda for educations and reform.
Accountability in Research; Vol. 8 (4). Pg. 163-88
Shamoo, A.E. and Resnik, D.B. (2003). Responsible Conduct of Research. Oxford: Oxford
University Press.
Shamoo, A.E. (1989). Principles of Research Data Audit. New York; Gordon and Breach.
Smeeton, N. and Goda, D. (2003). Conducting and presenting social work research: some
basic statistical considerations. British Journal of Social Work; Vol. 33 (6). Pg. 567-
573.
Vogt, W.P. (2005). Dictionary of statistics & methodology: A nontechnical guide for the
social sciences. SAGE.