
TOPIC 2

DATA PREPARATION

LEARNING OUTCOMES
At the end of this topic, you should be able to:
• Describe the characteristics of quality data
• Apply the techniques to treat anomalies in a data set

INTRODUCTION
In Topic 1, you learnt that a data analytics project workflow consists of raw data collection, data
preparation, modelling and deployment. It is commonly said that data scientists spend 80% of their time
preparing data and only 20% analysing it. Data preparation is vital since analysing dirty data can lead
to inaccurate conclusions. In this topic, you will learn about the basic characteristics of quality data
and how you can treat anomalies in a data set.

2.1 QUALITY DATA


Data quality and data understanding are the first and most important part of any data analytics project.
This step ensures that the data is of sufficient quality and fit to be used in model development. The key
to a good data analytics project is quality data. Just as a chef uses good ingredients to ensure the food
he makes is delicious, the same analogy applies in data analytics: to produce good analytics results, you
need a good dataset.

There are three basic questions that you need to ask when you deal with data:
1. Is the data relevant?
2. Are the data connected?
3. Do you have sufficient data?

To understand relevancy in a dataset, let’s say you want to estimate the frequency of crime occurrence
in a particular city. You have two sets of data as follows:

Dataset 1: diabetes patient count and number of babies born
Dataset 2: number of police officers, number of illegal immigrants and frequency of crime

Can you estimate the frequency of crime occurrence from the diabetes patient count and the number of
babies born in Dataset 1? It is difficult to relate diabetes and births to the frequency of crime in the
city. Dataset 2 is more relevant, as we can see a correlation between the number of police officers, the
number of illegal immigrants and the frequency of crime in the city.

To understand connectivity within a dataset, let’s say we want to predict corrosion rate. We have a
dataset consisting of sensor recordings of salt, alkalinity (pH), iron oxide, chloride, temperature and
wall thickness (CR). We can see missing values in this dataset, which indicates that the data inputs are
not connected. Too much missing data can result in less accurate corrosion predictions.
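As an illustration, the short Python (pandas) sketch below checks how many values are missing in each sensor variable, which is a quick way to gauge how "connected" the data inputs are. The column names and readings are hypothetical, loosely based on the variables listed above.

```python
import pandas as pd

# Hypothetical corrosion-sensor readings; None marks a missing reading.
df = pd.DataFrame({
    "salt":        [0.12, 0.15, None, 0.11, 0.14],
    "pH":          [7.1,  None, 6.9,  7.0,  None],
    "iron_oxide":  [0.03, 0.04, 0.05, None, 0.04],
    "chloride":    [210,  215,  None, 220,  218],
    "temperature": [65,   66,   64,   None, 67],
    "CR":          [0.80, 0.79, 0.81, 0.78, None],
})

# Count missing values per variable.
print(df.isna().sum())

# Fraction of rows that are complete across all sensors.
print(f"Complete rows: {df.dropna().shape[0] / len(df):.0%}")
```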

To understand the sufficiency of our data, let’s say we have a low-resolution image of a town.

Can you decide, based on this image, in which part of the town you would rent a room? Most probably not,
because the image lacks detail. A higher-resolution image of the town would support a higher-quality
analysis and a better-informed decision.

To make good decisions, we must have complete and sufficient data. The data inputs must also be correlated
with the target to ensure that they are relevant for our analysis. The second step after assessing data
quality is therefore to select which data inputs are correlated. Correlation between data inputs is usually
measured using the Chi-squared test, while the relevancy of each data input to the target is measured using
the R-squared test. An R-squared value of more than 70% commonly indicates that a data input can be used
for model development, although this threshold also depends on the application domain and requires domain
expertise. Let’s watch the video of a case study on correlation between data inputs and the target (refer
to the video “Case Study: Correlation Study” in ULearnX - video on corrosion correlation study from
Dr Izzatdin’s slides)
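To illustrate the R-squared check against the target, here is a minimal Python sketch using scipy.stats.pearsonr. The predictor names and readings are hypothetical, and the 70% cut-off is applied only as an example of the rule of thumb mentioned above.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical predictors and a corrosion-rate target (CR).
df = pd.DataFrame({
    "salt":        [0.10, 0.12, 0.15, 0.18, 0.20, 0.22],
    "temperature": [60,   62,   65,   67,   70,   73],
    "chloride":    [200,  205,  212,  218,  225,  230],
    "CR":          [0.70, 0.73, 0.78, 0.81, 0.86, 0.90],
})

target = df["CR"]
for col in ["salt", "temperature", "chloride"]:
    r, p_value = pearsonr(df[col], target)   # Pearson correlation and its p-value
    r_squared = r ** 2                       # R-squared of the input against the target
    verdict = "keep" if r_squared > 0.70 else "review with a domain expert"
    print(f"{col}: R^2 = {r_squared:.2f}, p = {p_value:.3f} -> {verdict}")
```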

2.2 TREATING DATA ANOMALIES


In reality, real-world data sets can be daunting. Too often we receive questionable data with many empty
cells. How do we handle the missing values, let alone predict the outcome?

One common data problem is anomalies. Among the methods that can be used to treat anomalies are posterior
predictive distribution, listwise deletion and mean imputation. You have seen how to perform posterior
predictive distribution in the previous video on the correlation case study. Listwise deletion means that
any row in a data set is removed from the analysis if it is missing a value for any variable (data input)
used in the analysis. The rule of thumb for listwise deletion is that the noise within the data should be
less than 5% of the data; if we apply listwise deletion to data with more than 5% noise, we will lose
legitimate values and distort our model’s accuracy. Averaging using the mean, also called mean imputation,
is a method in which the mean of all values for each variable is computed and the missing values for that
variable are imputed, or replaced, with that mean.
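The following Python (pandas) sketch shows one possible way to apply the two techniques just described: listwise deletion when the affected rows are below the 5% rule of thumb, and mean imputation otherwise. The data frame and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sensor data with a few missing readings.
df = pd.DataFrame({
    "temperature": [65.0, None, 67.0, 66.0, 68.0],
    "chloride":    [210,  215,  None, 220,  218],
    "CR":          [0.80, 0.79, 0.81, None, 0.83],
})

# Fraction of rows that are missing at least one value.
missing_fraction = df.isna().any(axis=1).mean()

if missing_fraction < 0.05:
    # Listwise deletion: drop any row with a missing value in any column.
    cleaned = df.dropna()
else:
    # Mean imputation: replace each missing value with that column's mean.
    cleaned = df.fillna(df.mean())

print(f"Rows with missing values: {missing_fraction:.0%}")
print(cleaned)
```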

Let’s watch the video of a case study that performs data preparation on a data set, including listwise
deletion, replacing some values, a correlation check and variable importance (refer to the video “Case
Study: Data Preparation” in ULearnX - video on data preparation for machine failure prediction from
Dr Izzat’s slides)

2.3 DATA TRANSFORMATION AND VARIABLE IMPORTANCE


To build a “clever” machine learning algorithm, we need to train it with a training set that consists of
multiple classes or a variety of conditions.

Video:
https://recordings.roc2.blindsidenetworks.com/utp/9fae5da41c405159fd1fa930136bb0345ac2dde5-1620694069515/capture/

In machine learning modelling, we have to choose which parameters (i.e., variables) of the data will be
the target and which will be the predictors. The former and the latter are also known as the Y and X
variables. In some circumstances, we may find that not all variables in our data are relevant for
modelling. Hence, we require correlation analysis to evaluate the importance of each X variable against
the Y variable. In this analysis, we have to evaluate the R-squared, significance and p-value measurements.
Video:
https://recordings.roc2.blindsidenetworks.com/utp/9fae5da41c405159fd1fa930136bb0345ac2dde5-1620694069515/capture/
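As a sketch of such a correlation analysis, the Python snippet below uses scipy.stats.linregress to report the R-squared and p-value of each X variable against the Y variable. The variable names and values are assumptions for illustration, not taken from the case-study data set.

```python
from scipy.stats import linregress

# Hypothetical sensor histories (X variables) and a health-related target (Y).
x_variables = {
    "vibration":   [0.2, 0.3, 0.4, 0.5, 0.7, 0.8],
    "temperature": [60, 61, 63, 66, 70, 74],
}
y = [100, 95, 88, 80, 70, 62]   # e.g. a remaining health indicator

for name, x in x_variables.items():
    result = linregress(x, y)
    # rvalue**2 is the R-squared; pvalue tests whether the slope is significant.
    significant = result.pvalue < 0.05
    print(f"{name}: R^2 = {result.rvalue ** 2:.2f}, "
          f"p-value = {result.pvalue:.4f}, significant = {significant}")
```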

SELF-LEARNING ACTIVITY
The previous video related to data preparation for machine failure prediction. You are going to prepare
the same data set for a simple prediction model that predicts how long a machine stays operational until
performance degradation happens and it eventually fails. This quantity is known as the Remaining Useful
Life (RUL). The steps in preparing the data must include:
1. Data transformation – aggregation, introducing the time step to failure
2. Plotting the trendline – determining the linear trend and the failure threshold
3. Using the cleansed data set, recording the TIME STEP to failure as follows (see the sketch after this list):
• Each FAILURE event can be considered as a single EVENT – this can be done through an
aggregation function.
• Include all RUNNING events in your new tab
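The sketch below illustrates, with a made-up event log, how the time step to failure could be derived by treating each machine's FAILURE rows as a single event and labelling the RUNNING rows with their remaining cycles. The column names (machine_id, cycle, status) are assumptions and will differ from the actual data set used in the video.

```python
import pandas as pd

# Hypothetical event log; column names are illustrative assumptions.
events = pd.DataFrame({
    "machine_id": [1, 1, 1, 1, 2, 2, 2],
    "cycle":      [1, 2, 3, 4, 1, 2, 3],
    "status":     ["RUNNING", "RUNNING", "RUNNING", "FAILURE",
                   "RUNNING", "RUNNING", "FAILURE"],
})

# Aggregate each machine's FAILURE rows into a single event (first failure cycle).
failure_cycle = (events[events["status"] == "FAILURE"]
                 .groupby("machine_id")["cycle"]
                 .first()
                 .reset_index(name="failure_cycle"))

# Label every row with its time step to failure (cycles remaining until failure).
labelled = events.merge(failure_cycle, on="machine_id")
labelled["time_step_to_failure"] = labelled["failure_cycle"] - labelled["cycle"]

# Keep all RUNNING events, each now carrying a simple RUL-style label.
running = labelled[labelled["status"] == "RUNNING"]
print(running)
```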

SUMMARY
As you have learnt in this topic, you must ensure that the data you are going to use for modelling or
analysis is relevant, connected and sufficient. You have also learnt some of the methods that you can use
to handle anomalies in a data set. The following topic covers data quality management, in which you will
be introduced to a data quality framework used to check for quality data.
KEYWORDS
data quality, anomalies, missing values, listwise deletion, imputation
