Data Wrangling


Lesson 8

Data Wrangling: Data Management and Data Cleaning Strategies

INTRODUCTION

The quality and organisation of data are essential for meaningful analysis in today's
data-driven society. Data wrangling, sometimes referred to as data munging or data
remediation, is the process of converting raw data into a clean, organised, and enriched
format that is ready for analysis. The process involves discovering and understanding the
raw data, structuring it for analysis, cleaning it to remove errors and missing values,
enriching it with additional information, validating its accuracy, and publishing it for use.
The aim of data wrangling is to make the data reliable, accurate, and easily
accessible, freeing data scientists and analysts to concentrate on extracting meaningful
insights. The process is essential for improving data quality and preparing it for analysis,
which in turn supports better decision-making.

LEARNING OBJECTIVES
1. Recognize the significance of data wrangling
2. Identify the key steps in the data wrangling process
3. Apply data cleaning techniques
4. Apply techniques for structuring and enriching data
5. Validate and publish data

LECTURE DISCUSSION

Raw data, which is sometimes called source data or primary data, is data that
have not been processed or prepared for analysis.
Data wrangling is the process of cleaning, transforming, and managing raw data
so it is more reliable and can be more easily accessed and used for analysis. This
process, which is also sometimes referred to as data munging or data remediation,
involves transforming raw data that cannot easily be analyzed into a new data set that is
designed specifically to support analysis.
Data wrangling activities include the following (a short code sketch after this list
illustrates several of them):
• Merging multiple data sources into a single data set
• Identifying missing values in data and either filling them in or deleting the affected
record or observation
• Identifying erroneous values in data and replacing them with the correct
values
• Identifying and deleting duplicate observations or records in data
• Identifying extreme values in data, determining whether they are legitimate,
and taking appropriate action
• Subsetting data to facilitate analyses and comparisons of subgroups of the
raw data.
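A minimal pandas sketch of several of these activities follows; the file names
(orders.csv, customers.csv) and column names are hypothetical, and the three-standard-deviation
rule is just one common way to flag extreme values.

```python
import pandas as pd

orders = pd.read_csv("orders.csv")        # hypothetical raw data source
customers = pd.read_csv("customers.csv")  # hypothetical second source

# Merge multiple data sources into a single data set
df = orders.merge(customers, on="customer_id", how="left")

# Identify missing values and either fill them or delete the record
df["quantity"] = df["quantity"].fillna(0)       # fill a missing count with 0
df = df.dropna(subset=["customer_id"])          # delete records missing the key

# Identify and delete duplicate records
df = df.drop_duplicates()

# Identify extreme values: flag amounts more than 3 standard deviations from the mean
z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
outliers = df[z.abs() > 3]                      # inspect before taking action

# Subset the data to facilitate analysis of a subgroup
west = df[df["region"] == "West"]
```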

The data wrangling process can be manual, automated, or a combination of the
two. The process can also be ad hoc or systematic.
As the size, complexity, and importance of the raw data increase, an automated and
systematic approach to data wrangling becomes more necessary. No matter how it is
accomplished, the objective of data wrangling is to facilitate the efforts of data scientists
and analytics professionals to focus on their analyses and the extraction of meaningful
information from the data.
Although each data wrangling project is unique, the objective is to produce a final
data set that is accurate, reliable, and accessible. This objective leads to a series of
steps that are commonly taken in the data wrangling process:
• Discovery: The goal of the discovery phase is for the analyst to become familiar
with the raw data in order to be able to conceptualize how it might be used.
During this step, the analyst often discovers potential issues that will need to be
addressed later in the data wrangling process. Just as data wrangling is a
precursor to meaningful analysis, discovery is a precursor to the data wrangling
steps that follow, and this step should not be overlooked or neglected.
• Structuring: In the structuring phase, the analyst arranges the raw data so it can
be more readily analyzed in the intended manner.
• Cleaning: The goal of the cleaning phase is to find and correct errors in the raw
data that might distort the ensuing analyses. This can include addressing missing
or incorrect values.
• Enriching: In this step, the analyst may augment the raw data by incorporating
values from other data sets and/or applying transformations to portions of the
existing data to ensure that all data required for the ensuing analyses will be
included in the wrangled data set.
• Validating: In the validating phase, the analyst verifies that the wrangled data
are accurate and reliable and that they are ready for the ensuing analyses.
• Publishing: In this step, the analyst creates a file containing the wrangled data
and documentation of the file’s contents, and makes these available to their
intended users in a format they can use.

Discovery
Data wrangling starts with discovery, as the analyst becomes familiar with, and
understands, the raw data with an eye toward how the data should be organized to
facilitate its use and analysis. The analyst should come to this step with a thorough
understanding of how these data may be analyzed and consider this as the discovery
process proceeds. This understanding includes how the raw data can be accessed, the
structure of the raw data, whether the data are numerical or text, how they are arrayed or
arranged, and the types of file(s) that contain the raw data.

Accessing Data
Accessing the data involves importing the raw data into whatever software is to be used
to wrangle the raw data. Raw data can come in many formats, including:
• Delimited text (txt) files, comma separated value (csv) files, JavaScript Object
Notation (JSON) files, XML files, Excel (xlsx) files, and HTML files
• Databases such as Microsoft Access, SQL Server, or Oracle
• Web pages/HTML files
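The sketch below shows how these formats might be imported with pandas; the file
names and the sales table are hypothetical.

```python
import pandas as pd

df_csv  = pd.read_csv("data.csv")                   # comma-separated values
df_txt  = pd.read_csv("data.txt", sep="\t")         # tab-delimited text
df_json = pd.read_json("data.json")                 # JSON
df_xml  = pd.read_xml("data.xml")                   # XML (pandas 1.3+, needs lxml)
df_xlsx = pd.read_excel("data.xlsx")                # Excel workbook (needs openpyxl)
tables  = pd.read_html("https://example.com/page")  # list of tables found in HTML

# Databases can be read through a SQL connection
import sqlite3
conn = sqlite3.connect("company.db")                # hypothetical database file
df_db = pd.read_sql("SELECT * FROM sales", conn)
```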

See the example and procedure in Business Analytics, Fifth Edition, by Jeffrey D. Camm,
James J. Cochran, Michael J. Fry, and Jeffrey W. Ohlmann: Chapter 4, Data Wrangling:
Data Management and Data Cleaning Strategies, pp. 153-158.

Structuring

Once the analyst understands the raw data and how it should be organized to
facilitate its analysis, the focus turns to structuring the raw data file so it can be more
readily analyzed in the intended manner. This may include formatting the data fields,
deciding how the data are to be arranged, splitting one field that contains several
important pieces of information into separate fields, and merging several fields into a
single, more meaningful field.

Data Formatting
The two primary forms of data are numeric and text. Numeric data are data that
are stored in a manner that allows mathematical operations to be performed on them.
Data of this type generally represent a count or measurement. Software can often be
used to apply specific formats to fields in raw data. For example, Excel offers several
formats that can be applied to numerical data, including Currency, Accounting, Long Date,
Short Date, Time, Percentage, and Scientific Notation.
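As a rough analogue in code, the pandas sketch below stores values in numeric and
date types so that mathematical operations work, and applies a currency-style format
for display only; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["19.99", "5.00"],               # numbers arrived as text
    "sold_on": ["2024-01-15", "2024-02-01"],  # dates arrived as text
})

df["price"] = df["price"].astype(float)        # numeric type: math now works
df["sold_on"] = pd.to_datetime(df["sold_on"])  # proper date type

# Formatting for display; the underlying values stay numeric
print(df["price"].map("${:,.2f}".format))
```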

Arrangement of Data
We have considered data sets that are arrayed in a manner such that every row
(record) represents an observation on a set of variables (fields), and the values of those
variables are listed in the columns. However, there are multiple ways to organize the data
within this structure, particularly if the data contain categorical variables.

• Stacked data are organized so that the values for each variable are stored
in a single field.
• Unstacked data are organized so that the values of one variable are spread
across separate fields, one for each category of some categorical variable.
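The pandas sketch below converts a small hypothetical sales table between the two
layouts: melt stacks the data, and pivot unstacks it again.

```python
import pandas as pd

# Unstacked: one field per region
unstacked = pd.DataFrame({
    "Month": ["Jan", "Feb"],
    "East": [100, 110],
    "West": [90, 95],
})

# Stacked: one "Region" field holds the category, one "Sales" field holds the values
stacked = unstacked.melt(id_vars="Month", var_name="Region", value_name="Sales")

# Unstack again: pivot the categories back into separate fields
back = stacked.pivot(index="Month", columns="Region", values="Sales").reset_index()
```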

Splitting a Single Field into Multiple Fields

Sometimes one field in the raw data contains several pieces of information that
would be more useful if divided into separate fields. For example, dates, addresses, and
times often contain multiple pieces of information that may be more useful if divided into
separate fields. Consider the Address field in the file tblbrewersaddresses shown in
Figure 4.2. The information contained in this field might be easier to use if the street
address, city, state, and zip code were each in its own field. Such an arrangement would
allow us to sort on City or Zip Code and analyze the data in that manner.
Many data wrangling software packages provide tools for splitting a single field into
multiple fields. In Excel, on the Data tab in the Data Tools group, the Text to Columns
button invokes the Convert Text to Columns Wizard that can be used for this purpose.
Useful operations that can be performed with the Convert Text to Columns Wizard
include splitting text on a chosen delimiter (such as a comma or space) and splitting
fixed-width text into columns.
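The pandas sketch below performs the same kind of split in code, dividing a
hypothetical Address field on its comma delimiter into four new fields.

```python
import pandas as pd

df = pd.DataFrame({"Address": ["100 Main St, Cincinnati, OH, 45202"]})

# Split on the comma delimiter into four new fields
parts = df["Address"].str.split(", ", expand=True)
df[["Street", "City", "State", "Zip"]] = parts
```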

Combining Multiple Fields into a Single Field

Sometimes several fields in the raw data contain information that would be more
useful if combined into a single field. For example, if we received the data shown in Figure
4.14 but needed the prefix, first name, middle name, and last name all in a single field
(perhaps to facilitate the creation of shipping labels), we would want to combine these
four fields into a single field.
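A minimal pandas sketch of this combination is shown below; the name fields are
hypothetical, and any missing pieces (such as an absent middle name) are skipped.

```python
import pandas as pd

df = pd.DataFrame({
    "Prefix": ["Dr."], "First": ["Ada"], "Middle": ["M."], "Last": ["Lovelace"],
})

# Join the four fields with spaces, skipping any missing values
df["FullName"] = df[["Prefix", "First", "Middle", "Last"]].apply(
    lambda row: " ".join(p for p in row if pd.notna(p)), axis=1
)
```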

Cleaning

Once the discovery and structuring phases of data wrangling are complete, the
analyst moves on to cleaning the raw data. This step includes identifying missing data,
erroneous data, duplicate records, and outliers and determining the best actions for
addressing these issues. Data cleaning makes heavy use of descriptive statistics and
data-visualization methods to identify missing data, erroneous data, and outliers.

Missing Data

Data sets commonly include observations with missing values for one or more
variables. In some cases, there is an appropriate reason for a value of a field to be
missing; these are called legitimately missing data.
For example, respondents to a survey may be asked if they belong to a fraternity
or a sorority, and then in the next question are asked how long they have belonged to a
fraternity or a sorority. If a respondent does not belong to a fraternity or a sorority, they
should skip the ensuing question about length of membership. Generally, no remedial action is taken
for legitimately missing data.
In other cases, there is not an appropriate reason for the value of a field to be
missing; these are called illegitimately missing data. These cases can result for a
variety of reasons, such as a respondent electing not to answer a question that they are
expected to answer, a respondent dropping out of a study before its completion, or
sensors or other electronic data collection equipment failing during a study. Remedial
action is considered for illegitimately missing data. If possible, the original source of the
data should be checked for all instances of illegitimately missing data to establish whether
the values that are missing can be determined. If the missing value can be found in the
source of the data, the verified value from the source of the data should be used to replace
the missing value.
If the missing value cannot be found at its source, the primary options for addressing
missing data (the first three are sketched in the code after this list) are:
1. To discard records (rows) with any missing values;
2. To discard any field with missing values;
3. To fill in missing entries with estimated values; or
4. To apply a data-mining algorithm (such as classification and regression trees) that
can handle missing values.
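The sketch below illustrates the first three options in pandas on a small hypothetical
table; option 3 fills with the column mean purely as one example of an estimated value.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [34, np.nan, 29],
    "income": [52000, 61000, np.nan],
})

opt1 = df.dropna()                            # 1. discard records with any missing value
opt2 = df.dropna(axis="columns")              # 2. discard any field with missing values
opt3 = df.fillna(df.mean(numeric_only=True))  # 3. fill with an estimate (column mean)
```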
Deciding on a strategy for dealing with missing data requires some understanding of
why the data are missing and the potential impact these missing values might have on an
analysis. If the tendency for an observation to be missing the value for some variable is
entirely random, then whether data are missing does not depend on either the value of
the missing data or the value of any other variable in the data. In such cases the missing
value is called missing completely at random (MCAR). For example, if a missing value
for a question on a survey is completely unrelated to the value that is missing and is also
completely unrelated to the value of any other question on the survey, the missing value
is MCAR.
However, the occurrence of some missing values may not be completely at random.
If the tendency for an observation to be missing a value for some variable is related to the
value of some other variable(s) in the data, the missing value is called missing at random
(MAR). For data that are MAR, the reason for the missing values may determine its
importance. For example, if the responses to one survey question collected by a specific
employee were lost due to a data entry error, then the treatment of the missing data may
be less critical. However, in a health care study, suppose observations corresponding to
patient visits are missing the results of diagnostic tests whenever the doctor deems the
patient too sick to undergo the procedure. In this case, the absence of a variable
measurement provides additional information about the patient’s condition, which may
aid in understanding other relationships in the data.
A third category of missing data is missing not at random (MNAR). Data are MNAR if
the tendency for the value of a variable to be missing is related to the value that is missing.
For example, survey respondents with high incomes may be less inclined than
respondents with lower incomes to respond to the question on annual income, and so
these missing data for annual income are MNAR.

Enriching
The discovery, structuring, and cleaning steps of the data wrangling process
provide a deeper understanding of the raw data. In the enriching step, we use this
understanding in thinking about what information would be useful to have in the data set.
Here we consider how we might want to simplify the data by selecting the subset of fields
or records that will be useful for the ensuing analyses, supplement the raw data with
additional data, or enhance the raw data through transformations on its fields. Although
enriching the data is an optional step, it can be extremely important if the raw data do
not meet the requirements of the ensuing analysis.
Subsetting Data

Subsetting is the extraction from the data of the fields and records that will be
useful for the ensuing analyses. This creates a more manageable data set that can be processed and
analyzed more efficiently. Some software subsets data by selecting or retaining the fields
and records that will be useful for ensuing analyses, and other software subsets data by
excluding or dropping the fields and records that will not be useful for ensuing analyses.
In either approach, the result is a new data set that contains only the fields and records
that are to be used in the upcoming analyses.
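The pandas sketch below shows both styles of subsetting on a hypothetical table:
retaining the useful fields and records, and dropping the ones that will not be used.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "region": ["East", "West", "East"],
    "notes": ["a", "b", "c"],
})

# Retain: select the useful records and fields
keep = df.loc[df["region"] == "East", ["id", "region"]]

# Drop: exclude the fields that will not be used
drop = df.drop(columns=["notes"])
```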

Supplementing Data

We often identify additional data that we want to add to the records in the raw data
or receive raw data in multiple files. In these instances, these data must be combined into
a single data set if they are all potentially relevant to the ensuing analyses. If we have
different data for individual records in multiple files, we combine the data from these files
by appending fields. If each record occurs no more than once in each data set, we
combine the data for each unique record by using a one-to-one merger of the data sets.
If we have different data for individual records in multiple locations and at least one record
occurs more than once in at least one of the data sets, we combine the data for each
unique record by using a one-to-many merger of the data sets. The two operations are
similar; they differ only in whether a record in one data set can match multiple records
in the other.
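The pandas sketch below illustrates both mergers on hypothetical tables; the validate
argument makes pandas raise an error if the data violate the expected one-to-one or
one-to-many relationship.

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ana", "Ben"]})
limits    = pd.DataFrame({"cust_id": [1, 2], "credit": [500, 700]})
orders    = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [20, 35, 50]})

# One-to-one: each record occurs at most once in each data set
one_to_one = customers.merge(limits, on="cust_id", validate="1:1")

# One-to-many: a customer record matches several order records
one_to_many = customers.merge(orders, on="cust_id", validate="1:m")
```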

Enhancing Data

When we make a transformation on one or more fields in a data set, we change
the format, structure, or values of the data in a well-defined, systematic manner. This is
sometimes done for presentation; for example, an analyst may reformat a field to show
as a percentage because that will communicate more effectively with the target audience.
However, transformations on fields are also frequently used to facilitate modeling.
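As a small illustration, the sketch below derives a hypothetical percentage field from
two existing fields, a transformation useful both for presentation and for modeling.

```python
import pandas as pd

df = pd.DataFrame({"defects": [3, 1], "units": [150, 200]})

# New field expressed as a percentage to communicate more effectively
df["defect_rate_pct"] = 100 * df["defects"] / df["units"]
```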

Validating and Publishing

We conclude the steps in the data wrangling process with ensuring the data are
accurate and accessible. This is accomplished through validating and publishing.

Validating
In the validation step, the analyst checks the accuracy and consistency of data
before it is made available for analysis or used for other business operations. This
includes i) checking that the information in the wrangled data correctly represents the
information provided by the raw data and ii) determining that the fields in the data sets
(particularly new fields created in the data wrangling process) are defined in the same
manner for all records in the data.
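One way to automate such checks is sketched below; the consistency rules and column
names are hypothetical examples.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "defect_rate_pct": [2.0, 0.5, 1.5]})

# Each check raises an error if the wrangled data violate a rule
assert df["id"].is_unique, "duplicate record identifiers"
assert df["id"].notna().all(), "missing record identifiers"
assert df["defect_rate_pct"].between(0, 100).all(), "percentage out of range"
```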

Publishing
Once the raw data have been wrangled into a high-quality data set that is ready
for analyses, we enter the publishing step of data wrangling. At this point, we make the
newly wrangled data available in a location that makes it easy to access and in a format
that is easy to use. If we fail to accomplish either of these goals in the publishing step, all
the work we performed in the first five steps of the data wrangling process is fruitless.
In addition to making the wrangled data available in a format suitable for analyses,
a data dictionary must be created for the data in the publishing step. The data dictionary
documents characteristics of the data such as names and definitions of the fields, units
of measure used in the fields, the source(s) of the raw data, relationship(s) of the wrangled
data with other data, and other attributes. Thus, an effective data dictionary is critical
because it ensures that everyone who uses the wrangled data can understand the data
and use them in appropriate ways across various analyses.
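A minimal sketch of the publishing step is shown below: the wrangled data are written
to a widely readable csv file, with a small data dictionary alongside it; the file names
and field definitions are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "defect_rate_pct": [2.0, 0.5]})

# Publish the wrangled data in an easy-to-use format
df.to_csv("wrangled_data.csv", index=False)

# Publish a data dictionary documenting the fields and their units
dictionary = pd.DataFrame({
    "field": ["id", "defect_rate_pct"],
    "definition": ["unique record identifier", "defects per 100 units"],
    "units": ["none", "percent"],
})
dictionary.to_csv("wrangled_data_dictionary.csv", index=False)
```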

Watch:

EXCEL 102: Data Wrangling in Excel (youtube.com)

Read
Camm, J. D., Cochran, J. J., Fry, M. J., & Ohlmann, J. W. (2024). Business analytics
(5th ed.). Cengage Learning, pp. 152-190.
