Data Cleaning Checklist

Download as pdf or txt
Download as pdf or txt
You are on page 1of 1

Data Cleaning Checklist

Data cleaning takes up 80% of the data science workflow. Use this
checklist to identify and resolve any quality issues with your data

Checklist Examples in Action Potential Solutions

Data Constraints Problems

Data type constraints Example A revenue_usd column that is a string, and Convert to the correct data type
not a numeric data type.
Ensuring that different columns have
the correct data type before revenue_usd revenue_usd
beginning analysis.
$1,000 1000

$3,210 3210

Data range constraints Example A gpa column should be constrained


Check for typos, like a decimal
to [0.0, 4.0] point in the wrong place.
Ensuring that different columns have Drop rows where data points break
the correct range. This is especially gpa range constraints
the case for columns that have limits.
3.1 Set the data point that breaks
range constraints to the maximum,
4.0
or minimum
5.1 Treat the data point that breaks
range constraints to missing, and
impute it

Uniqueness Constraints Example A duplicate row where the name and Keep only one of the exact
phone_number columns are identical, but not duplicate rows
Ensuring that there are no exact or the height_cm column
Merge rows that have non-exact
almost exact duplicates within your duplicate rows
rows. name height_cm phone_number

Carl Rosseel 177 (555) 200-5598

Carl Rosseel 178 (555) 200-5598

Text and Categorical Data Problems

Membership constraints for Example Two different entries for “New York” in the Drop rows that are affected by
city column inconsistent categories
categorical data
Remap inconsistent categories to
Ensuring that categorical columns name city the correct category name
have correct and consistent Carl Rosseel New York Infer categories based on other
categories data points if it’s not clear how it
Sara Billen New York City
should be remapped

Length violation for text data Example A phone_number column that is 9 characters Drop rows that are affected by
instead of 14 length violation
Ensuring that text columns that follow Set affected observations to
a specific standard have the same name height_cm phone_number missing
string length
Carl Rosseel 177 (555) 200

Carl Rosseel 178 (555) 200-5598

Text data inconsistent Example A phone_number number column that Standardize formatting for affected
contains different phone number formats observations
formatting
Drop rows that are affected by the
Ensuring that text columns that follow name height_cm phone_number inconsistency
a specific standard have the same Carl Rosseel 177 (555) 200-5598
string formatting
Carl Rosseel 178 +1 555 2005598

Data Uniformity Problems

Unit uniformity for numeric Example A temperature column in Celsius that has Dropping rows where no context on
absurdly high or low temperature values units appears and don’t pass a
columns
sanity check

Ensuring that numeric columns have date city temperature Standardize the units where
the same units (Temperature being possible
05-18-2022 New York 27
Celsius, or Fahrenheit across all
observations. This is especially 05-18-2022 New York 80.6
relevant when joining datasets from
different countries or sources.) * please note that absurdly high or low temperature

values can be caused by other data quality issues, such

as sensor malfunctions

Unit uniformity for date Example A date column that contains dates with
Standardizing datetime formats
dd-mm-yyyy and mm-dd-yyyy formats where possible
columns
Dropping rows where no context on
Ensuring that date columns have the date birthday datetime format appears and don’t
same datetime format pass a sanity check
05-18-2022 Carl Rosseel

05-19-2022 Sara Billen

20-05-2022 sabella Leslie-Miller


I

Crossfield validation for Example Flight bookings per class do not add up to the Dropping rows where sanity checks
total recorded bookings fail
numeric columns
Apply rules from domain knowledge
Crossfield validation is when we use date economy first class total based on knowing the data
multiple fields in a dataset to ensure 05-18-2022 250 50 300
the validity of another. For example,
ensuring that part to whole columns 05-19-2022 200 50 200
add to a relevant total

Crossfield validation for date Example A date_of_birth column that doesn't Dropping rows where sanity checks
correspond with the age column fail
columns
Apply rules from domain knowledge
Ensuring that date and temporal name date_of_birth age based on knowing the data
columns pass sanity checks (for John Doe 02-07-1994 27
example, ensuring that webinar
registration dates always precede Jane Doe 10-12-2000 34
webinar attendance dates)

Missing Data Problems

Missing Completely at There is no observed relationship


Random Data between missing data and other
values within the dataset
W hen there is no systematic
relationship between missing values
and other values within the dataset

Drop missing rows


Missing at Random Data Missing census data from a specific region, because the
postal service doesn’t have full coverage in that region
W hen there is a systematic mpute missing rows with measures
I

relationship between missing data and Missing survey respondents by region of centrality such as median or
other observed values mean

mpute missing rows with


I

algorithmic, machine-learning
Regio
n A
Regio
n B
Regio
n C
Regio
n D based approaches

Collect new data points and


Missing Not at Random Data Temperature readings from a sensor missing because features
temperate was too low, or high
W hen there is a systematic
relationship between missing data and
other unobserved values Temperature (°C) —
Sorted lowest to highest

40 °C
Missing data after
40 degrees celsius

You might also like