Data Cleaning Checklist
Data Cleaning Checklist
Data Cleaning Checklist
Data cleaning takes up 80% of the data science workflow. Use this
checklist to identify and resolve any quality issues with your data
Data type constraints Example A revenue_usd column that is a string, and Convert to the correct data type
not a numeric data type.
Ensuring that different columns have
the correct data type before revenue_usd revenue_usd
beginning analysis.
$1,000 1000
$3,210 3210
Uniqueness Constraints Example A duplicate row where the name and Keep only one of the exact
phone_number columns are identical, but not duplicate rows
Ensuring that there are no exact or the height_cm column
Merge rows that have non-exact
almost exact duplicates within your duplicate rows
rows. name height_cm phone_number
Membership constraints for Example Two different entries for “New York” in the Drop rows that are affected by
city column inconsistent categories
categorical data
Remap inconsistent categories to
Ensuring that categorical columns name city the correct category name
have correct and consistent Carl Rosseel New York Infer categories based on other
categories data points if it’s not clear how it
Sara Billen New York City
should be remapped
Length violation for text data Example A phone_number column that is 9 characters Drop rows that are affected by
instead of 14 length violation
Ensuring that text columns that follow Set affected observations to
a specific standard have the same name height_cm phone_number missing
string length
Carl Rosseel 177 (555) 200
Text data inconsistent Example A phone_number number column that Standardize formatting for affected
contains different phone number formats observations
formatting
Drop rows that are affected by the
Ensuring that text columns that follow name height_cm phone_number inconsistency
a specific standard have the same Carl Rosseel 177 (555) 200-5598
string formatting
Carl Rosseel 178 +1 555 2005598
Unit uniformity for numeric Example A temperature column in Celsius that has Dropping rows where no context on
absurdly high or low temperature values units appears and don’t pass a
columns
sanity check
Ensuring that numeric columns have date city temperature Standardize the units where
the same units (Temperature being possible
05-18-2022 New York 27
Celsius, or Fahrenheit across all
observations. This is especially 05-18-2022 New York 80.6
relevant when joining datasets from
different countries or sources.) * please note that absurdly high or low temperature
as sensor malfunctions
Unit uniformity for date Example A date column that contains dates with
Standardizing datetime formats
dd-mm-yyyy and mm-dd-yyyy formats where possible
columns
Dropping rows where no context on
Ensuring that date columns have the date birthday datetime format appears and don’t
same datetime format pass a sanity check
05-18-2022 Carl Rosseel
Crossfield validation for Example Flight bookings per class do not add up to the Dropping rows where sanity checks
total recorded bookings fail
numeric columns
Apply rules from domain knowledge
Crossfield validation is when we use date economy first class total based on knowing the data
multiple fields in a dataset to ensure 05-18-2022 250 50 300
the validity of another. For example,
ensuring that part to whole columns 05-19-2022 200 50 200
add to a relevant total
Crossfield validation for date Example A date_of_birth column that doesn't Dropping rows where sanity checks
correspond with the age column fail
columns
Apply rules from domain knowledge
Ensuring that date and temporal name date_of_birth age based on knowing the data
columns pass sanity checks (for John Doe 02-07-1994 27
example, ensuring that webinar
registration dates always precede Jane Doe 10-12-2000 34
webinar attendance dates)
relationship between missing data and Missing survey respondents by region of centrality such as median or
other observed values mean
algorithmic, machine-learning
Regio
n A
Regio
n B
Regio
n C
Regio
n D based approaches
40 °C
Missing data after
40 degrees celsius