Data Considerations
Importance of data
[Figure: AI Hierarchy of Needs – simple ML algorithms and deep learning models sit at the top of a pyramid built on a foundation of data]
[Figure: ML project cycle – 1) Business understanding, 2) Data understanding, 3) Data preparation (split data, select features, prepare for modeling), 4) Modeling, 5) Evaluation, 6) Deployment, with customer feedback feeding back into the cycle]
Module 3 Objectives:
At the conclusion of this module, you
should be able to:
[Figure: supervised learning workflow – features and labels form the training data used to build a model; in real-time use, the model takes features as input and outputs a prediction]
Training Data: Features
• How to identify features?
– Subject matter experts
– Customers
– Temporal and geospatial characteristics
• How many features?
– Start with a small set and establish a
baseline
– Add more and evaluate impact
– Try everything logical – missing a key feature is much worse than having an extra one (see the sketch below)
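As a hedged illustration of the baseline-then-add approach above, the sketch below grows a feature set step by step and checks the cross-validated score after each addition; the model, metric, and the use of the Iris data are illustrative choices, not part of the slides.

```python
# Hypothetical example: grow the feature set incrementally and watch the
# cross-validated score to judge whether each addition helps.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True, as_frame=True)

# Start with a small feature set to establish a baseline, then add more.
feature_sets = [
    ["petal length (cm)"],                      # baseline
    ["petal length (cm)", "petal width (cm)"],  # add one feature, re-evaluate
    list(X.columns),                            # all available features
]

for features in feature_sets:
    scores = cross_val_score(LogisticRegression(max_iter=1000), X[features], y, cv=5)
    print(f"{features}: mean accuracy = {scores.mean():.3f}")
```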
Training Data: Labels
• For supervised learning, we need
labels/targets of what we are trying to
predict
• We may be able to source these labels or
may have to create them
• Our definition of the problem determines
the form of the labels
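To make "the problem definition determines the form of the labels" concrete, here is a small hypothetical sketch: the same customer records yield a binary churn label under a classification framing and a numeric target under a regression framing. All column names and thresholds are invented for illustration.

```python
import pandas as pd

# Hypothetical customer records; column names are made up for illustration.
records = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "days_since_last_order": [12, 200, 45],
    "total_spend": [340.0, 15.0, 88.5],
})

# Framing A: classification – label churn as "no order in the last 90 days".
records["churned"] = (records["days_since_last_order"] > 90).astype(int)

# Framing B: regression – predict spend, so the label is a number.
records["spend_target"] = records["total_spend"]

print(records)
```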
How much data?
• Generally, the more the better
• Orders of magnitude more observations
than number of features or labels
• Factors that influence data requirements:
– Number of features
– Complexity of the feature-target
relationships
– Data quality – missing and noisy data
– Desired model performance
How much data?
Example dataset sizes (observations):
• Iris flower dataset – 150
• CheXpert chest x-rays – 224,000
• ImageNet dataset – 14 million
• Google Gmail SmartReply – 238 million
• Google Translate – trillions
Data Collection
Sources of data
• Internal data
– Log files / user data
– Internal operations & machinery
• Customer data
– Sensors
– Operational systems & hardware
– Web data – forms, votes, ratings
• External data
– Weather, demographics, social media, etc.
Best practices in collecting data
• Collect data intentionally
– Only what you need
– Beware of bias
– Representative data
• Update data as needed
– Environment changes
– Re-train model periodically
• Document
– Sources & metadata
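One lightweight way to follow the "document sources & metadata" practice is to keep a small metadata record next to each dataset; the fields below are an assumed, illustrative schema rather than any standard.

```python
import json
from datetime import date

# Hypothetical metadata record kept alongside a raw data extract.
dataset_metadata = {
    "name": "customer_feedback_2024",
    "source": "web form exports (operational system)",
    "collected_on": str(date.today()),
    "owner": "data-engineering",        # who is responsible for stewardship
    "known_biases": "English-language responses only",
    "refresh_cadence": "monthly",       # how often the data should be updated
}

with open("customer_feedback_2024.meta.json", "w") as f:
    json.dump(dataset_metadata, f, indent=2)
```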
Data labeling
• Sometimes label data can be collected, other
times it must be created
• Possible sources of labels/targets
– Log files
– Customer records
– Sensor readings
– User input
• Methods of creating labels
– Manual creation
– Commercial data labeling services
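As a toy illustration of the manual-creation option (commercial labeling services work the same way at scale), the sketch below asks a person to label a few feedback snippets; the texts and the yes/no scheme are invented for illustration.

```python
# Hypothetical minimal manual-labeling loop for a handful of text snippets.
texts = ["great product", "arrived broken", "okay I guess"]
labels = []
for text in texts:
    answer = input(f"Is this feedback positive? (y/n) -> {text!r}: ")
    labels.append(1 if answer.strip().lower() == "y" else 0)

print(list(zip(texts, labels)))
```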
Collecting user data
• Many options for collecting user data:
– Forms
– User behavior
– Votes, rankings
• Ideally the data collection should:
– Be an integrated part of the user’s workflow
– Provide the user some benefit
• Creative examples:
– Google -> CAPTCHA
– StitchFix -> Keeping vs. returning
Flywheel effect
• Users generate data through interaction
with an AI-enabled system
• Data can be used to strengthen the AI and
open up further opportunities
• E.g. Amazon:
– Searches/purchases -> Reorder listings
– Purchases/ratings -> Personalized
recommendations
– Purchase records -> “Shoppers also
bought” (Co-occurrence matrix)
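A minimal sketch of the "shoppers also bought" idea: count how often items appear together in the same order. The purchase records are invented, and this is just one simple way to build a co-occurrence matrix.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase records: each order is a set of item IDs.
orders = [
    {"book", "lamp"},
    {"book", "pen", "lamp"},
    {"pen", "notebook"},
    {"book", "notebook"},
]

# Count how often each pair of items appears in the same order.
cooccurrence = defaultdict(int)
for order in orders:
    for a, b in combinations(sorted(order), 2):
        cooccurrence[(a, b)] += 1

# "Shoppers also bought": items most often purchased alongside 'book'.
also_bought = {pair: n for pair, n in cooccurrence.items() if "book" in pair}
for pair, n in sorted(also_bought.items(), key=lambda kv: -kv[1]):
    print(pair, n)
```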
Cold start problem
• If we are relying on user-supplied data for
our model, we may initially not have
enough to build a quality model
• This is particularly challenging with
recommendation systems, where we face
it with every new user
• We may consider starting with heuristics-
based approaches, or adding a calibration
step to gather data
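A sketch of one possible heuristics-based fallback: recommend globally popular items until a new user has enough interaction history to personalize. The data and the threshold are assumptions for illustration.

```python
from collections import Counter

# Hypothetical interaction history: user -> items they have engaged with.
history = {
    "alice": ["lamp", "book", "pen", "book", "notebook"],
    "bob": ["pen"],   # new-ish user
    "carol": [],      # brand-new user (cold start)
}

# Global popularity ranking used as the heuristic fallback.
popularity = Counter(item for items in history.values() for item in items)

MIN_EVENTS = 3  # assumed threshold before trusting a personalized model

def recommend(user, k=2):
    events = history.get(user, [])
    if len(events) < MIN_EVENTS:
        # Cold start: fall back to the most popular items overall.
        return [item for item, _ in popularity.most_common(k)]
    # Placeholder for a real personalized model once enough data exists.
    return [item for item, _ in Counter(events).most_common(k)]

print(recommend("carol"))  # popularity fallback
print(recommend("alice"))  # personalized (here: the user's own top items)
```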
Data Governance & Access
Dealing with data silos
• One of the key barriers to broader use of
ML is that data is often siloed and
inaccessible
– Each department collects and manages its
own data
– Enterprise systems store data in different
places using different schemas
• For a company just starting to use ML, it is
wise to focus first on breaking down the
silos
Dealing with data silos
Breaking down data silos requires:
1. Cultural change
– Executive sponsor
– Education and incentives
2. Technology
– Centralized data warehouse
– Data querying tools
3. Data stewardship and access
– Responsibility for data stewardship
– Make it easy to discover and access
Facebook: democratizing data
• From 2007 to 2010 the data collected by
Facebook was exploding, and so were
requests for it (10k/day)
• Transitioned to a Hadoop cluster which
made access difficult
• Developed Hive, which allows users to
query data using SQL
• Moved to self-service model and made
training available to all employees
• Hackathons provided employees
opportunities to find creative uses of data
Data Cleaning
Data cleaning
• Messy data can prevent the development of
quality models
• Data can have several issues:
– Missing data
– Anomalous data
– Incorrectly mapped data
Missing data
• Data can be missing for a number of
reasons:
– Users did not provide (web form)
– Mistakes in data entry / mapping
– Issues with sensors (power, comms, failures)
Types of missing data
• Missing Completely at Random (MCAR) – no pattern in the missing data and no association with the values of other attributes
• Missing at Random (MAR) – the probability of missing-ness relates to another feature of the data
• Missing Not at Random (MNAR) – the probability of missing-ness relates to the values of the feature itself
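A hedged sketch of how missingness might be inspected and handled with pandas; whether to drop or impute depends on which of the categories above applies, and the data and choices here are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor data with gaps in both columns.
df = pd.DataFrame({
    "temperature": [21.0, np.nan, 22.5, np.nan, 23.1],
    "humidity": [40, 42, np.nan, 39, 41],
})

# Inspect how much is missing in each column.
print(df.isna().mean())

# If the gaps are plausibly MCAR and rare, dropping rows may be acceptable.
dropped = df.dropna()

# Otherwise, a simple imputation (here: column median) is a common starting point.
imputed = df.fillna(df.median())
print(imputed)
```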
[Table: feature-selection approaches – statistical tests which rely on characteristics of the data only (filter); training a model on subsets of features (wrapper); extracting the features which contribute most to training of a model (embedded)]
[Figure: an ML project consists of data, pipeline code, and the ML model]
Data lineage
• Data lineage involves tracking data from source to consumption – how it was transformed and where it moved
• Benefits:
– Enables debugging
– Simplifies data migrations
– Inspires trust in data
– Meets compliance requirements
• Options for data lineage
– Commercial data lineage systems
– Spreadsheet / graph software
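As a minimal illustration of the spreadsheet/graph option above, lineage can be captured as simple source-to-destination records; the structure and names below are assumptions, not a standard format.

```python
from dataclasses import dataclass

@dataclass
class LineageStep:
    """One hop in a dataset's journey from source to consumption."""
    source: str
    transformation: str
    destination: str

# Hypothetical lineage for a feature table used by a model.
lineage = [
    LineageStep("crm.orders (raw extract)", "deduplicate + join customers", "staging.orders_clean"),
    LineageStep("staging.orders_clean", "aggregate to monthly features", "features.customer_monthly"),
    LineageStep("features.customer_monthly", "train/test split", "model training run 2024-06"),
]

for step in lineage:
    print(f"{step.source} --[{step.transformation}]--> {step.destination}")
```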
Data lineage
• Common to visualize data lineage as a set of maps at different levels
• Record information on sources, characteristics, relationships,
transformations and locations
[Figure: lineage maps – document the raw data sources, ETL operations, data pipelines, and how the data is used in the model]
Model versioning
• Along with data lineage and code versioning,
model versioning is critical
• Benefits:
– Track modeling experiments: code, data, model
config, results
– Track production model and revert if necessary
– Run champion/challenger model tests
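A hedged sketch of lightweight model versioning without any particular tool: store each trained artifact under a version tag along with a small metadata file tying it to its data and configuration, so a production model can be traced or reverted. File names and fields are illustrative assumptions.

```python
import json
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

VERSION = "v3"  # hypothetical version tag

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the artifact and the context needed to reproduce or revert it.
joblib.dump(model, f"model_{VERSION}.joblib")
with open(f"model_{VERSION}.meta.json", "w") as f:
    json.dump({
        "version": VERSION,
        "training_data": "iris (150 observations)",
        "config": {"model": "LogisticRegression", "max_iter": 1000},
        "train_accuracy": model.score(X, y),
    }, f, indent=2)
```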