Data Products: The Ultimate How-To Guide
Design and ship valuable data products your
company will actually use.
You can be expansive in the type of asset that is defined as a data product
(is it an A/B testing platform, a multi-layered data platform, a data set, who
cares?), but you must be exacting in the criteria (which we delve into in
Chapter 2).
But these challenges can also be an opportunity for savvy data teams.
Here’s how.
Data as a product is a lens and discipline that can help data teams create
more scalable, reliable data systems. This comes with valuable benefits
such as:
Increased trust — Most business stakeholders will trust the data until
they have a reason not to, but it only takes a few instances of
conflicting or missing data to lose that benefit of the doubt. An
increased focus on data quality ensures you retain data trust, which
is important because, simply put, you cannot create a data-driven
organization without it.
A way to measure the quantitative value of the data team — Too many
data teams are evaluated based on a feeling. Stakeholders "feel like
they are doing a good/bad job." By treating data like a product, you
will be able to provide hard metrics to demonstrate how your team is
performing and contributing to the business. That never hurts when
asking for more resources.
The concept of a data product raises the bar for the availability, reliability, and
usability of enterprise data, but it also represents a profound shift in
perspective.
Data engineers need to measure data quality and data downtime just like
our friends in software engineering who have addressed application
downtime with specialized frameworks such as service level agreements,
indicators, and objectives (SLAs, SLIs, and SLOs).
Why go through the process of codifying SLAs at all, if you don’t have a
customer pressuring you to commit to certain thresholds in a contract?
Why not just count on everyone to do their best and shoot for as close to
100% uptime as possible? Isn’t this just introducing unnecessary red tape?
Not at all. The very practice of defining, agreeing upon, and measuring key
attributes of what constitutes reliable software (or in our case data
pipelines) can help engineering, product, and business teams align on what
actually matters most and prioritize incoming requests.
With SLAs, different engineering teams and their stakeholders can be
confident they’re speaking the same language, caring about the same
metrics, and sharing a commitment to clearly documented expectations.
So how do you go about setting effective SLAs? For this, let’s turn to the
real-world experience of Brandon Beidel, director of product management
(data) at Red Ventures.
His first step was to develop a solution for measuring data quality, which in
his case was to leverage a data observability solution with automated
monitoring, alerting, and reporting for data incidents. This is a critical step
in setting realistic SLAs and improving on them: you can't improve what
you can’t measure.
He met with every business team on a weekly cadence and, without ever
using the term "SLA," started having discussions around data quality and
how it impacted their day-to-day work.
The conversation was framed around simple business terms, how the data
was being used, and the overall “who, what, when, where, and why.”
Specific questions included:
He then created an SLA template for his data team to create and uphold
SLAs across the business (the examples in the template below are ours)
and started tracking, in a dashboard, what percentage of SLAs were being
met by each data warehouse.
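To make this concrete, here is a minimal sketch of what evaluating one entry of such an SLA template might look like in code. The table name, consumer, and six-hour threshold are our hypothetical stand-ins, not values from Red Ventures' actual template.

```python
from datetime import datetime, timezone

# Hypothetical SLA entry, loosely mirroring the kind of fields such a
# template captures: the table, who depends on it, and how fresh it must be.
sla = {
    "table": "analytics.daily_revenue",      # assumed table name
    "consumer": "Finance weekly forecast",   # assumed stakeholder
    "freshness_threshold_hours": 6,          # agreed-upon expectation
}

def meets_freshness_sla(last_loaded_at: datetime, threshold_hours: int) -> bool:
    """Return True if the table was updated within the agreed window."""
    age = datetime.now(timezone.utc) - last_loaded_at
    return age.total_seconds() <= threshold_hours * 3600

# last_loaded_at would come from your warehouse's load metadata; aggregating
# the pass/fail results per warehouse produces the "% of SLAs met" view.
```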
Assigning Ownership
With the creation of data SLAs you now know who on the business side is
using and impacted by your data. The next step is to create ownership and
accountability within the data team at the organizational, project, and
pipeline/table levels.
Let’s start with creating accountability at the organizational level. So, who
in your data organization owns the reliability piece of your data ecosystem?
As you can imagine, the answer isn’t simple. From your company’s CDO to
your data engineers, it’s ultimately everyone’s responsibility to ensure data
reliability. And although nearly every arm of every organization at every
company relies on data, not every data team has the same structure, and
various industries have different requirements.
For instance, it’s the norm for financial institutions to hire entire teams of
data governance experts, but at a small startup, not so much.
Notably, the brunt of any bad choices made here is often borne by the BI
analysts, whose dashboards may wind up containing bad information or
breaking due to uncommunicated changes. In very early data organizations,
these roles are often combined into a jack-of-all-trades data person or a
product manager.
Project Ownership
Now, let’s turn our attention to the project level. Just like every product will
have a product manager, every data product should have a data product
manager (one data product manager can oversee multiple products).
This is one of the most overlooked roles on the modern data team, but it’s
essential according to Wendy Turner-Williams, the Chief Data Officer at
Tableau.
We also believe data product managers are critical to tackle tasks that build
long-term value, but aren’t in the purview of the typical day-to-day work of
the data engineer. This includes acquiring user feedback, gathering team
requirements, evaluating new technologies, building feature roadmaps,
and more.
For example, in one of his former roles, Atul Gupte defined the product
strategy and direction for Uber’s data analytics, data knowledge, and data
science platforms. Specifically, he led a project to improve the
organization's data science workbench, making it easier for the data
scientists who used it to collaborate.
Whereas a traditional engineering project lead may have tried to add more
virtual machines or extend the project timeline, Atul researched multiple
solutions and identified virtual GPUs (then an emerging technology) as a
possible solution.
While there was a high price tag, Atul justified the expenditure with
leadership: the project was not only going to save the company millions,
but also support a key competitive differentiator.
This proactive approach allowed Uber to start building the foundation they
would need to leverage GPUs immediately upon availability. Time to value
was greatly accelerated, a hallmark of a good data product manager.
Pipeline/Table Ownership
For most organizations, ownership at this level falls to the data engineer
who built, or has the most experience with, the pipeline or table that is
experiencing the issue. That is a
great system, but it needs to be transparent and well documented across
the data team and with relevant stakeholders.
One of the simplest ways to do this is to identify your key tables (the ones
that are populating your most important dashboards or that have the most
read/writes) and label the owner. Data observability and data catalog
solutions can both help identify key tables as well as serve as a reference
point for ownership.
Monte Carlo allows owners to be assigned to tables along with other tags.
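As a rough illustration of the "identify and label" step, the sketch below ranks tables by query volume from a hypothetical query log and attaches an owner tag. Every table, user, and email here is an assumption; in practice a data observability or catalog tool would surface this for you.

```python
from collections import Counter

# Hypothetical query log entries (table, querying user) pulled from your
# warehouse's query history.
query_log = [
    ("analytics.daily_revenue", "looker_service"),
    ("analytics.daily_revenue", "ml_training"),
    ("raw.events", "etl_service"),
]

# Hypothetical owner assignments maintained by the data team.
owners = {
    "analytics.daily_revenue": "jane.doe@company.com",
}

# Rank tables by read volume to find candidates for "key table" status,
# and make unlabeled key tables stand out.
read_counts = Counter(table for table, _ in query_log)
for table, reads in read_counts.most_common():
    owner = owners.get(table, "UNASSIGNED")
    print(f"{table}: {reads} reads, owner={owner}")
```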
Speaking of documentation…
Without good documentation, all ad-hoc questions start coming your way,
and instead of a data engineer, you become a data catalog. (You also need
good documentation for governance, compliance, and quality use cases,
which we cover in subsequent sections, but here we're going to focus on
documentation that enables self-service.)
The challenge is your ability to pipe data is virtually limitless, but you are
constrained by the capacity of humans to make it sustainably meaningful.
In other words, you can never document all of your data, and rarely will you
have as much documentation as you’d like.
So how can you solve this problem? In one word: focus. In many words:
Fight data sprawl — More data doesn't simply translate into more or
better decisions; in fact, it can have the opposite effect. Ensure your
data ecosystem prioritizes usability and organization. Some
strategies include gathering data consumer needs upfront and piping
only that data, creating a consumer-facing warehouse kept tidy by a
layer of analytics engineers, or deprecating legacy data sets.
To build a great data product, you need to factor in data compliance and
data governance, two slightly different topics with equal importance.
Governance
Governance remains vital for data products, perhaps even more so as data
volumes rise. However, for these initiatives to succeed they need to be
iterative, domain focused, and decentralized.
Compliance
The bottom line for data products, and for data leaders in general, is that
regulators require companies to say what they do, do what they say, and
then prove it.
Understand the data regulations to which the data product is beholden
(they often vary by country, industry, and even municipality) and document
the compliance process. Most data regulations will focus on how personally
identifiable information is acquired (chain of consent), how it is secured
and stored (more data sovereignty laws are passed every year), and whether
it is accessible only to those with a legitimate business need.
Our customers in particular find this feature useful for tagging personally
identifiable information (PII) data. For example, tagging a column with
phone numbers as PII = “Phone Number.” Other helpful compliance
features include Dynamic Data Masking and Row Access Policies.
Certifying Data Sets
Phew! We’ve done a lot of work so far to make our data product reliable and
accessible. But how do we let our consumers know that the data sets
powering them can be used with a higher degree of certainty to fuel more
critical and valuable decisions and automations?
Certification helps with a few things. For one, as previously noted, it
helps data consumers understand the level of trust they can have in the
data. It also prevents data users from the facepalming mistake of
leveraging the wrong asset, what we call the "You're using THAT table?!?"
problem.
Finally, it can help with focus (notice a theme yet?). Treating every table or
data asset with the same amount of rigor is a recipe for burnout and
disaster.
Data Contracts
Data breaks bad in all sorts of ways, but one of the most frequent data
quality issues arises when software engineers push service feature updates
that change how data is produced. Unbeknownst to the software engineer,
there is an entire data ecosystem leveraging that data and their new update
has gummed up the works.
And how could they know? In most cases, software engineers and other
upstream stakeholders do not have a comprehensive list of key data fields
and how they are being used across the business. This is where data
contracts come into play.
Data contracts are similar to an API in that they serve as documentation and
version control to allow the consumer to rely on the service without fear it
will change without warning and negatively impact their efforts.
It starts with the data team using Jsonnet to define their schemas,
categorize the data, and choose their service needs. Once the JSON file is
merged in GitHub, dedicated BigQuery and PubSub resources are
automatically deployed and populated with the requested data via a
Kubernetes cluster.
For example, one of our own data contracts, abridged to show just two
fields, follows this pattern.
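As a purely illustrative sketch (not the actual contract, and written as plain Python rather than the Jsonnet the team uses), a two-field contract might capture a schema, a data category, and the requested service needs. Every name and setting below is a hypothetical example.

```python
# Hypothetical data contract, abridged to two fields. In practice this
# would be defined in Jsonnet and compiled to JSON before being merged.
contract = {
    "dataset": "checkout_events",        # assumed dataset name
    "version": 2,                        # bumped on breaking changes
    "fields": [
        {
            "name": "order_id",
            "type": "STRING",
            "required": True,
            "category": "internal",      # data categorization
        },
        {
            "name": "customer_email",
            "type": "STRING",
            "required": False,
            "category": "pii",           # drives masking and access rules
        },
    ],
    # Service needs: where the data should land and how fresh it must be.
    "services": {
        "bigquery_dataset": "analytics_checkout",
        "pubsub_topic": "checkout-events-v2",
        "freshness_slo_hours": 1,
    },
}
```

Explicit versions and required flags are what give downstream consumers the API-like guarantee described above.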
When you are a startup, sometimes the focus is on customer usage and
satisfaction rather than a hard ROI. Make customers happy and
monetization will come later. It can be the same for small companies or
young data teams: using adoption and data consumers as your north star
for your data product will never lead you too far astray.
That being said, data teams need to hold themselves accountable to the
same data-driven KPIs that their dashboards are driving for the rest of the
company. The act of planning is valuable in itself, but demonstrating the
value of a data product is important for obtaining resources and investing
them wisely.
Don’t forget you wrote down to talk to your People team about hiring a data
product manager after reading the ownership section of this chapter.
We recommend leveraging two core sets of KPIs. The first would be specific
to the data and platform. These can measure:
Let’s start with the breadth of customers and applications. Apply the same
level of rigor to your measurement of platform usage as you would a
customer-facing product. Measure the footprint and speed of adoption, and
run surveys to understand their satisfaction. While you should keep these
surveys very short, they can also be an opportunity to gauge the drivers of
satisfaction, such as trust in the data and tools and ease of use. A
scorecard of survey results might look like this:
You also want to determine if your data product is falling behind, keeping
pace with, or forging ahead of the business needs. A scorecard might look
like this:
At the highest level, you must be able to quantify the impact driven by the
data product, either in monetary terms ($, €, £) or a common unit of
measurement that determines business success (e.g. sales, subscriptions).
Precision is not the goal here – you can often tell if your platform bets are
paying off by the order of magnitude of the value generated.
The second set of KPIs should be shared goals, or how the data team has
helped achieve business level objectives. For example, if you agree with
engineering, product, and marketing that onboarding is a pain point, you
can decide to build goals and KPIs around making it easier for new
customers to get started.
By aligning the company around the shared goal of reducing new tool
onboarding from five days to three days, for instance, you could begin to
address the problem holistically: your data team gathers metrics on usage
and helps build A/B tests, while your engineering team modifies the
product, and your marketing team creates nurture campaigns. This is what
it looks like to define and execute against a company-wide goal.
These goals also need to align with the culture of the business. For
example, some organizations are going to be excited about how the data
product has accelerated processes and others are going to be excited
about how it has mitigated risk.
The solution? The now widely adopted concept of DevOps, an approach
that mandates collaboration and continuous iteration between developers
(Dev) and operations (Ops) teams during the software development and
deployment process.
Test: Testing your data to make sure it matches business logic and
meets basic operational thresholds, such as uniqueness or no null
values (see the sketch below).
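As a minimal sketch of what such tests might look like, assuming a pandas pipeline and a hypothetical orders table with an order_id key, the checks below assert non-null values and uniqueness before data is published.

```python
import pandas as pd

def run_basic_tests(df: pd.DataFrame, key_column: str) -> list[str]:
    """Return a list of failed checks for basic operational thresholds."""
    failures = []
    if df[key_column].isnull().any():
        failures.append(f"{key_column} contains null values")
    if df[key_column].duplicated().any():
        failures.append(f"{key_column} contains duplicate values")
    return failures

# Hypothetical usage inside a pipeline step: fail loudly before publishing.
orders = pd.DataFrame({"order_id": ["a1", "a2", "a2"], "amount": [10, 20, None]})
problems = run_basic_tests(orders, "order_id")
if problems:
    raise ValueError(f"Data tests failed: {problems}")
```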
Data products are a relatively new concept, and there are a lot of
misconceptions around them. That's why it's important to clarify a few
things we intentionally didn't specify, because in our opinion they aren't
necessary criteria for creating reliable, scalable, well-adopted data systems.
An external data product is any data asset that either faces or impacts a
customer. That can range from a dataset that is used in the customer billing
process to a completely separate data intensive application with its own UI
providing insights to a customer.
One of the hottest trends in data right now involves companies creating
data applications or adding an additional layer within their SaaS product to
help customers analyze their data. For example, a point-of-sale SaaS
company might provide data back to its customers to help them better
understand their sales trends.
We are a data-intensive SaaS application that monitors, alerts, and provides
lineage within our own UI. We also provide insights reports back to
customers within our UI, as well as the option to surface them within their
own Snowflake environment using the Snowflake data share integration.
In this latter case, we’re just providing the building blocks for customers to
be able to further customize how they’d like to visualize or combine it with
other data.
But developing an external data product is a cut above: both in value added
and in level of difficulty. It’s a different motion that requires your team to
build new muscle memory.
It’s a new way of thinking and requires elevated levels of coordination,
discipline and rigor.
That isn’t to say it can’t be done by the same team, or that your internal
data consumers can’t receive the same level of service as your external
customers.
Let’s dive into a few places where the bar is raised for external data
products before focusing on perhaps the single most important
differentiating factor: architecture.
User Expectations
Let’s say you have a data product that your user can usually trust to help
answer some of their questions. The data is refreshed every day and the
dashboard has some clickable elements where they can drill in for a more
granular examination of the details.
That might be enough for some internal users. They can get their job done
and their performance has even improved from when they didn’t have
access to your slick dashboards.
On the other hand, your external users are pissed. They want to trust your
product implicitly and have it answer all of their questions, all of the time,
and in real-time.
And why shouldn’t they be upset? After all, they are paying for your product
and they could have gone with a competitor. When data is the product, data
quality is product quality.
Yes Derek, data IS the product. No, the files aren’t actually IN the computer.
This simple fact is why some of the most enthusiastic adopters of our data
observability platform are leveraging it to support their data applications.
When it comes to scale and speed, external customers never want to wait on
the data, and they want more data dimensions so they can slice and dice to
their heart's content.
For example, one of our financial services customers has not just been
focused on data freshness, but on data latency, or in other words the ability
to load and update data in near real-time while also supporting queries.
Another customer, advertising platform Choozle, found additional data
dimensions helpful for their upgraded platform:
“Snowflake gave us the ability to have all of the information available to our
users. For example, we could show campaign performance across the top
20 zip codes and now advertisers can access data across all 30,000 zip
codes in the US if they want it,” said Chief Customer Officer Adam Woods.
User expectations for external data products, and the number of data
dimensions provided, are high.
ROI
The vast majority of data teams are not evaluated against a hard return on
investment. While we advocate building KPIs and assessing value as part of
your internal products, it is an absolute necessity to be capable of
accurately measuring value when building an external data product.
Product managers need to understand how to price it, and it must be
profitable (at some point). They will need to know the startup costs for
building the product as well as how much each component costs in
providing the service (cost of goods).
This can be challenging for data teams that haven’t had the pleasure of
building internal chargeback models for their data products that can
differentiate, track, and charge customers according to scale of use.
Self-Service
“A-ha!” you say. “Our team already allows our internal consumers to self-
service, this is nothing new.”
That may be true, but the bar has been raised for self-service and usability
too.
Your external customers can’t Slack you to ask questions about the data or
how you derived this customer’s likelihood to churn was, “3.5 out of 5
frowny faces.” The data product can’t be a black box–you need to show
your work.
Is 3.5 frowny faces good? Why are they green, shouldn’t it be red? How is
this calculated? I need more context from this hypothetical product!
How the user consumes and interfaces with your external data product
requires a second thought as well. For many data teams the answer for their
internal data product is, “…and then it’s surfaced in Looker.”
When you build your internal data products, it’s often slow going at first as
you gather requirements, build, and iterate with business stakeholders.
After that, teams are often off and running to the next project. There will be
patches and fixes for data downtime or maybe to meet internal SLAs if
you’re fancy, but on the whole you aren’t refactoring those dashboards
every quarter.
You need to know it’s coming and build for it however. For example, Toast
is extremely focused on the efficiency of their processes.
"Not only do we listen to the business needs and obviously support them,
but we also look internally and address scalability,” said Toast data
engineer Angie Delatorre. “If a job used to take one hour, and now it takes
three hours, we always need to go back and look at those instances, so that
shapes our OKRs as well.
“First, get all of your data in one place with the highest fidelity. Just get the
raw data in there. Second, come up with repeatable pipelines for getting
data to your analysts. You don’t want to go back to the raw data every time
you want to do something.”
Former Uber data product manager Atul Gupte has discussed how critical it
is when iterating data products to understand, “How to prioritize your
product roadmap, as well as who you need to build for (often, the
engineers) versus design for (the day-to-day platform users, including
analysts).”
Architecture Considerations
External data products, like internal products, can leverage a wide variety
of data cloud services to serve as the foundation of their platform including
data lakes or warehouses.
Multi-Tenancy
Many, however, will leverage a solution like Snowflake for its ability to
optimize how relational data can be stored and queried at scale. Nothing
new there.
What is new is that this will likely be your team’s first discussion about
multi-tenant architectures. That is a big change and decision point when
serving external customers.
Of course, you can make other architecture choices across your stack to
mitigate the trade-offs that come with multi-tenancy.
That means data never leaves their environment. PII and sensitive data are
abstracted away, and what we pull are the non-sensitive logs and metric
aggregations needed to assess their data systems' health.
For example, Snowpark now supports Python and larger-memory instances
for machine learning. A developer could use Python in Snowpark to build a
predictive application that lets the marketing team experiment with
different advertising channels and see how each would impact their ROI.
They could then use Streamlit to build the application and visualization of
their data science model for their consumers, before publishing and
monetizing it on the Snowflake Marketplace.
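A minimal sketch of what that kind of Streamlit front end might look like follows. The channel list and the prediction function are placeholders of our own; the real Snowpark-built model is not shown.

```python
import streamlit as st

st.title("Ad channel ROI explorer")

# Hypothetical inputs; in practice the options and the model would come
# from the Snowpark-built predictive application described above.
channel = st.selectbox("Advertising channel", ["Search", "Social", "Display"])
budget = st.slider("Monthly budget ($)", 1_000, 100_000, 10_000, step=1_000)

def predict_roi(channel: str, budget: int) -> float:
    """Placeholder for the real model; returns a made-up multiplier."""
    base = {"Search": 1.8, "Social": 1.4, "Display": 1.1}[channel]
    return base * (1 - budget / 1_000_000)  # toy diminishing-returns curve

st.metric("Predicted ROI", f"{predict_roi(channel, budget):.2f}x")
```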
Choozle
That equation changed and a data quality issue arose when Choozle
released its massively powerful unified reporting capability, which allows
users to connect external media sources.
Adam ruled out open source testing solutions due to the associated
maintenance costs.
“I understand the instinct to turn to open source, but I actually have a lower
cost of ownership with a tool like Monte Carlo because the management
burden is so low and the ecosystem works so well together. After one phone
call with the Monte Carlo team, we were connected to our data warehouse,
and we had data observability a week later,” said Adam.
“I love that with Snowflake and Monte Carlo my data stack is always up-to-
date and I never have to apply a patch. We are able to reinvest the time
developers and database analysts would have spent worrying about
updates and infrastructure into building exceptional customer
experiences.”
Monte Carlo gave the Choozle team deeper visibility into data product
issues that otherwise may not have been proactively caught.
“We see about 2 to 3 real incidents every week of varying severity. Those
issues are resolved in an hour whereas before it might take a full day,” said
Adam. “When you are alerted closer to the time of the breakage it’s a
quicker cognitive jump to understand what has changed in the
environment.”
The Choozle team realized immediate results when they recently brought a
new stream of media into production. The team had missed a field
containing the primary key that uniquely identifies each table record.
“We expect that field to not be null 100% of the time. We started getting a
handful of campaigns, about .2%, where that value was null,” said Adam.
“That led us to look at the situation and address it before anyone noticed
there were certain campaigns with certain views not seeing any data. That
type of problem could have magnified over time and filled tables with all
kinds of junk. With Monte Carlo, our time-to-detection in this case was
accelerated from days to minutes.”
Adam’s advice for other data professionals is to be bold: the risk of
innovation has become greatly mitigated over time with the rise of cloud
data warehouses like Snowflake and SaaS platforms like Monte Carlo.
“A decade ago when I was running warehouses on-prem, it was very hard to
innovate. Every answer to a new suggestion was a no because you don’t
want to break the whole warehouse. I encourage my developers to be more
aggressive. You aren’t going to break Snowflake by writing a bad query,
and if you did we would know it right away,” said Adam. “It’s the same with
Monte Carlo. If you already have Snowflake, there is almost no risk in trying.
The cost to get set up was the salary for three people for an hour and we
saw value immediately."
Toast
Toast is a leading point of sale provider for restaurants and a recent unicorn
with a thoughtful data team focused on building out their data platform.
The Toast data team began showing how their data platform and products
provide value to the business by first understanding the business problems
affecting their colleagues.
“Our director gets our team involved early into these problems and helps us
understand how we’re going to solve it, how we’re going to measure that
we’re solving it correctly, and how we’re going to start to track that data—
not just now but also in the future,” said Noah Abramson, Data Engineering
Manager, Toast.
“[Our process was originally] super centralized, and we owned the entire
stack,” explained Noah. “As the company started to grow, it got
overwhelming. We pivoted to a self-service model, and we became the
team that you would consult with as you were building these dashboards
and owning the data.”
At Toast, the data platform team owns the company’s external-facing data
insights and analytics.
“One of our big value ads [as an organization] is giving business insights to
our customers: restaurants,” says Noah. “How did they do over time? How
much were their sales yesterday? Who is their top customer? It’s the data
platform team’s job to engage with our restaurant customers.”
“We say our customers are all Toast employees,” he says. “We try to
enable all of them with as much data as possible. Our team services all
internal data requests from product to go-to-market to customer support to
hardware operations.”
It’s thus Noah’s team’s job to build out data flows into overarching systems
and help stakeholders across the organization derive insights from tools
including Snowflake and Looker.
“We really listen to what the business needs,” says Noah regarding how his
team thinks about measuring data-related KPIs. “At the top level, they
come up with a couple different objectives to hit on: e.g., growing
customers, growing revenue, cutting costs in some spend area.”
Noah and his team then take these high-level business objectives and use
them to build out Objectives and Key Results (OKRs).
"We're able to do that in a few ways," says Noah. "If you think about growing
the customer base, for example, we ask, ‘how do we enable people with
data to make more decisions?’ If somebody has a new product idea, how do
we play with that and let them put it out there and then measure it?”
“There’s a lot of moving parts,” says Noah of the data stack. “There’s a lot
of logic in the staging areas and lots of things that happen. So that kind of
begs the question, how do we observe all of this data? And how do we make
sure when data gets to production and Looker that it is what we want it to
be, and it’s accurate, and it’s timely, and all of those fun things that we
actually care about?”
Originally, Noah and two other engineers spent a day building a data
freshness tool they called Breadbox. The tool could conduct basic data
observability tasks including storing raw counts, storing percent nulls,
ensuring data would land in the data lake when expected, and more.
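A rough sketch of the kind of checks a homegrown tool like Breadbox might run is shown below. This is our illustration, not Toast's actual code: it records a row count and a percent-null figure and flags data that hasn't landed by an expected hour.

```python
from datetime import datetime, timezone
import pandas as pd

def collect_freshness_metrics(df: pd.DataFrame, expected_by_hour_utc: int) -> dict:
    """Compute simple observability metrics for one table load."""
    now = datetime.now(timezone.utc)
    return {
        "row_count": len(df),
        "percent_null": float(df.isnull().mean().mean() * 100),  # average across columns
        "landed_on_time": now.hour <= expected_by_hour_utc,      # crude landing check
        "checked_at": now.isoformat(),
    }

# Hypothetical usage: store these metrics per table so you can alert on
# sudden drops in row counts or spikes in null rates.
batch = pd.DataFrame({"order_id": ["a1", "a2"], "amount": [10.0, None]})
print(collect_freshness_metrics(batch, expected_by_hour_utc=6))
```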
“That was really cool,” notes Noah, “but as our data grew, we didn’t keep
up. As all of these new sources came in and demanded a different type of
observation, we were spending time building the integration into the tool
and not as much time in building out the new test for that tool.”
Once the team reached that pivotal level of growth, it was time to consider
purchasing a data observability platform rather than pouring time and
resources into perfecting its own.
“With Monte Carlo, we got the thing up and running within a few hours and
then let it go,” says Noah. “We’ve been comparing our custom tool to
Monte Carlo in this implementation process. We didn’t write any code. We
didn’t do anything except click a few buttons. And it’s giving us insights
that we spent time writing and building or maintaining.”
Fox
Fox is more than just a news, media and sports powerhouse; they’re also
behind one of the media industry’s most advanced data architectures.
Here’s how the early adopters of AWS Redshift, Kinesis and Apache Spark
are democratizing data across the organization with data observability in
mind.
Data often gets stuck in silos, with requests backing up in the ticket
queues of overworked data engineers and analysts struggling to serve the
needs of their entire organization.
Here's how the Fox data team built a hybrid data architecture that
prioritizes democratization and access, while ensuring reliability and trust
at every turn.
“If you think about a centralized data reporting structure, where you used to
come in, open a ticket, and wait for your turn, by the time you get an
answer, it’s often too late,” Alex said. “Businesses are evolving and
growing at a pace I’ve never seen before, and decisions are being made at a
blazing speed. You have to have data at your fingertips to make the correct
decision.”
To accomplish this at scale, Alex and his centralized data team control a
few key areas: how data is ingested, how data is kept secure, and how data
is optimized in the best format to be then published to standard executive
reports. When his team can ensure data sources are trustworthy, data is
secure, and the company is using consistent metrics and definitions for
high-level reporting, it gives data consumers the confidence to freely
access and leverage data within that framework.
Under Alex’s leadership, five teams oversee data for the Fox digital
organization: data tagging and collections, data engineering, data
analytics, data science, and data architecture. Each team has its own
responsibilities, but everyone works together to solve problems for the
entire business.
“I strongly believe in the fact that you have to engage the team in the
decision-making process and have a collaborative approach,” said Alex.
“We don’t have a single person leading architecture—it’s a team chapter
approach. The power of the company is, in essence, the data. But people
are the power of that data. People are what makes that data available.”
(It’s worth noting, however, that this decentralized approach won’t work for
every organization, and the needs of your team structure will vary based on
the SLAs your company sets for data.)
I’ve been in data for over a decade, and I can say in no uncertain terms
that Fox has one of the most robust and elegant data tech stacks that I’ve
ever seen. But Alex is adamant that data leaders shouldn’t pursue shiny
new tech for its own sake.
The Fox data team built their tech stack to meet a specific need: enabling
self-service analytics. “We embarked on the journey of adopting a
lakehouse architecture because it would give us both the beauty and
control of a data lake, as well as the cleanliness and structure of a data
warehouse.”
Several types of data flow into the Fox digital ecosystem, including
batched, micro-batched, streaming, structured, and unstructured. After
ingestion, data goes through what Alex refers to as a “three-layer cake”.
“First, we have the data exposed at its raw state, exactly how we ingest it,”
said Alex. “But that raw data is often not usable for people who want to do
discovery and exploration. That’s why we’re building the optimized layer,
where data gets sorted, sliced-and-diced, and optimized in different file
formats for the speed of reading, writing, and usability. After that, when we
know something needs to be defined as a data model or included in a data
set, we engage in that within the publishing layer and then build it out for
broader consumption within the company. Inside of the published layer,
data can be exposed via our tool stack.”
The optimized layer makes up the pool of data that Alex and his team
provide to internal stakeholders under the “controlled freedom” model.
With self-serve analytics, data users can discover and work with data
assets that they already know are trustworthy and secure.
“If you don’t approach your data from the angle that it’s easy to discover,
easy to search, and easy to observe, it becomes more like a swamp,” said
Alex. “We need to instill and enforce some formats and strict regulations to
make sure the data is getting properly indexed and properly stored so that
people can find and make sense of the data.”
To Make Analytics Self-Serve, Invest in Data Trust
For this self-serve model to work, the organization needs to be confident
that the data is accurate, reliable, and trustworthy. To help achieve this
goal, the entire data stack is wrapped in QA, validation, and alerting. Fox
uses Monte Carlo to provide end-to-end data observability, along with
Datadog, CloudWatch alerts, and custom frameworks to help govern and
secure data throughout its lifecycle.
“Data observability has become a necessity, not a luxury, for us,” said Alex.
“As the business has become more and more data-driven, nothing is worse
than allowing leadership to make a decision based upon data that you don’t
have trust in. That has tremendous costs and repercussions.”
Alex estimates that the Fox digital organization receives data multiple
times a day from over 200 sources. They process nearly 10,000 schemas
and tens of billions of records per week. “You can’t scale the team to
maintain and support and validate and observe that amount of data. You
have to have at least a few tools at your disposal. For us to make sure that
we have trust in the data’s timeliness, completeness, and cleanliness, tools
like Monte Carlo are 'must-to-have.' It's been a great addition to allow us
to build an AI-powered overview of what’s happening in our data stack.”
The continual monitoring and alerting Monte Carlo provides, along with
automated data lineage, helps Alex’s team to be more proactive about data
incidents when they do occur. “We can catch the issues before they hit
production and if they do, we know the level of impact by using reverse-
engineering to see how many and what kind of objects have been involved,
and we can stop it in-flight before it causes a massive impact downstream.
It all comes with trust—the moment you drop transparency or start hiding
things, people lose trust and it’s really hard to regain it back. I’ve learned
that no matter what happens, if you’re being honest and you’re owning the
problem, people tend to understand and give you another chance to fix it.”
With the right tech, the right people, and the right processes in place, Alex
and his teams have earned the trust required to build a self-serve data
platform that powers decisions on a daily basis.
While this guide may have read like a lot of work and a list of reasons why
you shouldn’t build a data product, we hope it helps demystify the
challenges associated with this daunting but worthwhile endeavor.
You likely will not build the perfect data product on your first sprint (few do),
but we encourage you to build, ship, iterate, rinse, and repeat.
You also don’t have to undergo this ordeal alone. If you are interested in
improving data quality and building awesome data products your company
will actually use, talk to us. We’ve navigated this journey with hundreds of
data teams and can help ensure you reach your final destination.
Additional Resources
Don’t let this be the ending point of your data quality journey! Check out
more helpful resources including:
Data Downtime Blog: Get fresh tips, how-tos, and expert advice on
all things data.
Data Observability Product Tour: Check out this video tour showing
just how a data observability platform works.