5 Best Practices For Data Warehouse Development
5 Best Practices For Data Warehouse Development
5 Best Practices For Data Warehouse Development
Cloud technology has revolutionized how businesses store, access, In this ebook, we outline five recommendations for putting structure
and analyze data. Whether your organization is creating a new data around your data strategy and getting alignment across your
platform from scratch or re-engineering a legacy data warehouse business, so the data warehouse you create meets both current and
system to take advantage of new capabilities, a handful of guidelines future business needs. These best practices for data warehouse
and best practices will help ensure your project’s success. Some of development will increase the chance that all business stakeholders
those best practices may seem obvious, but all too often, businesses will derive greater value from the data warehouse you create, as well
fail to spend time up front setting and documenting these decision as lay the groundwork for a broader enterprise data platform that
points, resulting in headaches and inefficiency down the road. can grow and adapt as your business needs change.
3
1. CREATE A
DATA MODEL
The first key step in any data program is to
create a data model: an abstract representation
City Lived
that organizes elements of data and describes # * Year
seeks to fulfill.
4
The data model also provides clear documentation of
the content, context, and sources. This makes it easier
to audit or to comply with new data requirements,
such as those presented by GDPR (the EU’s General
Data Protection Regulation framework that sets
guidelines for the collection and processing of
personal information from individuals).
Having a strong data model also helps prevent
confusion and costly reengineering down the road. It’s
always a good idea to incorporate a source-agnostic
integration layer that enables analysis across multiple
data sets based on the data sets’ commonalities.
A data warehouse brings together many different
sources and types of data, including traditional data
sets such as customer relationship management
(CRM) data and enterprise resource planning (ERP)
data, as well as data sets like blogs, Twitter feeds, IoT
data, and even data sets that have yet to be invented.
This is why having a flexible integration layer that isn’t
too tightly tied to any single system will help future-
proof your data warehouse.
A highly effective data model should employ definitions
and semantic structures that are defined by the
business domain, not by the specific definitions of any
single source system. For example, one CRM system
may refer to customers as “cust,” while another
refers to “cust_ID.” Establishing a business-wide
semantic rule for how users should name, access,
and analyze that data across data sets is key to the
data warehouse’s success.
Figure 2: An example data model using the Data Vault modeling approach
5
As a company goes through changes, mergers, and Each architecture has its advantages, but the choice
acquisitions, the CRM system it is using today may of which to adopt will depend on the business needs of
be replaced by a different CRM system. If your data the organization.
model is tightly coupled to a specific source system,
Frankly, more important than which architecture
then you will have to do a lot of reengineering to
your organization selects is that it selects,
integrate the second source system that replaces the
documents, and continually supports this
legacy system. A source-agnostic integration layer
architecture as part of developing a data model
makes data mapping much easier, so you can swap an
for the warehouse. Doing so will enable future
old source system for a new source system without
efficiency, allowing for a single support and
affecting downstream reports or having to change
troubleshooting methodology that will make it easier
user behavior.
for new team members to ramp up more quickly.
Within the data model, it’s critical to select a standard
approach. The main types of data modeling standards
used in data warehouse design include:
• 3NF: 3NF, which stands for “third normal form,”
is an architectural standard designed to reduce
the duplication of data and ensure referential
integrity of the database.1
• Star schema: The simplest and most widely used
architecture to develop data warehouses and
dimensional data marts, the star schema consists
of one or more fact tables referencing any number
of dimension tables.²
• Data Vault (DV): Developed specifically to address
agility, flexibility, and scalability issues found in
other approaches, DV modeling was created as a
granular, non-volatile, auditable, easily extensible,
historical repository of enterprise data. It is highly
normalized and combines elements of 3NF and
star models.³
6
2. ADOPT AN AGILE
DATA WAREHOUSE
METHODOLOGY
In the past, data warehouse (or even data mart) Within the Agile worldview, a variety of approaches To leverage the benefits of Agile development more
creation was a large, monolithic, multi-quarter, have emerged to help deliver value faster, including: fully, an Agile data platform is very helpful. Cloud-
or multi-year effort, subject to the traditional • Scrum—Named for the rugby formation in which
based data platforms provide that structural flexibility
forwards interlock arms and advance, Scrum is and elasticity, enabling rapid scaling as business
“waterfall” process. In the modern age, that’s
the most widely used process framework for Agile needs evolve. Cloud-based data platforms require
no longer the norm as many organizations are less effort, maintenance, and administration to be
development. A lightweight framework, Scrum
choosing to adopt a more flexible and iterative, useful, and they can grow and adapt to changing
emphasizes daily communication and the flexible
or Agile, design approach. business requirements. By leveraging a modern cloud
reassessment of plans that are carried out in short,
iterative phases of work.5 Ralph Hughes codified service, teams can spend less time tuning queries
With business needs changing faster than ever,
Scrum’s application to data warehousing in a series and provisioning storage and more time addressing
and new data sources coming online more quickly,
of seminal works that are useful to businesses immediate business challenges and delivering
businesses need to be able to adapt and leverage these
adopting this approach. business value.
inputs concisely and rapidly. That means learning to
build data and analytic solutions in an incremental • Kanban—Kanban is a method for managing the
Leveraging Agile methodologies and structures is no
and Agile fashion. With proper planning that aligns to creation of products with an emphasis on small undertaking. It requires a cultural commitment
a single source-agnostic integration layer, large data continuous delivery without overburdening within the organization and is often a significant
projects can now be broken down into smaller pieces the development team. Like Scrum, Kanban is a shift in mindset and workflow from traditional data
that can be delivered more frequently, thus providing process designed to help teams work together warehousing workflows. Retooling an IT team to work
incremental business value much more quickly. more effectively. Named for the “Kanban” cards comfortably in an Agile environment can take six to 12
that track production in a factory, Kanban was months, which may seem paradoxical given the Agile
Data warehousing architects are adopting the Agile
created by Taiichi Ohno, an industrial engineer at methodology’s goal of delivering value more quickly.
methodology, which first appeared in the software
Toyota, to improve manufacturing efficiency. This transition can be accelerated by engaging with
development world, to achieve this goal. In the Agile
a seasoned Agile coach. But once the shift is made,
methodology, requirements and solutions evolve • BEAM—BEAM, or Business Event Analysis and teams can begin to deliver new incremental changes to
through the collaborative effort of self-organizing and Modelling, was introduced by Lawrence Corr the data warehouse in weeks, instead of months.
cross-functional teams and customers. When applied and Jim Stagnitto in their groundbreaking work,
to data warehouse conception and construction, the Agile Data Warehouse Design. BEAM focuses on
Agile methodology enables businesses to activate business events, rather than on known reporting
new data sets and solve new business challenges requirements, to model the whole business
more quickly.4 process area. It leverages seven dimensional
types (the seven Ws: who, what, when, where,
how, how many, and why) to identify and then
elaborate on business events.6
7
3. FAVOR ELT
OVER ETL
In the past, data warehousing development In the ELT approach, raw data is extracted from
took an extract-transform-load (ETL) approach, the source and loaded, relatively unchanged, into
extracting the data to be imported into the the staging area of the data warehouse. Metadata,
load date, or source information may be added to
data warehouse from the source systems,
the data, and then it is brought directly into the
cleaning it or applying business rules to it on data warehouse. Once inside the data warehouse,
an external server, and then loading it into the businesses can use the power of the database to
target data warehouse. Increased data platform perform transformations, whether that’s changing the
computing power and capabilities have yielded structure of the data (that is, applying a data model),
a new preferred approach: extract-load- applying business rules, or performing data quality
transform (ELT). measures to cleanse the data (for example, correcting
incomplete addresses, standardizing data field names,
and resolving duplicates).
The ELT approach has two distinct advantages: cost
savings and greater traceability. ELT helps realize
cost savings as it allows businesses to leverage
the power of the data platform to transform
data, instead of using an external server. Cloud-
based computing power is typically much less
expensive than performing transformations and
data manipulation on an external server, so moving
data to the cloud directly is faster and cheaper. The
ELT approach also makes it easier to audit and trace
the data in the future, because it provides an image
of the original source data directly within the data
platform. In this way, the data warehouse itself can
play the role of what has come to be known as a
“data lake,” where raw data is stored persistently.
8
4. ADOPT AN
AUTOMATION TOOL
The goal of the data warehouse is to activate Automation allows businesses to leverage their
and deliver data more quickly so it can inform resources more fully, iterate faster, and enforce
business decisions and drive greater value. One coding standards more easily. It enables the creation
of standardized code, which is incredibly useful in
way to increase speed of delivery is to adopt
organizations where the ETL code and data models
the Agile methodology. Another is to adopt were traditionally developed by hand. Automation
automation tools that can help develop and provides a documented standard for these different
deploy code more quickly. Because many data artifacts, as well as an enforcement and quality
warehouse methodologies are pattern-based, assurance (QA) mechanism to monitor that all
the coding required for loading and structuring developers and designers are following that standard.
data is often repeatable, which means it can be Automation tools that use templates to generate
automated. A number of tools on the market code are especially helpful, because they enforce
automate some or even all of the design and standards by making them the preferred properties
build tasks, and the list grows daily. within the templates themselves. This makes
onboarding faster, as new developers and designers
will use these standard tools, guaranteeing consistent
implementation and shortening the learning curve.
A consistent implementation has the added benefit
of being easier to test and debug because code is
developed using the same standards.
Iteration also becomes faster by using these tools,
because automated code generators tend to not make
syntax errors. Updating code typically means adding
a new object to the tool or changing the templates’
properties at the global level, generating new code
that is immediately available for deployment in the
environment for testing and validation.
9
5. TRAIN YOUR STAFF
ON NEW APPROACHES
A move to the Agile methodology or Many industry resources are available to help
automated code development isn’t just a shift manage the transition to the Agile methodology.
in skill sets—it’s a shift in mindset. Training The Agile Alliance, a global nonprofit member
organization dedicated to promoting the concepts
and education are required to ensure the
of Agile software development as outlined in the
team is leveraging these new approaches Agile Manifesto, offers many training options for
and technologies effectively. This may mean introducing Agile concepts. The Scrum Alliance
bringing in external experts to train teams on offers certifications and training for foundational
the Scrum best practices or educating teams and advanced Scrum training. Likewise, Data Vault
on the benefits, rules, and best practices for bootcamp and certification is offered by selected
whatever standard architecture the business partners through the Data Vault Alliance.
has adopted for its data platform. As with any new process and cultural change,
organizations should manage the adoption curve to
ensure a consistent and effective shift to the new
approach in day-to-day operations. Identifying pilot
or proof-of-concept projects for initiating the teams
to the new approaches will ensure practitioners build
and master the skills in protected, yet real, scenarios
that will accelerate competence and abilities in these
new skills.
10
SUMMARY
All of the best practices outlined in this ebook As business requirements change and the desire to
require an upfront investment to achieve the gain more value from even more data and data types
long-term business value they can deliver. continues to accelerate, having these best practices
in place will allow you to think and grow beyond
But, the return on that investment is twofold:
the traditional data warehousing use cases. With
It will lay the foundation for a successful data a solid foundation and agile platform, you will be
analytics program at the outset and accelerate able to expand into new data realms and meet new
the successful delivery of incremental business demands by broadening the program to support data
value to your data environment long after the science, machine learning, AI, and maybe even data
first production release. monetization. With today’s flexible and scalable cloud
resources, there really is no limit to what you can
achieve with your data.
11
ABOUT SNOWFLAKE
Snowflake enables every organization to mobilize their data with Snowflake’s Data Cloud. Customers use the Data Cloud to unite siloed data,
discover and securely share data, and execute diverse analytic workloads. Wherever data or users live, Snowflake delivers a single data experience
that spans multiple clouds and geographies. Thousands of customers across many industries, including 543 of the 2022 Forbes Global 2000
(G2K) as of October 31, 2022, use Snowflake Data Cloud to power their businesses.
© 2023 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature and service names mentioned herein
are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries. All other brand names or logos mentioned or used
herein are for identification purposes only and may be the trademarks of their respective holder(s). Snowflake may not be associated with, or be
sponsored or endorsed by, any such holder(s).
CITATIONS
1
en.wikipedia.org/wiki/Third_normal_form 4
Agiledata.org/essays/dataWarehousingBestPractices.html
2
en.wikipedia.org/wiki/Star_schema 5
scrum.org/resources/what-is-scrum
3
snowflake.com/blog/data-vault-modeling-and-snowflake 6
bisystembuilders.com/beam