Snowflake To Lakehouse Migration Assessment 5-23


Migration Assessment & Strategy Workshop
Tech leaders are to the right of the Data Maturity Curve
From hindsight to foresight

[Figure: Data Maturity Curve. Horizontal axis: Data + AI Maturity. Vertical axis: Competitive Advantage. Region labels: Supporting the business; Powering the business.]
Stages along the curve, in order of increasing maturity:
• Clean Data
• Reports
• Ad Hoc Queries
• Data Exploration
• Predictive Modeling
• Prescriptive Analytics
• Automated Decision Making
Questions along the curve: What happened? What will happen? How should we respond? Automatically make the best decision.
Why not use a cloud data warehouse?
Data Warehouses can’t fully support all your data

[Diagram labels: Structured, Semi-Structured, Unstructured; DATA LAKE (ADLS, S3, GCS)]
• Limited support for unstructured data (audio/images/video)
• Data is replicated
• Lock-in / proprietary format
Why not use a cloud data warehouse?
Data Warehouses are inefficient with data transformation

[Diagram labels: Structured, Semi-Structured, Unstructured; DATA LAKE (ADLS, S3, GCS)]
• Up to 6x the cost for ELT workloads
• Not optimized for Data Engineering
• Complex & many stages
• Limited support for streaming
• Limited support for unstructured data (audio/images/video)
• Data is replicated
• Lock-in / proprietary format
Why not use a cloud data warehouse?
Pay a premium for all workloads

[Diagram labels: Structured, Semi-Structured, Unstructured; DATA LAKE (ADLS, S3, GCS); ELT; BI Reports, Dashboards & SQL]
• Up to 6x the cost for ELT workloads
• Compute cost for all data access
• Not optimized for Data Engineering
• Complex & many stages
• Incompatible security and governance models
• Limited support for streaming
• Limited support for unstructured data (audio/images/video)
• Data is replicated
• Lock-in / proprietary format
Why not use a cloud data warehouse?
A cloud data warehouse is NOT a modern data platform

[Diagram labels: Structured, Semi-Structured, Unstructured; two DATA LAKE copies (ADLS, S3, GCS); ELT; BI Reports, Dashboards & SQL; Data Science; Model Training; Model Scoring; Model Deployment; Model Serving]
• Up to 6x the cost for ELT workloads
• Compute cost for all data access
• Not optimized for Data Engineering
• Complex & many stages
• Incompatible security and governance models
• Limited support for streaming
• Not optimized for Data Science
• Limited support for unstructured data (audio/images/video)
• Data is duplicated
• Data is replicated
• Lock-in / proprietary format
• Disparate tooling decreases data team productivity
The Databricks Lakehouse Platform

✓ Single source of truth for all your data
✓ End-to-end ETL and Streaming capabilities
✓ High performance BI on your data lake
✓ First-class AI/ML capabilities and support
✓ Open, unified governance and security

[Platform diagram, top to bottom:]
• Integrated and collaborative role-based experiences with open APIs
• Workloads: Data engineering, Streaming, BI & SQL, Data science & ML
• Common security, governance, and administration
• Data processing and management built on open source and open standards
• Cloud Data Lake: structured, semi-structured, and unstructured data
Simple Migration Process

Snowflake → Databricks
• Inefficient DE, ETL, Data Sharing (Snowpipe / Snowpark, Partner Integrations) → Delta Lake / Live Tables / Sharing (Industry-leading data engine, multiple language support)
• Limited Real-Time Event Processing (Batch Processing in Snowpipe) → Databricks Jobs / Delta Lake / Structured Streaming (Spark Structured Streaming + Delta Lake: Streaming + Batch ingest)
• Slower, More Expensive SQL / BI (SQL UI, BI Connectors) → Databricks SQL (Native SQL support, SQL Endpoints, Optimized BI Integrations)
• Non-Native DS / ML (Dataiku, DataRobot, External Snowpark Scripts) → Databricks Machine Learning / MLflow (Data native, first-class DS / ML platform, multiple language support)
• Required Partner Tools (Filling in the Gaps) → Open Architecture (Easy integration with all required tools, access to open-source community)
Migration Methodology

Phase 1, Discovery: Migration discovery and consultation
Phase 2, Assessment: Assessment, Design, Tooling, Accelerators, Sizing, Partners
Phase 3, Strategy: Technology mapping, migration workshop, overall migration planning
Phase 4, TurnKey Delivery Proposal: Reference implementation of a specific production use case, migration implementation plan
Phase 5, Execution: Migration execution and support

Delivery: Databricks Migration Team with/without Partner; Databricks Partner Driven; Databricks PS Driven (Assurance Package to assist SIs)
Discovery & Assessment

Architectural Discovery - Snowflake
Begin with a review of the customer’s current Snowflake architecture

❏ ETL Environment
    ❏ Third-party Tools? (Fivetran, Talend, dbt, etc.)
    ❏ Data velocity
    ❏ Data Types
❏ BI Processes
    ❏ BI Tools
    ❏ Report/Dashboard requirements
❏ ML Use Cases
    ❏ Scale of use cases
    ❏ Model requirements (languages, libraries, compute)
❏ Use our Snowflake Profiler to understand current workloads; Partner Analyzers for deeper dive

Leave behind: Snowflake Migration Scoping Questionnaire
Pointers for discovery of the current Snowflake architecture

• Understand the landscape
    • Existing architecture
    • Pain points? (ETL, DSML, streaming, etc.)
    • Upstream and downstream teams / partners
• Consider the TCO of their architecture
    • Inefficiencies? Data movement or copying?
• Look for unserved use cases or teams
    • Current use cases not possible in Snowflake’s Data Cloud
    • Future use cases you could serve in the Lakehouse
The Snowflake Profiler is an important step
The Snowflake Profiler is a notebook that runs in your environment to answer core questions:

● What is the breakdown of my Snowflake usage by category?
● Where are these workloads running? (Which warehouse, which users?)
● How do we expect these costs to grow over time?
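
The three questions above come down to a grouped aggregation and a growth projection. The following is a minimal sketch of that arithmetic only, not the Snowflake Profiler itself: the row layout, the sample values, and the helper names `breakdown` and `project` are all hypothetical, and the projection assumes a constant compound monthly growth rate.

```python
from collections import defaultdict

# Hypothetical usage rows: (warehouse_name, user_name, category, credits_used).
# In practice these would come from Snowflake usage metadata; values are made up.
SAMPLE_USAGE = [
    ("ETL_WH", "svc_loader", "ETL", 120.0),
    ("ETL_WH", "svc_dbt", "ETL", 80.0),
    ("BI_WH", "analyst_a", "BI", 60.0),
    ("DS_WH", "scientist_b", "ML", 40.0),
]


def breakdown(rows, key_index):
    """Sum credits by the column at key_index; return each group's share of the total."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[3]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {key: credits / grand_total for key, credits in totals.items()}


def project(monthly_credits, monthly_growth_rate, months):
    """Compound-growth projection: credits * (1 + rate) ** months (an assumed model)."""
    return monthly_credits * (1 + monthly_growth_rate) ** months


by_category = breakdown(SAMPLE_USAGE, 2)   # usage by category
by_warehouse = breakdown(SAMPLE_USAGE, 0)  # where workloads run, by warehouse
by_user = breakdown(SAMPLE_USAGE, 1)       # where workloads run, by user
```

Grouping by a column index keeps the sketch short; with real data, a DataFrame group-by would serve the same purpose.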

Reach out to your Partner Account Manager, Sales Leader, or email [email protected] to
engage your Databricks counterparts.
Align on Target Architecture
Designing a Well Architected Lakehouse

Guiding Principles for the Lakehouse

• Curate Data and Offer Trusted Data-as-Products
• Adopt an Organization-wide Data Governance Strategy
• Remove Data Silos and Minimize Data Movement
• Encourage the Use of Open Interfaces and Open Formats
• Democratize Value Creation through Self-Service Experience
• Build to Scale and Optimize for Performance & Cost
Cloud Data Analytics Framework

Personas: Data Engineer, ML Engineer, Data Scientist, Business Analyst, Business Partners
Consumption: ETL & DS tools, BI Tools
Engine: Workflow Mgmt; Ingest & Transform; Advanced Analytics, ML & AI; Data Warehouse; Data Sharing
Governance: Data Governance
Storage: Cloud Storage
Cloud Data Analytics Framework

Personas: Data Engineer, ML Engineer, Data Scientist, Business Analyst, Business Partners
Consumption: IDE support, ETL & DS tools, Notebooks, SQL Editor, BI Tools
Engine (Databricks Runtimes, Photon):
• Workflow Mgmt: Workflows (Jobs, DLT)
• Ingest & Transform: Auto Loader, DLT, Connectors; ETL runtime; Batch and Streaming; Data Quality (DLT)
• Advanced Analytics, ML & AI: ML runtime, Model Serving
• Data Warehouse: SQL Warehouse
• Data Sharing: Delta Sharing
Governance: Unity Catalog
Storage: Amazon S3, ADLS, Google Cloud Storage; Proprietary DWH Storage
Supported files: Delta, Parquet, JSON, CSV, Avro, Images, ...
Databricks Lakehouse Reference Architecture
The Databricks Lakehouse Platform can support your core data workloads

[Reference architecture diagram, including an optional DW]
Different Data Models on the Lakehouse
The Lakehouse supports any data model

Data Models are organizational processes. You cannot buy a process!

However, the right platform capabilities will ease the implementation of a particular data model:
● Databricks Lakehouse has the technical enablers for teams to produce and consume data in a centralized or decentralized but governed way
● Lakehouse is a polyglot technology that works with any data modeling concept
● Lakehouse applies at all scales (startups to large orgs)
● The Medallion architecture can fit into whatever strategy you want

BRONZE: Replica of Source, mainly for landing, archiving, re-processing and Lineage Purposes
SILVER: More normalized (2NF-like) / Integration Layer / Data Vault-like / Write-optimized
GOLD: Star schema / Kimball / Read-Optimized
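
To make the three Medallion layers concrete, here is a minimal plain-Python sketch. It assumes in-memory lists stand in for Delta tables, and the sample records and the helper names `to_silver` and `to_gold` are hypothetical. On the platform the same steps would normally run as Spark or DLT pipelines.

```python
from collections import defaultdict

# Bronze: replica of the source, kept as-is for archiving and reprocessing.
# Sample records are made up for illustration.
bronze = [
    {"order_id": "1", "customer": " alice ", "amount": "10.50"},
    {"order_id": "2", "customer": "BOB", "amount": "4.00"},
    {"order_id": "2", "customer": "BOB", "amount": "4.00"},   # duplicate delivery
    {"order_id": "3", "customer": "alice", "amount": "n/a"},  # unparseable amount
]


def to_silver(rows):
    """Integration layer: type, clean, and deduplicate bronze rows."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # dropped here; a real pipeline would likely quarantine it
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Read-optimized layer: revenue aggregated per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze is never modified, so silver and gold can be rebuilt from it whenever the cleaning or aggregation rules change.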


A Well Architected Lakehouse
Dimensional Modeling

1. Staging
   a. Raw data in its original format (temporarily)
2. Ingestion
   a. Raw data converted to Delta (from Avro, CSV, Parquet, XML, JSON format in Landing)
3. Integration - Physical data model
   a. Detailed information covering multiple subject areas
   b. Integrates all data sources
   c. Does not necessarily use a dimensional model but feeds dimensional models
4. Data Mart
   a. Subset of the Integrated layer, sometimes aggregated
   b. Focus on dimensional modeling with star schema
   c. Typically oriented to a specific business line or team

[Diagram: Landing (Autoloader; IoT, Social Media) → Staging: raw data (temp.) → Ingestion: raw data (bronze) → Enterprise ODS / Integration: Physical Data Model (silver) → Presentation: Data Marts with dimensional models for the Product Domain and the Customer Sales Domain (gold)]
[Diagram: Dimensional model (star schema): an Order fact table surrounded by Customer, Product, and Time dimensions]

Implementing Data Modeling Techniques in the Databricks Lakehouse Platform
Data Modeling Best Practices in the Databricks Lakehouse Platform
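
The star schema from the slide (an Order fact with Customer, Product, and Time dimensions) can be sketched with Python's built-in `sqlite3` module. SQLite is only a stand-in here so the example runs anywhere. The table names, column names, and sample rows are hypothetical, and on the Lakehouse these would be Delta tables in the gold layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: descriptive attributes, one row per entity.
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time     (time_key     INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact: one row per order line, holding measures plus foreign keys to the dimensions.
CREATE TABLE fact_order (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    time_key     INTEGER REFERENCES dim_time(time_key),
    quantity     INTEGER,
    amount       REAL
);

-- Illustrative sample rows.
INSERT INTO dim_customer VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO dim_product  VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_time     VALUES (202301, 2023, 1), (202302, 2023, 2);
INSERT INTO fact_order   VALUES
    (1, 1, 202301, 2, 20.0),
    (1, 2, 202302, 1, 15.0),
    (2, 1, 202302, 3, 30.0);
""")

# A typical read-optimized query: join the fact to its dimensions and aggregate.
rows = conn.execute("""
    SELECT c.name, t.month, SUM(f.amount)
    FROM fact_order f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_time t     ON t.time_key = f.time_key
    GROUP BY c.name, t.month
    ORDER BY c.name, t.month
""").fetchall()
```

Each query joins the fact table to only the dimensions it needs, which is what makes this layout read-optimized for BI tools.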

A Well Architected Lakehouse
The Data Vault 2.0

1. Staging
   a. Raw data in its original format (temporarily)
2. Ingestion
   a. Raw data converted to Delta (from Avro, CSV, Parquet, XML, JSON format in Landing)
3. Integration - Raw Vault: Data is modeled as
   a. Hubs (unique business keys)
   b. Links (relationships and associations)
   c. Satellites (descriptive data)
4. Integration - Business Vault: Tables with applied business rules, data quality rules, cleansing and conforming rules
   a. Business views
   b. Point-in-Time (PIT) tables (opt.)
   c. Bridge tables are created on top of the business vault (opt.)
5. Presentation - Information Marts
   a. Similar to a classical Data Mart with data that has been cleansed and harmonized
   b. Consumer-oriented models (typically views)

[Diagram: Landing → Staging: raw data (temp.) → Ingestion: raw data (bronze) → Enterprise ODS / Integration: Raw Vault (Hub, Link, Satellite) and Business Vault (PIT, Bridge, Business Views) (silver) → Presentation: Information Mart (SQL Views) (gold), connected by ETL/ELT]
[Diagram: Data Vault 2.0 model: Customer, Product, and Order hubs connected by links, each with satellites]
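
The Raw Vault split into hubs, links, and satellites can be illustrated with a short plain-Python sketch. It makes several assumptions: dictionaries stand in for tables, surrogate keys are MD5 hashes of normalized business keys (a common Data Vault 2.0 convention, though other hash functions are also used), and load dates and record sources are left out for brevity. The sample rows and the helper `hash_key` are hypothetical.

```python
import hashlib


def hash_key(*business_keys):
    """Deterministic surrogate key built from one or more normalized business keys."""
    normalized = "||".join(str(key).strip().upper() for key in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Illustrative source rows: one customer ordering two products.
source_rows = [
    {"customer_id": "C1", "product_id": "P9", "customer_name": "Alice"},
    {"customer_id": "C1", "product_id": "P7", "customer_name": "Alice"},
]

hub_customer = {}  # hub: hash key -> unique business key
hub_product = {}
link_order = {}    # link: hash key -> hash keys of the hubs it relates
sat_customer = {}  # satellite: hub hash key -> descriptive attributes

for row in source_rows:
    customer_hk = hash_key(row["customer_id"])
    product_hk = hash_key(row["product_id"])
    hub_customer[customer_hk] = row["customer_id"]
    hub_product[product_hk] = row["product_id"]
    link_order[hash_key(row["customer_id"], row["product_id"])] = (customer_hk, product_hk)
    sat_customer[customer_hk] = {"customer_name": row["customer_name"]}
```

Because each key is derived only from business keys, hubs, links, and satellites can be loaded independently and in parallel, with no lookups against other tables.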

A Well Architected Lakehouse
The Data Mesh

1. Data Domain ownership: The lakehouse provides an open, flexible architecture that allows for distributed ownership of data assets for each domain
2. Data as a product: Delta provides an open and standard format for FAIR data, and Delta Live Tables allow for high quality, reliable data pipelines
3. Self-service infrastructure platform: Databricks is a unified platform that can automate data processes through the use of Workflows, Terraform, and other tools
4. Federated governance: Unity Catalog ensures global data discovery, access, and lineage

* Batch process with CDF, DLT, … ** currently lineage is restricted per
