Microsoft Fabric - James Serra - Public


Microsoft Fabric

A unified analytics solution for the era of AI

James Serra
Industry Advisor
Microsoft, Federal Civilian
[email protected]
6/16/23
About Me
▪ Microsoft, Data & AI Solution Architect in Microsoft Federal Civilian
▪ At Microsoft for most of the last nine years as a Data & AI Architect, with a brief stop at EY
▪ In IT for 35 years, worked on many BI and DW projects
▪ Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
▪ Been perm employee, contractor, consultant, business owner
▪ Presenter at PASS Summit, SQLBits, Enterprise Data World conference, Big Data Conference
Europe, SQL Saturdays, Informatica World
▪ Blog at JamesSerra.com
▪ Former SQL Server MVP
▪ Author of book “Deciphering Data Architectures: Choosing Between a Modern Data Warehouse,
Data Fabric, Data Lakehouse, and Data Mesh”
My upcoming book
First two chapters available now: Deciphering Data Architectures (oreilly.com)
Table of contents
- Foundation
- Big data
- Types of data architectures
- Architecture Design Session
- Common data architecture concepts
- Relational Data Warehouse
- Data Lake
- Approaches to Data Stores
- Approaches to Design
- Approaches to Data Modeling
- Approaches to Data Ingestion
- Data Architectures
- Modern Data Warehouse (MDW)
- Data Fabric
- Data Lakehouse
- Data Mesh Foundation
- Data Mesh Adoption
- People, Process, and Technology
- People and process
- Technologies
- Data architectures on Microsoft Azure
Agenda
▪ What is Microsoft Fabric?
▪ Workspaces and capacities
▪ OneLake
▪ Lakehouse
▪ Data Warehouse
▪ ADF
▪ Power BI / DirectLake
▪ Resources

▪ Not covered:
▪ Real-time analytics
▪ Spark
▪ Data science
▪ Fabric capacities
▪ Billing / Pricing
▪ Reflex / Data Activator
▪ Git integration
▪ Admin monitoring
▪ Purview integration
▪ Data mesh
▪ Copilot
Microsoft Fabric does it all—in a unified solution
An end-to-end analytics platform that brings together all the data and analytics tools that
organizations need to go from the data lake to the business user

▪ Data Integration: Data Factory
▪ Data Engineering: Synapse
▪ Data Warehouse: Synapse
▪ Data Science: Synapse
▪ Real Time Analytics: Synapse
▪ Business Intelligence: Power BI
▪ Observability: Data Activator

Unified data foundation: OneLake

Unified: SaaS product experience, security and governance, compute and storage, business model
Microsoft Fabric
The data platform for the era of AI
Single…
▪ Onboarding and trials
▪ Sign-on
▪ Navigation model
▪ UX model
▪ Workspace organization
▪ Collaboration experience
▪ Data Lake
▪ Storage format
▪ Data copy for all engines
▪ Security model
▪ CI/CD
▪ Monitoring hub
▪ Data Hub
▪ Governance & compliance

[Diagram: Data Factory, Synapse Data Engineering, Synapse Data Science, Synapse Data Warehousing, Synapse Real Time Analytics, Power BI, and Data Activator share common layers: AI Assisted, Shared Workspaces, Universal Compute Capacities, OneSecurity, and OneLake (the intelligent data foundation)]
SaaS
“it just works”

5 seconds to signup, 5 minutes to wow

▪ 5x5: Frictionless onboarding; Instant provisioning; Quick results w/ intuitive UX
▪ Success by Default: Minimal knobs; Auto optimized; Auto integrated
▪ Centralized administration: Tenant-wide governance; Centralized security management; Compliance built-in


Old vs New
Understanding Microsoft Fabric / FAQ
• Think of it as taking the PBI workspace and adding a SaaS version of Synapse to it
• You will wake up one day and PBI workspaces will be automatically migrated to Fabric workspaces: PBI capacities will become Fabric capacities. Your PBI tenant will have the Fabric workloads automatically built in
• Aligned to backend Fabric capacity. Similar to Power BI capacity: a specific amount of compute is assigned to it. A universal bucket of compute. No more Synapse DWUs, Spark clusters, etc.
• Serverless Pool and Dedicated Pool combined into one: no more relational storage or dedicated resources. Everything is serverless. All about data lakehouse
• No Azure portal, subscriptions, creating storage. User won’t even realize they are using Azure
• Fabric has strong separation between the person who buys and pays the bill and the person who builds stuff. In Azure, the person building the solution also has to have the power to buy
• This is not just for departmental use. It’s not PaaS services (i.e., Synapse) vs Fabric. Fabric is the future. Fabric is going to run your entire data estate: departmental projects as well as the largest data warehouses, data lakehouses and data science projects
• One platform for enterprise data professional and citizen developer (next slide)
Data Engineers
• Execute faster with the ability to spin up a Spark VM cluster in seconds, or configure with familiar experiences like Git DevOps pipelines for data engineering artifacts
• Streamline your work with a single platform to build and operate real-time analytics pipelines, data lakes, lake houses, warehouses, marts, and cubes using your preferred IDE, plug-ins, and tools.
• Reduce costly data replication and movement with the ability to produce base datasets that can serve data analysts and data scientists without needing to build pipelines

Data Scientists
• Quickly tune a custom model by integrating a model built and trained in Azure ML in a Spark notebook
• Work faster with the ability to use your preferred data science frameworks, languages, and tools
• Bypass engineering dependencies with the ability to use your preferred no-code ML Ops to deploy and operate models in production
• Tap into proven-at-scale models and services to accelerate your AI differentiation (AOAI, Cognitive Services, ONNX integration, etc).

Data Analysts
• Avoid slow, progress-stagnating data wrangling by seamlessly triggering a workflow that can unlock data engineering tools and capabilities quickly.
• Accelerate your work with visual and SQL based tools for self-serve data transformations and modeling as well as self-serve tools for reporting, dashboards, and data visualizations
• Turn data into impact with industry-leading BI tools and integration with the apps your people use every day like Microsoft 365

Data Citizens
• Make more data-driven decisions with actionable insights and intelligence in your preferred applications
• Maintain access to all the data you need, without being overwhelmed by data ancillary to your role thanks to fine-grained data access management controls

Hand-offs between personas: serve data via warehouse or lakehouse; serve transformed data; serve insights via embedding

Supporting experiences: Data Factory, Data Warehouse, Data Engineering, Real-time analytics, Data Science, Azure ML, Power BI, Microsoft 365

Data Stewards
• Maintain visibility and control of costs with a unified consumption and cost model that provides evergreen spend optics on your end-to-end data estate
• Gain full visibility and governance over your entire analytics estate from data sources and connections to your data lake, to users and their insights
Workspaces and capacities
Company examples
Create fabric capacity

Capacity is a dedicated set of resources reserved for exclusive use. It offers dependable,
consistent performance for your content. Each capacity offers a selection of SKUs, and
each SKU provides different resource tiers for memory and computing power. You pay
for the provisioned capacity whether you use it or not.

A capacity is a quota-based system, and scaling up or down a capacity doesn't involve provisioning compute or moving data, so it’s instant.
Create fabric capacity

Once the capacity is created, we can see it in the Admin portal, on the Capacity Settings pane under the "Fabric Capacity" tab
Create fabric capacity
Turning on Microsoft Fabric

Enable Microsoft Fabric for your organization - Microsoft Fabric | Microsoft Learn
Demo
OneLake
OneLake for all data
“The OneDrive for data”

OneLake
A single unified logical SaaS data lake for the whole organization (no silos)
Organize data into domains
Foundation for all Fabric data items
Provides full and open access through industry standard APIs and formats to any application (no lock-in)

[Diagram: Data Factory, Synapse Data Warehousing, Synapse Data Engineering, Synapse Data Science, Synapse Real Time Analytics, Power BI, and Data Activator on top of OneLake (One Copy, One Security, OneLake Data Hub): the intelligent data fabric]


One Copy for all computes
Real separation of compute and storage

Compute powers the applications and experiences in Fabric. The compute is separate from the storage.

Multiple compute engines are available, and all engines can access the same data without needing to import or export it. You are able to choose the right engine for the right job.

Non-Fabric engines can also read/write to the same copy of data using the ADLS APIs or added through shortcuts

No matter which engine or item you use, everyone contributes to building the same lake.

Engines are being optimized to work with Delta Parquet as their native format

[Diagram: serverless compute engines (T-SQL, Spark, KQL, Analysis Services) over Finance (Warehouse), Customer 360 (Lakehouse), Service telemetry (Lakehouse), and Business KPIs (Warehouse) in Workspace A and Workspace B on OneLake, under unified management and governance]
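
Because OneLake is reachable through the ADLS APIs, a non-Fabric engine addresses an item's data with an ADLS-style URI. The following is a minimal sketch of building such URIs. It assumes the `onelake.dfs.fabric.microsoft.com` endpoint and the `<workspace>/<item>.<itemtype>/<path>` layout described in the Fabric documentation; the workspace, item, and file names are hypothetical.

```python
from urllib.parse import quote

# Assumption: the OneLake DFS endpoint as given in the Fabric documentation.
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_abfss_uri(workspace: str, item: str, item_type: str, path: str) -> str:
    """Build an abfss:// URI of the kind ADLS-aware engines such as Spark accept."""
    return f"abfss://{workspace}@{ONELAKE_HOST}/{item}.{item_type}/{path.lstrip('/')}"


def onelake_https_url(workspace: str, item: str, item_type: str, path: str) -> str:
    """Build the equivalent https:// URL for DFS-API-style access."""
    segments = [workspace, f"{item}.{item_type}", *path.strip("/").split("/")]
    return f"https://{ONELAKE_HOST}/" + "/".join(quote(s) for s in segments)


# Hypothetical workspace, lakehouse, and file names, for illustration only.
print(onelake_abfss_uri("WorkspaceA", "Customer360", "Lakehouse", "Files/raw/orders.csv"))
print(onelake_https_url("WorkspaceA", "Customer360", "Lakehouse", "Files/raw/orders.csv"))
```

The same path works for every engine, which is the point of "one copy": only the scheme differs between Spark-style and REST-style access.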
Shortcuts virtualize data across domains and clouds
No data movements or duplication

A shortcut is a symbolic link which points from one data location to another

Create a shortcut to make data from a warehouse part of your lakehouse

Create a shortcut within Fabric to consolidate data across items or workspaces without changing the ownership of the data. Data can be reused multiple times without data duplication.

Existing ADLS gen2 storage accounts and Amazon S3 buckets can be managed externally to Fabric and Microsoft while still being virtualized into OneLake with shortcuts

All data is mapped to a unified namespace and can be accessed using the same APIs including the ADLS Gen2 DFS APIs

[Diagram: Finance (Warehouse), Customer 360 (Lakehouse), Service telemetry (Lakehouse), and Business KPIs (Warehouse) in Workspace A and Workspace B on OneLake, under unified management and governance, with shortcuts to Azure and Amazon storage]
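
The "symbolic link" idea can be sketched with a toy in-memory namespace. This is an illustrative model only, not the Fabric shortcut API: the `Lake` class, its methods, and all paths are hypothetical. It shows that a read through a shortcut resolves to the single physical copy, so nothing is duplicated.

```python
from dataclasses import dataclass, field


@dataclass
class Lake:
    """Toy namespace: files live at exactly one path; shortcuts only point."""
    files: dict[str, bytes] = field(default_factory=dict)
    shortcuts: dict[str, str] = field(default_factory=dict)  # shortcut path -> target path

    def resolve(self, path: str, max_hops: int = 8) -> str:
        """Follow shortcut prefixes until the path refers to a physical location."""
        for _ in range(max_hops):
            for link, target in self.shortcuts.items():
                if path == link or path.startswith(link + "/"):
                    path = target + path[len(link):]
                    break
            else:
                return path  # no shortcut matched: this is the physical path
        raise RuntimeError("too many shortcut hops (possible cycle)")

    def read(self, path: str) -> bytes:
        return self.files[self.resolve(path)]


# Hypothetical items: a warehouse table surfaced inside a lakehouse in another workspace.
lake = Lake()
lake.files["WorkspaceB/Sales.Warehouse/Tables/orders/part-0.parquet"] = b"order rows"
lake.shortcuts["WorkspaceA/Finance.Lakehouse/Tables/orders"] = "WorkspaceB/Sales.Warehouse/Tables/orders"

data = lake.read("WorkspaceA/Finance.Lakehouse/Tables/orders/part-0.parquet")
print(data, len(lake.files))
```

Ownership stays with the target location in this model: deleting the shortcut entry removes only the pointer, not the data.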
OneLake Scenarios

Use OneLake with existing data lakes
Use and land data directly in OneLake
OneLake Data Hub
Discover, manage and use data in one place

Central location within Fabric to discover, manage, and reuse data

Data can be easily discovered by its domain (e.g. Finance) so users can see what matters for them

Efficient data discovery using search, filter and sort

Explorer capability to easily browse and find data by its folder (workspace) hierarchy
Lakehouse
Lakehouse

Data Source: Shortcut Enabled; Structured / Unstructured
Ingestion: Shortcuts; Pipelines & Dataflows
Store: Lakehouse(s); Lake
Transform: Notebooks & Dataflows
Expose: PBI; Warehouse
Lakehouse – Lakehouse mode

Managed

Unmanaged
Double-click file to view it
Right-click -> Load to Delta table
Right-click -> View table files

Table - This is a virtual view of the managed area in your lake. This is the main container to host tables of all types (CSV, Parquet, Delta, Managed tables and External tables). All tables, whether automatically or explicitly created, will show up as a table under the managed area of the Lakehouse. This area can also include any types of files or folder/subfolder organizations.

Files - This is a virtual view of the unmanaged area in your lake. It can contain any files and folder/subfolder structure. The main distinction between the managed area and the unmanaged area is the automatic Delta table detection process, which runs over any folders created in the managed area. Any Delta format files (Parquet + transaction log) will be automatically registered as a table and will also be available from the serving layer (T-SQL).
Automatic Table Discovery and Registration
Lakehouse table automatic discovery and registration is a feature of the lakehouse that provides a fully managed file-to-table experience for data engineers and data scientists. Users can drop a file into the managed area of the lakehouse and the file will be automatically validated for supported structured formats, which is currently only Delta tables, and registered into the metastore with the necessary metadata such as column names, formats, compression and more. Users can then reference the file as a table and use SparkSQL syntax to interact with the data, so you don’t need to explicitly call a CREATE TABLE statement to create tables to use with SQL.
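
As a rough illustration of what "Delta format files (Parquet + transaction log)" means on disk, the sketch below checks whether a folder carries a `_delta_log` directory with at least one JSON commit file, which is how the open Delta Lake format lays out its transaction log. It is not Fabric's actual discovery process; the function names and folder layout are hypothetical.

```python
import tempfile
from pathlib import Path


def looks_like_delta_table(folder: Path) -> bool:
    """Return True if the folder has a Delta transaction log with at least one commit."""
    log_dir = folder / "_delta_log"
    return log_dir.is_dir() and any(log_dir.glob("*.json"))


def discover_tables(managed_area: Path) -> list[str]:
    """List the names of top-level folders that look like Delta tables."""
    return sorted(p.name for p in managed_area.iterdir()
                  if p.is_dir() and looks_like_delta_table(p))


# Hypothetical layout: one Delta folder and one folder of plain Parquet files.
with tempfile.TemporaryDirectory() as tmp:
    tables = Path(tmp) / "Tables"
    (tables / "orders" / "_delta_log").mkdir(parents=True)
    (tables / "orders" / "_delta_log" / "00000000000000000000.json").write_text("{}")
    (tables / "orders" / "part-0000.parquet").write_bytes(b"")
    (tables / "staging").mkdir()
    (tables / "staging" / "part-0000.parquet").write_bytes(b"")
    discovered = discover_tables(tables)
    print(discovered)
```

Only the folder with a transaction log is picked up here; the plain Parquet folder is ignored, mirroring the statement that Delta is currently the only supported format for automatic registration.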
Lakehouse – SQL endpoint mode
NOTE: “Warehouse mode” was renamed “SQL endpoint”

Can query tables (not files).
Cannot modify data
Query options: SQL Query, Visual Query
Lakehouse – shortcuts (to lakehouse)
Workspaces and capacities accessing OneLake

[Diagram: Workspace A, Workspace B, and Workspace C; Capacity A and Capacity B; OneLake; a shortcut to the Sales lakehouse]

Each tenant will have only one OneLake, and any tenant can access files in a OneLake from other tenants via shortcuts
Demo
Data Warehouse
Data warehouse

Data Source: Shortcut Enabled; Structured / Unstructured
Ingestion: Mounts; Pipelines & Dataflows
Store: Data Warehouse
Transform: Procedures
Expose: PBI; Warehouse
Synapse Data Warehouse
Infinitely scalable and open

1. Open standard format in an open data lake replaces proprietary formats as the native storage
• First transactional data warehouse natively embracing an open standard format
• Data is stored in Delta – Parquet with no vendor lock-in
• Is auto-integrated and auto-optimized with minimal knobs
• Extends full SQL ecosystem benefits

2. Dedicated clusters are replaced by serverless compute infrastructure
• Physical compute resources assigned within milliseconds to jobs
• Infinite scaling with dynamic resource allocation tailored to data volume and query complexity
• Instant scaling up/down with no physical provisioning involved
• Resource pooling providing significant efficiencies and pricing

[Diagram: Synapse Data Warehouse in Fabric: multiple Data Warehouses on a Relational Engine, over (2) Infinite serverless compute, over (1) Open Storage Format in customer owned Data Lake]
Workspaces and capacities accessing OneLake

[Diagram: Workspace A, Workspace B, and Workspace C; Capacity A and Capacity B; OneLake; the Sales warehouse]

Each tenant will have only one OneLake, and any tenant can access files in a OneLake from other tenants via shortcuts
Data Warehouse

Use this to build a relational layer on top of the physical data in the Lakehouse and expose it to analysis and reporting tools using the T-SQL/TDS endpoint.

This offers a transactional data warehouse with T-SQL DML support, stored procedures, tables, and views

How can I control “bad actor” queries?
Fabric compute is designed to automatically classify queries to allocate resources and ensure high priority queries (i.e. ETL, data preparation, and reporting) are not impacted by potentially poorly written ad hoc queries.

How is the classification for an incoming query determined?
Queries are intelligently classified by a combination of the source (i.e., pipeline vs. Power BI) and the query type (i.e., INSERT vs. SELECT)
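
To make "source plus query type" concrete, here is a toy classifier. The slide does not give the actual rules, so everything here is an assumption made for illustration: the workload names, the keyword list, and the source labels are hypothetical and do not describe Fabric's real behavior.

```python
from enum import Enum


class Workload(Enum):
    DATA_PROCESSING = "data processing"  # e.g. ETL and data preparation
    REPORTING = "reporting"
    AD_HOC = "ad hoc"


# Assumption: statement keywords treated as data-modifying in this sketch.
WRITE_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "MERGE", "COPY", "CREATE"}


def classify(source: str, sql: str) -> Workload:
    """Classify by where the query came from and by its leading statement keyword."""
    keyword = sql.lstrip().split(None, 1)[0].upper() if sql.strip() else ""
    if source == "pipeline" or keyword in WRITE_KEYWORDS:
        return Workload.DATA_PROCESSING
    if source == "power_bi":
        return Workload.REPORTING
    return Workload.AD_HOC


# Hypothetical sources and statements.
print(classify("pipeline", "INSERT INTO sales SELECT * FROM staging"))
print(classify("power_bi", "SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(classify("query_editor", "select * from sales"))
```

A scheduler could then give the first two classes priority over ad hoc work, which is the isolation the slide describes.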

Where is the physical storage for the Data Warehouse?
All data for Fabric is stored in OneLake in the open Delta format. A single copy of the data is therefore exposed to all the compute engines of Fabric without needing to move or duplicate data
Access via other tools
Demo
Microsoft Fabric

Synapse Data Engineering
- Use Spark Notebooks
- Python, R, Scala
- Write data into Lakehouse tables

Synapse Data Warehousing
- Use SQL Queries & Stored Procedures
- Full T-SQL support
- Write data into Warehouse tables

Courtesy Simon Whiteley, Advancing Analytics

Why two options?

Delta lake shortcomings:
- No multi-table transactions
- Lack of full T-SQL support (no updates, limited reads)
- Performance problem for trickle transactions
Microsoft Fabric

Courtesy Simon Whiteley, Advancing Analytics


Microsoft Fabric

Bronze: Lakehouse
Silver: Lakehouse
Gold: Warehouse

Courtesy Simon Whiteley, Advancing Analytics
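
The Bronze/Silver/Gold layering can be sketched in a few lines. This shows only the concept (raw, then cleaned, then aggregated) on in-memory lists with made-up sample rows; in Fabric the first two layers would be Lakehouse tables and the last a Warehouse table, as the slide indicates.

```python
from collections import defaultdict

# Bronze: raw records landed as-is (hypothetical sample data).
bronze = [
    {"order_id": "1", "region": " east ", "amount": "10.50"},
    {"order_id": "2", "region": "West", "amount": "n/a"},  # unparseable amount
    {"order_id": "3", "region": "east", "amount": "4.50"},
]


def to_silver(rows):
    """Silver: typed and standardized rows; rows with unparseable amounts are dropped."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        clean.append({"order_id": int(row["order_id"]),
                      "region": row["region"].strip().lower(),
                      "amount": amount})
    return clean


def to_gold(rows):
    """Gold: a business-level aggregate ready for reporting."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```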


ADF
ADF Review
ADF pipelines (Synapse: Pipelines)
Mapping data flows (Synapse: Data flows)
Wrangling data flows, Standard View and Diagram View (Power BI: Dataflows; Synapse: No)

ADF Review, with the Fabric equivalents
ADF pipelines (Synapse: Pipelines) -> Data Pipelines
Mapping data flows (Synapse: Data flows) -> Don’t exist
Wrangling data flows (Synapse: No) -> Dataflow Gen2
Power BI: Dataflows -> Dataflow Gen1

Data Factory in Fabric
ADF: Power Query
PQ UI with the power of ADF (think of it as the next version of ADF PQ). Scale is still Excel/PBI scale, not yet ADF cloud scale

ADF Pipelines
New interface, but basically same as ADF

▪ ADF Data flows do not exist in Fabric
▪ Power Query is now called Dataflow Gen2 (which helps in that Power Query does more than just query). Scalable
▪ Power BI Dataflows are now called Dataflows Gen1
▪ Mounting option available to use ADF mapping data flows in Fabric (no option for Synapse yet). Can then do changes in Fabric (but not in ADF)

What is Dataflows Gen2?
This is the new generation of Dataflows Gen1. Dataflows provide a low-code interface for ingesting data from 100s of data sources, transforming your data using 300+ data transformations and loading the resulting data into multiple destinations such as Azure SQL Databases, Lakehouse, and more

We currently have multiple Dataflows experiences with Power BI Dataflows Gen1, Power Query Dataflows and ADF Data flows. What is the strategy with Fabric with these various experiences?
Our goal is to evolve over time with a single Dataflow that combines the ease of use of PBI, Power Query and the scale of ADF

What is Fabric Pipelines?
Fabric pipelines enable powerful workflow capabilities at cloud-scale. With data pipelines, you can build complex workflows that can refresh your dataflow, move PB-size data, and define sophisticated control flow pipelines. Use data pipelines to build complex ETL and Data Factory workflows that can perform a number of different tasks at scale. Control flow capabilities are built into pipelines that will allow you to build workflow logic which provides loops and conditionals.
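
The loop-and-conditional control flow described above can be sketched in plain Python. This is not the Fabric pipeline definition format; `run_pipeline` and the stand-in activities are hypothetical and only show the shape of the logic: copy each table in a loop, then refresh the dataflow only if any rows arrived.

```python
from typing import Callable


def run_pipeline(tables: list[str],
                 copy_table: Callable[[str], int],
                 refresh_dataflow: Callable[[], None]) -> dict[str, int]:
    """Copy each table (loop), then refresh the dataflow only if rows arrived (conditional)."""
    rows_copied = {}
    for table in tables:                   # loop, like a "for each" activity
        rows_copied[table] = copy_table(table)
    if sum(rows_copied.values()) > 0:      # conditional, like an "if" activity
        refresh_dataflow()
    return rows_copied


# Hypothetical stand-ins for real copy and refresh activities.
source_rows = {"customers": 120, "orders": 0}
refreshed = []
result = run_pipeline(list(source_rows), source_rows.__getitem__,
                      lambda: refreshed.append(True))
print(result, refreshed)
```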
Power BI / DirectLake
For best performance you should compress the data using the V-Order compression method (50%-70% more compression). Stored this way by ADF by default

Benefits:
No more scheduled imports
Should I use Fabric now?
▪ Yes, for prototyping
▪ Yes, if you won’t be in production for several months
▪ You have to be OK with bugs, missing features, and possible performance issues
▪ Don’t use if you have hundreds of terabytes
If building in Synapse, how to make the transition to Fabric smooth?

▪ Do not use dedicated pools, unless needed for serving and performance
▪ Don’t use any stored procedures to modify data in dedicated pools
▪ Use ADF for pipelines and for Power Query, and don’t use ADF mapping data flows. Don’t use Synapse pipelines or mapping data flows
▪ Embrace the data lakehouse architecture
Resources
Microsoft Fabric webinar series: https://aka.ms/fabric-webinar-series

New documentation: https://aka.ms/fabric-docs. Check out the tutorials.

Data Mesh, Data Fabric, Data Lakehouse – (video from Toronto Data Professional Community on 2/15/23)

Build videos:
Build 2-day demos
Microsoft Fabric Synapse data warehouse, Q&A

My intro blog on Microsoft Fabric (with helpful links at the bottom)

Fabric notes

Advancing Analytics videos

Ask me Anything (AMA) about Microsoft Fabric!


Q&A ?
James Serra, Microsoft, Industry Advisor
Email me at: [email protected]
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com
