Microsoft Fabric
James Serra
Industry Advisor
Microsoft, Federal Civilian
[email protected]
6/16/23
About Me
▪ Microsoft, Data & AI Solution Architect in Microsoft Federal Civilian
▪ At Microsoft for most of the last nine years as a Data & AI Architect, with a brief stop at EY
▪ In IT for 35 years, worked on many BI and DW projects
▪ Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
▪ Been perm employee, contractor, consultant, business owner
▪ Presenter at PASS Summit, SQLBits, Enterprise Data World conference, Big Data Conference
Europe, SQL Saturdays, Informatica World
▪ Blog at JamesSerra.com
▪ Former SQL Server MVP
▪ Author of book “Deciphering Data Architectures: Choosing Between a Modern Data Warehouse,
Data Fabric, Data Lakehouse, and Data Mesh”
My upcoming book; the first two chapters are available now:
Deciphering Data Architectures (oreilly.com)
Table of contents:
- Foundation
- Big data
- Types of data architectures
- Architecture Design Session
- Common data architecture concepts
- Relational Data Warehouse
- Data Lake
- Approaches to Data Stores
- Approaches to Design
- Approaches to Data Modeling
- Approaches to Data Ingestion
- Data Architectures
- Modern Data Warehouse (MDW)
- Data Fabric
- Data Lakehouse
- Data Mesh Foundation
- Data Mesh Adoption
- People, Process, and Technology
- People and process
- Technologies
- Data architectures on Microsoft Azure
Agenda
▪ What is Microsoft Fabric?
▪ Workspaces and capacities
▪ OneLake
▪ Lakehouse
▪ Data Warehouse
▪ ADF
▪ Power BI / DirectLake
▪ Resources
▪ Not covered:
▪ Real-time analytics
▪ Spark
▪ Data science
▪ Fabric capacities
▪ Billing / Pricing
▪ Reflex / Data Activator
▪ Git integration
▪ Admin monitoring
▪ Purview integration
▪ Data mesh
▪ Copilot
Microsoft Fabric does it all—in a unified solution
An end-to-end analytics platform that brings together all the data and analytics tools that
organizations need to go from the data lake to the business user
Workloads: Data Integration, Data Engineering, Data Warehouse, Data Science, Real-Time Analytics, Business Intelligence, Observability
Unified: SaaS product experience, security and governance, compute and storage, business model
Microsoft Fabric
The data platform for the era of AI
Workloads on one foundation: Data Factory, Synapse Data Engineering, Synapse Data Science, Synapse Data Warehousing, Synapse Real-Time Analytics, Power BI, Data Activator - all on OneLake, the intelligent data foundation.
Single: onboarding and trials, sign-on, navigation model, UX model, workspace organization (AI assisted), collaboration experience (shared workspaces), universal compute capacities, storage format (one data lake), data copy for all engines, security model (OneSecurity), CI/CD, monitoring hub, data hub, governance & compliance.
SaaS, "it just works": success by default ("5x5"), instant provisioning, auto-optimized, centralized administration, centralized security management, data stewards.
• Maintain visibility and control of costs with a unified consumption and cost model that provides an always-current view of spend across your end-to-end data estate
• Gain full visibility and governance over your entire analytics estate from data sources and connections to your data lake, to users and their insights
Workspaces and capacities
Company examples
Create Fabric capacity
A capacity is a dedicated set of resources reserved for exclusive use. It offers dependable,
consistent performance for your content. Each capacity offers a selection of SKUs, and
each SKU provides a different resource tier of memory and compute power. You pay
for the provisioned capacity whether you use it or not.
Once the capacity is created, it appears in the Admin portal's Capacity settings pane under the "Fabric Capacity" tab.
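Capacities are created in the Azure portal, but they are also ordinary ARM resources, so a provisioning sketch with the generic Azure SDK looks roughly like this (the provider namespace, resource type, API version, SKU name, and administration property are assumptions to verify against current docs):

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.resources.begin_create_or_update(
    resource_group_name="my-rg",
    resource_provider_namespace="Microsoft.Fabric",   # assumed provider namespace
    parent_resource_path="",
    resource_type="capacities",                       # assumed resource type
    resource_name="myfabriccapacity",
    api_version="2023-11-01",                         # assumed API version
    parameters={
        "location": "eastus",
        "sku": {"name": "F64", "tier": "Fabric"},     # assumed SKU name/tier
        "properties": {"administration": {"members": ["admin@contoso.com"]}},
    },
)
print(poller.result().id)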
Create Fabric capacity
Turning on Microsoft Fabric
Enable Microsoft Fabric for your organization - Microsoft Fabric | Microsoft Learn
Demo
OneLake
OneLake for all data
“The OneDrive for data”
OneLake
- A single, unified, logical SaaS data lake for the whole organization (no silos)
- Serves every workload: Data Factory, Synapse Data Warehousing, Synapse Data Engineering, Synapse Data Science, Synapse Real-Time Analytics, Power BI, Data Activator
- Organize data into domains
- Foundation for all Fabric data items
- Provides full and open access through industry-standard APIs and formats to any application (no lock-in)
- One copy of data, shared across workspaces and engines
- Use OneLake with existing data lakes, or use and land data directly in OneLake
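The "industry-standard APIs" include an ADLS Gen2-compatible endpoint, so the standard Azure Storage SDK works against OneLake. A minimal sketch, assuming placeholder workspace and lakehouse names:

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# OneLake speaks the ADLS Gen2 DFS API; the workspace plays the role of the container
service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("MyWorkspace")

# Fabric items are top-level folders, e.g. a lakehouse's unmanaged Files area
for p in fs.get_paths(path="MyLakehouse.Lakehouse/Files"):
    print(p.name)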
OneLake Data Hub
Discover, manage and use data in one place
Transform: Notebooks & Dataflows
Lakehouse – Lakehouse mode
(Screenshot: Managed and Unmanaged areas; double-click a file to view it)
Tables - This is a virtual view of the managed area of your lake. It is the main container for
tables of all types (CSV, Parquet, Delta, managed tables, and external tables). All tables, whether
automatically or explicitly created, show up as tables under the managed area of the lakehouse.
This area can also include any type of file or folder/subfolder organization.
Files - This is a virtual view of the unmanaged area of your lake. It can contain any files and
folder/subfolder structure. The main distinction between the managed area and the unmanaged
area is the automatic Delta table detection process, which runs over any folders created in the
managed area. Any Delta-format files (Parquet + transaction log) are automatically registered as a
table and are also available from the serving layer (T-SQL).
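A minimal sketch of the distinction, assuming a Fabric notebook attached to a lakehouse (the notebook provides the spark session; file and table names are placeholders):

# Unmanaged area: land any file under Files/ - nothing gets registered
df = spark.read.option("header", True).csv("Files/raw/sales.csv")

# Managed area: write Delta under Tables/ - auto-discovery registers it as a table
df.write.format("delta").mode("overwrite").save("Tables/sales")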
Automatic Table Discovery and Registration
Lakehouse table automatic discovery and registration is a feature of the lakehouse that provides a fully managed
file-to-table experience for data engineers and data scientists. Users can drop a file into the managed area of the
lakehouse, and the file is automatically validated against the supported structured formats (currently only
Delta tables) and registered in the metastore with the necessary metadata such as column names, formats,
compression, and more. Users can then reference the file as a table and use Spark SQL syntax to interact with the
data, so there is no need to explicitly run a CREATE TABLE statement to create tables for use with SQL.
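Continuing the sketch above: because the Delta folder was written under Tables/, it is registered automatically and can be queried by name with no CREATE TABLE:

# The auto-registered table is immediately addressable in Spark SQL...
spark.sql("SELECT COUNT(*) AS n FROM sales").show()
# ...and is also exposed to T-SQL through the lakehouse's SQL endpoint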
Lakehouse – SQL endpoint mode
NOTE: “Warehouse mode” was renamed “SQL endpoint”
Visual Query
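The SQL endpoint accepts ordinary TDS connections, so any SQL Server client works. A hedged sketch with pyodbc (copy the real server name from the SQL endpoint's settings; the table name is a placeholder):

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<copied-from-SQL-endpoint-settings>;"
    "Database=MyLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"   # Azure AD sign-in prompt
)
# Read-only T-SQL over the lakehouse's Delta tables
for row in conn.execute("SELECT TOP 10 * FROM dbo.sales"):
    print(row)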
Lakehouse – shortcuts (to lakehouse)
Workspaces and capacities accessing OneLake
(Diagram: multiple workspaces, each backed by a capacity, accessing the same lakehouse in OneLake)
Each tenant has only one OneLake, and any tenant can access files in another tenant's OneLake via shortcuts.
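A shortcut created in the lakehouse UI surfaces as an ordinary folder or table, so consuming it is no different from consuming native data. A small sketch, again from a Fabric notebook with placeholder names:

# A shortcut under Tables/ behaves like any other Delta table
df = spark.read.format("delta").load("Tables/sales_from_other_workspace")
df.show(5)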
Demo
Data Warehouse
Data warehouse
Transform: stored procedures
Synapse Data Warehouse
Infinitely scalable and open
(Diagram: multiple workspaces, each backed by a capacity, accessing the same warehouse in OneLake; as with the lakehouse, cross-tenant access goes through shortcuts)
Data Warehouse
(Screenshots: query editor in Standard View and Diagram View)
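Warehouse transforms are plain T-SQL, so the same pyodbc pattern as the SQL endpoint sketch applies; a hedged example of a stored-procedure transform (names and schema are placeholders):

import pyodbc

conn = pyodbc.connect("<same connection-string style as the SQL endpoint sketch>")
cursor = conn.cursor()
# Define a simple transform as a stored procedure in the warehouse
cursor.execute("""
    CREATE OR ALTER PROCEDURE dbo.load_daily_sales AS
    BEGIN
        DELETE FROM dbo.daily_sales;
        INSERT INTO dbo.daily_sales (order_date, total)
        SELECT order_date, SUM(amount) FROM dbo.sales GROUP BY order_date;
    END
""")
cursor.execute("EXEC dbo.load_daily_sales")
conn.commit()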
ADF
Dataflows Gen2:
- PQ UI with the power of ADF (think of it as the next version of ADF PQ)
- Power Query is now called Dataflow Gen2 (which helps in that Power Query does more than just query); Power BI Dataflows are now called Dataflows Gen1
- Scale is still Excel/PBI scale, not yet ADF cloud scale
- ADF mapping data flows do not exist in Fabric
- A mounting option is available to use ADF mapping data flows in Fabric (no option for Synapse yet); you can then make changes in Fabric (but not in ADF)
Pipelines:
- New interface, but basically the same as ADF; scalable
Q: We currently have multiple Dataflows experiences: Power BI Dataflows Gen1, Power Query Dataflows, and ADF data flows. What is the strategy in Fabric for these various experiences?
A: Our goal is to evolve over time to a single Dataflow that combines the ease of use of PBI and Power Query with the scale of ADF.
Q: What are Fabric pipelines?
A: Fabric pipelines enable powerful workflow capabilities at cloud scale. With data pipelines you can build complex workflows that refresh your dataflows, move PB-size data, and define sophisticated control flow. Use data pipelines to build complex ETL and Data Factory workflows that perform a number of different tasks at scale. Control flow capabilities are built into pipelines, letting you build workflow logic with loops and conditionals.
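Pipeline runs are usually scheduled in the UI, but as a hedged sketch, an on-demand run can also be triggered through the Fabric REST API's job-scheduler endpoint (the endpoint shape and jobType value are assumptions to verify; the IDs and token are placeholders):

import requests

token = "<AAD-bearer-token-for-api.fabric.microsoft.com>"
url = (
    "https://api.fabric.microsoft.com/v1/workspaces/<workspace-id>"
    "/items/<pipeline-item-id>/jobs/instances?jobType=Pipeline"
)
resp = requests.post(url, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()  # expect 202 Accepted; poll the Location header for run status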
Power BI / DirectLake
For best performance, compress the data using the V-Order compression method (50%-70% more compression); data is stored this way by ADF by default.
Benefits: no more scheduled imports.
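In Fabric Spark, V-Order is likewise applied at write time via a session setting (enabled by default in Fabric runtimes; config name per the Fabric docs, df and table name are placeholders from the earlier notebook sketches):

# Ensure V-Order is applied to the Parquet files backing the Delta table
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
df.write.format("delta").mode("overwrite").save("Tables/sales_vorder")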
Should I use Fabric now?
Yes, for prototyping
Yes, if you won’t be in production for several months
You have to be OK with bugs, missing features, and possible performance issues
Don't use it if you have hundreds of terabytes
If building in Synapse now, how do you make the transition to Fabric smooth?
Do not use dedicated pools, unless needed for serving and performance
Don’t use any stored procedures to modify data in dedicated pools
Use ADF for pipelines and for Power Query, and don't use ADF mapping data flows. Don't use
Synapse pipelines or mapping data flows
Embrace the data lakehouse architecture
Resources
Microsoft Fabric webinar series: https://aka.ms/fabric-webinar-series
Data Mesh, Data Fabric, Data Lakehouse – (video from Toronto Data Professional Community on 2/15/23)
Build videos:
Build 2-day demos
Microsoft Fabric Synapse data warehouse, Q&A
Fabric notes