✅ Data Types
Data is a collection of facts such as numbers, descriptions, and observations used to
record information. We can classify data as structured, semi-structured, or unstructured.
Avro format works well with a message bus such as Event Hubs or Kafka that writes
multiple events/messages in succession. Also good for workloads having a lot of ETL
jobs, thus best for landing/raw zone.
ORC (Optimized Row Columnar) format was developed for optimizing read and write operations in Apache Hive.
📌 TIP: If you still need to store the data in any of the semi-structured formats
such as CSV, JSON, XML, and so on, consider compressing them using
Snappy compression.
✅ Availability Zones
Availability zones are physically separate data centers within an Azure region. Each availability zone is made up of one or more data centers equipped with independent power, cooling, and networking. An availability zone is set up to be an isolation boundary: if one zone goes down, the other continues working. There's a minimum of three availability zones within a single region, where applicable; however, not all regions have availability zones.
✅ OLTP vs OLAP
Comparison table: Online Transactional Processing (OLTP) vs. Online Analytical Processing (OLAP)
✅ Stream processing
After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream data is
then written to an output sink.
1) Lambda Architecture
A batch layer (cold/slow path) stores all of the incoming data in its raw form and
performs batch processing on the data. The result of this processing is stored as
a batch view.
A speed layer (hot/fast path) analyzes data in real time. This layer is designed for
low latency, at the expense of accuracy.
2) Kappa Architecture
A drawback to the lambda architecture is its complexity. Processing logic appears in two
different places — the cold and hot paths — using different frameworks. This leads to
duplicate computation logic and the complexity of managing the architecture for both
paths.
In Kappa though, all data flows through a single path, using a stream processing
system.
✅ Partitioning
In many large-scale solutions, data is divided into partitions that can be managed and
accessed separately. Partitioning can improve scalability, reduce contention, and optimize
performance. It can also provide a mechanism for dividing data by usage pattern.
There are three typical strategies for partitioning data:
1) Horizontal partitioning (sharding)
In this strategy, each partition is a separate data store, but all partitions have the same schema. Each partition (known as a shard) holds a subset of the rows, for example all the orders for a specific set of customers.
2) Vertical partitioning
In this strategy, each partition holds a subset of the fields for items in the data store. The
fields are divided according to their pattern of use. For example, frequently accessed
fields might be placed in one vertical partition and less frequently accessed fields in
another.
3) Functional partitioning
In this strategy, data is aggregated according to how it is used by each bounded context
in the system. For example, an e-commerce system might store invoice data in one
partition and product inventory data in another.
1) Azure Blob:
Blob (binary large object) Storage is
an object storage solution. It is the
cheapest option to store unstructured
data (no restriction on the type of
data) that won’t be queried.
Every blob lives inside a blob container. You can store an unlimited number of blobs in a
container and an unlimited number of containers in a storage account. Containers are
"flat"; they can only store blobs, not other containers. Blob Storage does not provide
any mechanism for searching or sorting blobs by metadata.
📌 Technically, containers are "flat" and don't support any kind of nesting or
hierarchy. But if you give your blobs hierarchical names that look like file paths
(such as finance/budgets/2017/q1.xls), the API's listing operation can filter
results to specific prefixes. This enables you to navigate the list as if it was a
hierarchical system of files and folders. This feature is often called virtual
directories.
Here enters Microsoft Blob Storage, which can store massive amounts of unstructured data. Blob Storage is general-purpose object storage.
Differences (on-premises Hadoop vs. Azure): on-premises clusters are tightly coupled with HDFS, whereas in Azure, storage is separate from the compute clusters.
• ABFS (Azure Blob File System) is a dedicated driver for Hadoop running on Azure Blob storage. Think of the data as if it were stored in a Hadoop Distributed File System (HDFS): Azure Data Lake Storage organizes the stored data into a hierarchy of directories and subdirectories, much like a file system, for easier navigation. As a result, data processing requires fewer computational resources, reducing both time and cost.
➖ {Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/
📌 TIP: Avoid putting date folders at the beginning as it makes applying ACLs to
every subfolder more tedious.
2) Azure Files:
Azure Files offers fully managed file
shares in the cloud that can be
accessed and managed like a file
server using the industry standard
Server Message Block (SMB) and
Network File System (NFS)
protocols.
Share data between on-premises applications and Azure VMs to allow migration of apps to the cloud over a period of time.
Store shared configuration files for VMs, tools, or utilities so that everyone uses the same version.
Store log files such as diagnostics, metrics, and crash dumps.
3) Azure Queue:
A messaging store used to store a large number of messages that can be accessed from anywhere via authenticated HTTP or HTTPS calls.
There are two types of queues: Storage queues and Service Bus.
Storage queues can be used for simple asynchronous message processing. They
can store up to 500 TB of data (per storage account) and each message can be up to
64 KB in size.
Service Bus provides advanced features plus the message sizes can be up to 1 MB
but the overall size is capped at 80 GB.
4) Azure Table:
A NoSQL store that hosts non-relational data independent of any fixed schema. It makes use of tables containing key-value data items.
📌 To read data in archive storage, you must first change the tier of the blob to hot
or cool. This process is known as rehydration and can take hours to complete.
Optimize data access patterns – reduce unnecessary scanning of files, read only
the data you need to read
In the case of processing real-time data, you can use a real-time streaming engine
(such as Azure Stream Analytics or Spark Streaming) in conjunction with a message
broker (such as Event Hub or Apache Kafka) to store your data as larger files.
2) Partitioning Strategy
An effective partitioning scheme for your data can improve the performance of your
analytics pipeline and also reduce the overall transaction costs incurred with your query.
In simplistic terms, partitioning is a way of organizing your data by grouping datasets with
similar attributes together in a storage entity, such as a folder. When your data processing
pipeline is querying for data with that similar attribute (E.g. all the data in the past 12
hours), the partitioning scheme (in this case, done by DateTime) lets you skip over the
irrelevant data and only seek the data that you want.
Raw data: This is data as it comes from the source systems. This data is stored as is
in the data lake and is consumed by an analytics engine such as Spark to perform
cleansing and enrichment operations to generate the curated data. The data in the
raw zone is sometimes also stored as an aggregated data set, e.g. in the case of
streaming scenarios, data is ingested via a message bus such as Event Hub, and then
aggregated via a real-time processing engine such as Azure Stream Analytics or
Spark Streaming before storing in the data lake.
Enriched data: This layer of data is the version where the raw data (as is or aggregated) has a defined schema, and the data is cleansed and enriched (with other sources) and made available to analytics engines to extract high-value data.
Curated data: This layer of data contains the high-value information that is served to
the consumers of the data – the BI analysts and the data scientists. This data has
structure and can be served to the consumers either as is (E.g. data science
notebooks) or through a data warehouse. Data assets in this layer are usually highly
governed and well documented.
Workspace data: In addition to the data that is ingested by the data engineering team from the source, the consumers of the data can also choose to bring other data sets that could be valuable. In this case, the data platform can allocate a workspace for these consumers so they can use the curated data along with the other data sets they bring.
✅ Latency metrics
Request rate is measured in input/output operations per second (IOPS). For a single outstanding operation, the request rate is calculated by dividing one second by the time it takes to complete one request. E.g., let us assume that a request from a single-threaded application with one outstanding read/write operation takes 10 ms to complete.
Request Rate = 1sec/10ms = 1000ms/10ms = 100 IOPS
This means the outstanding read/write would achieve a request rate of 100 IOPS.
Azure Storage provides two latency metrics for block blobs. These metrics can be viewed
in the Azure portal:
End-to-end (E2E) latency measures the interval from when Azure Storage receives
the first packet of the request from a client until Azure Storage receives a client
acknowledgment on the last packet of the response.
Server latency measures the interval from when Azure Storage receives the last
packet of the request from a client until the first packet of the response is returned
from Azure Storage.
📌 TIP: Periodically rotate access keys to ensure they remain private, just like
changing your passwords. We can also use an Azure Key Vault to store the
access key which includes the support to synchronize directly to the Storage
Account and automatically rotate the keys periodically.
📌 An ACL can never supersede an RBAC role; it can only augment the role with additional permissions. For example, a user who has been granted an RBAC role on blob storage but has been denied read, write, and execute permissions in the ACL will still have those permissions through the RBAC role.
📌 RBAC provides coarse-grained permissions to the data lake or to folders inside it.
These are used to allow or deny permissions to the folder structure but typically do not dictate the ability of the user to perform actions against the data.
ACLs are used to define fine-grained permissions to the data; this is where the ability of the user to read, write, modify, or delete the data is set.
6) Encryption in Transit
7) Encryption at Rest
Encryption at rest is enabled by default and can’t be disabled. All data written to Azure
Storage is automatically encrypted by Storage Service Encryption (SSE) with a 256-bit
Advanced Encryption Standard (AES) cipher.
SSE automatically encrypts data when writing it to Azure Storage. When you read data
from Azure Storage, Azure Storage decrypts the data before returning it. This process
incurs no additional charges and doesn't degrade performance. It can't be disabled.
Azure Data Factory can have one or more pipelines. A pipeline is a logical grouping of
activities that together perform a task. For example, a pipeline could contain a set of
activities that ingest and clean log data, and then kick off a mapping data flow to analyze
the log data. The pipeline allows you to manage the activities as a set instead of each one
individually. You deploy and schedule the pipeline instead of the activities independently.
Now, a dataset is a named view of data that simply points or references the data you
want to use in your activities as inputs and outputs. Before you create a dataset, you
must create a linked service to link your data store to the Data Factory. Linked services
are like connection strings, which define the connection information needed for the
service to connect to external resources.
Think of it this way; the dataset represents the structure of the data within the linked data
stores, and the linked service defines the connection to the data source. For example, to
copy data from Blob storage to a SQL Database, you create two linked services: Azure
Storage and Azure SQL Database. Then, create two datasets: an Azure Blob dataset
(which refers to the Azure Storage linked service) and an Azure SQL Table dataset (which
refers to the Azure SQL Database linked service).
In Data Flow, datasets are used in source and sink transformations. The datasets define the basic data schemas. If your data has no schema, you can use schema drift for your source and sink (more on schema drift later). Pipeline runs are typically instantiated by passing arguments to the parameters defined in the pipeline, either manually (on demand) or through a trigger.
✅ Linked Service
The Linked Service represents the connection information that enables the ingestion of
data from external resources such as a data store (Azure SQL Server) or compute
service (Spark Cluster).
✅ Dataset
Datasets represent data structures within your data stores. They point to (or reference) the data that we want to use in our activities as inputs and outputs, and they reference a Linked Service for the connection.
📌 Annotations:
When monitoring data pipelines, you may want to be able to filter and monitor a
certain group of activities, such as those of a project or specific department's
pipelines. You can achieve these using annotations.
Annotations are tags that you can add (only static values) to pipelines,
datasets, linked services, and triggers to easily identify them. For more, click
here.
Because there are many activities that are possible in a pipeline in Azure Data Factory,
activities can be grouped into three categories:
✅ Data Flows
Data Flows are used to build code-free transformation logic that is executed on automatically provisioned Apache Spark clusters. ADF internally handles all the code translation, Spark optimization, and execution of the transformation.
Data flow has a unique authoring canvas designed to make building transformation logic
easy. The data flow canvas is separated into three parts: the top bar, the graph, and the
configuration panel.
→ Graph
The graph displays the transformation stream. It
shows the lineage of source data as it flows into
one or more sinks. To add a new source,
select Add source. To add a new
transformation, select the plus sign on the lower
right of an existing transformation.
→ Configuration panel
The configuration panel shows the settings specific to the currently selected
transformation. If no transformation is selected, it shows the data flow. In the overall data
flow configuration, you can add parameters via the Parameters tab. Each transformation
contains at least four configuration tabs. Read more here
1) Transformation settings
2) Optimize
The Optimize tab contains settings to configure partitioning schemes.
3) Inspect
The Inspect tab provides a view into the metadata of the data stream that you're
transforming. You can see column counts, the columns changed, the columns added,
data types, the column order, and column references. Inspect is a read-only view of
your metadata. You don't need to have debug mode enabled to see metadata in
the Inspect pane.
4) Data preview
If debug mode is on, the Data Preview tab gives you an interactive snapshot of the
data at each transform.
Note: Filter transformation in data flow is different from Filter activity in control
flow
Here is a sample expression that can be used to create date directories and automatic
partitioning:
Schema drift is the case where your sources often change metadata. Fields, columns, and types can be added, removed, or changed on the fly. Without handling schema drift, your data flow becomes vulnerable to upstream data source changes.
If schema drift is enabled, make sure the Auto-mapping slider in the Mapping tab is turned
on. With this slider on, all incoming columns are written to your destination. Otherwise, you
must use rule-based mapping to write drifted columns.
✅ Parameters
1) Dataset Parameters
When working with a database with multiple tables in it, instead of creating a new dataset
for using each of them, dataset parameters can be used to pass the table names at run
time.
You will need to create the dataset without specifying the table name; instead, reference a dataset parameter (for example, @dataset().<parameter name>) for the table property and use the pipeline control-flow expression language to set a dynamic value.
The reason for parameterizing mapping data flows is to make sure that your data flows
are generalized, flexible, and reusable.
Data flow is one of the activities in ADF pipeline, so the way to pass the parameters
to it is the same as passing pipeline parameters above.
✅ Integration Runtime
ADF is a managed service (PaaS), i.e., it will create the required computing infrastructure to complete the activity. This is known as the integration runtime (IR). Thus, the IR provides a fully managed, serverless computing infrastructure. There are three types of Integration Runtime, which are discussed later.
The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory
and Azure Synapse pipelines to provide the following data integration capabilities across
different network environments:
Data movement: Copy data across data stores in public or private networks (both on-premises and virtual private networks).
SSIS package execution: Natively execute SQL Server Integration Services (SSIS)
packages in a managed Azure compute environment.
The Azure-SSIS IR is a fully managed cluster of Azure VMs dedicated to running your
SSIS packages.
Link to an example showing the detailed implementation steps
✅ Triggers
Azure Data Factory Triggers come in three different types: Schedule Trigger, Tumbling
Window Trigger, and Event-based Trigger.
1) Schedule Trigger
This Azure Data Factory Trigger is a popular
trigger that can run a Data Pipeline according to
a predetermined schedule. It provides extra
flexibility by allowing for different scheduling
intervals like a minute(s), hour(s), day(s),
week(s), or month(s).
Azure Data Factory can help you iteratively debug pipelines when developing data integration solutions; you don't need to publish changes to the pipeline or activities before debugging. Sometimes, you don't want to debug the whole pipeline but only test a part of it. You can test the pipeline end to end or set a breakpoint. By doing so in debug mode, you can interactively see the results of each step while you build and debug your pipeline.
✅ Azure Databricks
Databricks is a comprehensive data analytics solution built on Apache Spark. It offers native SQL capabilities as well as workload-optimized Spark clusters for data analytics and data science, and it provides an interactive user interface for working with clusters, notebooks, jobs, and data.
Remember that Spark is a replacement for MapReduce, not Hadoop; it's a part of the Hadoop ecosystem. Thus, Spark requires two more things to work: storage (local storage/HDFS/Amazon S3) and a resource manager (YARN/Mesos/Kubernetes). Spark itself is only the processing engine.
Workspace
It is an environment for accessing all of Azure Databricks assets. The workspace
organizes objects such as notebooks, libraries, queries, and dashboards into folders, and
provides access to data and computational resources such as clusters and jobs. Each
workspace is isolated from others and each workspace has its own identifier.
1) Interactive Cluster-
Multiple users can interactively analyze the data together. Need to terminate the cluster
after job completion. These are comparatively costly and can autoscale on demand.
1. Standard Cluster Mode- This is used for single-user use, and provides no fault
isolation. Supports Scala, Python, SQL, R, and Java.
2. High Concurrency Cluster Mode- This is used for multiple users, and provides
fault isolation along with maximum cluster utilization. Supports only Python, SQL
& R. The performance, security, and fault isolation of high concurrency clusters is
provided by running user code in separate processes, which is not possible in
Scala.
Hive metastore
Hive is an open-source technology used to define a relational abstraction layer of tables
over file-based data. The tables can then be queried using SQL syntax. The table
definitions and details of the file system locations on which they're based are
stored in the metastore for a Spark cluster. A Hive metastore is created for each
cluster when it's created, but you can configure a cluster to use an existing external
metastore if necessary.
Delta Lake
SQL Warehouses
SQL Warehouses are relational compute resources with endpoints that enable client
applications to connect to an Azure Databricks workspace and use SQL to work with data
in tables. SQL Warehouses are only available in premium tier Azure Databricks
workspaces.
✅ Internal working
In Databricks, the notebook interface is typically the driver program. SparkContext, an
object of the driver program runs the main function, creates distributed datasets on the
cluster, applies parallel operations to the cluster nodes, and then collects the results of
the operations.
Driver programs access Apache Spark through a SparkSession object. The nodes read
and write data from and to the file system and cache transformed data in-memory as
Resilient Distributed Datasets (RDDs). The SparkContext is responsible for converting
an application to a directed acyclic graph (DAG). The graph consists of individual tasks
that get executed within an executor process on the nodes. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads.
Each executor has multiple slots to which parallelized tasks can be assigned.
Mounting file-based storage to DBFS using Service Principal allows seamless access to data from the
storage account without requiring credentials after the first time
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

productSchema = StructType([
StructField("ProductID", IntegerType()),
StructField("ProductName", StringType()),
StructField("Category", StringType()),
StructField("ListPrice", FloatType())
])
df = spark.read.load('/data/product-data.csv',
format='csv',
schema=productSchema,
header=False)
display(df.limit(10))
#Filtering
pricelist_df = df["ProductID", "ListPrice"]
bikes_df = df["ProductName", "ListPrice"].where((df["Category"]=="Mountain Bikes") | (df["Category"]=="Road Bikes"))
#Grouping
counts_df = df.select("ProductID", "Category").groupBy("Category").count()
# We can use the %%sql magic to run SQL code that queries objects in the catalog
%%sql
-- for example (hypothetical table name registered in the catalog):
SELECT * FROM products LIMIT 10
✅ Delta lake
Delta Lake is an open-source storage layer for Spark that enables relational database
capabilities for batch and streaming data. By using Delta Lake, you can implement a data
lakehouse architecture in Spark to support SQL based data manipulation semantics with
support for transactions and schema enforcement. The result is an analytical data store that offers many of the advantages of a relational database system with the flexibility of data files stored in a data lake.
The benefits of using Delta Lake in Azure Databricks include:
Relational tables that support querying and data modification. With Delta Lake,
you can store data in tables that support CRUD (create, read, update, and delete)
operations. In other words, you can select, insert, update, and delete rows of data in
the same way you would in a relational database system.
Data versioning and time travel. Because all transactions are logged in the
transaction log, you can track multiple versions of each table row, and even use
the time travel feature to retrieve a previous version of a row in a query.
Support for batch and streaming data. While most relational databases include
tables that store static data, Spark includes native support for streaming data through
the Spark Structured Streaming API. Delta Lake tables can be used as
both sinks (destinations) and sources for streaming data.
Standard formats and interoperability. The underlying data for Delta Lake tables is
stored in Parquet format, which is commonly used in data lake ingestion pipelines.
Additionally, you can use the serverless SQL pool in Azure Synapse Analytics to
query Delta Lake tables in SQL.
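As a quick illustration of that last point, a serverless SQL pool can read a Delta folder directly with OPENROWSET; this is a minimal sketch, and the storage URL and path are placeholders:

-- Query a Delta Lake folder directly from a Synapse serverless SQL pool
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/data/delta/products/',
    FORMAT = 'DELTA'
) AS products;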
After saving the delta table, the path location you specified includes parquet files for the
data (regardless of the format of the source file you loaded into the dataframe) and
a _delta_log folder containing the transaction log for the table.
Note: The transaction log records all data modifications to the table. By logging each
modification, transactional consistency can be enforced and versioning information for the
table can be retained.
# To make modifications to a Delta Lake table, you can use the DeltaTable object in the Delta Lake
# API, which supports update, delete, and merge operations. For example, you could use code like the
# following (a sketch reusing the product columns from earlier) to update the price column for all
# rows with a category column value of "Accessories":
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, delta_table_path)
deltaTable.update(
    condition = "Category == 'Accessories'",
    set = { "ListPrice": "ListPrice * 0.9" })

# Time travel: read an earlier version of the table by version number...
df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
# OR by timestamp
df = spark.read.format("delta").option("timestampAsOf", '2022-01-01').load(delta_table_path)
A managed table is defined without a specified location, and the data files are stored
within the storage used by the metastore. Dropping the table not only removes its
metadata from the catalog but also deletes the folder in which its data files are stored.
An external table is defined for a custom file location, where the data for the table is
stored. The metadata for the table is defined in the Spark catalog. Dropping the table
deletes the metadata from the catalog, but doesn't affect the data files.
# We can use the DeltaTableBuilder API (part of the Delta Lake API) to create a catalog table
from delta.tables import *

DeltaTable.create(spark) \
  .tableName("default.ManagedProducts") \
  .addColumn("Productid", "INT") \
  .addColumn("ProductName", "STRING") \
  .addColumn("Category", "STRING") \
  .addColumn("ListPrice", "FLOAT") \
  .execute()
✅ Monitoring
1) Ganglia
A built-in Databricks monitoring service that collects data every 15 minutes by default. We can access it by going into the cluster and selecting Metrics from the header.
2) Azure Monitor
No native support for Databricks so setting this up is cumbersome.
Azure provides the Azure Databricks version for customers who love the features of Databricks Spark. It
provides HDInsight Spark for customers who prefer OSS technologies, and it also provides Synapse
Spark, which is a performance-boosted version of the OSS Spark for those customers who prefer an
integrated single-pane experience within Azure Synapse Analytics.
6) Synapse Studio
Synapse Studio provides a single, web-based user experience for enterprises to build, maintain, and secure their analytics solutions.
Monitor resources, usage, and users across SQL, Spark, and Data Explorer
Write SQL, Spark, or KQL code and integrate with enterprise CI/CD processes
The tables listed under the SQL pool store data from multiple sources, such as SAP
Hana, Twitter, Azure SQL Database, and external files copied over from an
orchestration pipeline. Synapse Analytics gives us the ability to combine these data
sources for analytics and reporting, all in one location.
2) Develop
Expand each of the groups under the Develop menu. The Develop hub in our sample
environment contains examples of the following artifacts:
3) Integrate
Manage integration pipelines within the Integrate hub. If you are familiar with Azure
Data Factory, then you will feel at home in this hub. The pipeline creation experience
is the same as in ADF, which gives you another powerful integration built into Synapse
Analytics, removing the need to use Azure Data Factory for data movement and
transformation pipelines.
Expand Pipelines and select Master Pipeline (1). Point out the Activities (2) that can
be added to the pipeline, and show the pipeline canvas (3) on the right.
Show each of the monitoring categories grouped under Integration and Activities.
Pipeline runs shows all pipeline run activities. You can view the run details,
including inputs and outputs for the activities, and any error messages that
occurred. You can also come here to stop a pipeline, if needed.
Trigger runs shows you all pipeline runs caused by automated triggers. You can
create triggers that run on a recurring schedule or tumbling window. You can also
create event-based triggers that execute a pipeline any time a blob is created or
deleted in a storage container.
Apache Spark applications shows all the Spark applications that are running or
have run in your workspace.
SQL requests shows all SQL scripts executed either directly by you or another
user, or executed in other ways, like from a pipeline run.
Data flow debug shows active and previous debug sessions. When you author a
data flow, you can enable the debugger and execute the data flow without needing
to add it to a pipeline and trigger an execute. Using the debugger speeds up and
simplifies the development process. Since the debugger requires an active Spark
cluster, it can take a few minutes after you enable the debugger before you can
use it.
5) Manage
Show each of the management categories grouped under Analytics pools, External
connections, Integration, and Security.
SQL pools. Lists the provisioned SQL pools and on-demand serverless SQL pools for the workspace. You can add new pools, or hover over a SQL pool to see options such as pausing or scaling it.
Apache Spark pools. Lets you manage your Spark pools by configuring the auto-
pause and auto-scale settings. You can provision a new Apache Spark pool from
this blade.
Access control. This is where you go to add and remove users to one of three
security groups: workspace admin, SQL admin, and Apache Spark for Azure
Synapse Analytics admin.
Managed private endpoints. This is where you manage private endpoints, which
use a private IP address from within a virtual network to connect to an Azure
service or your own private link service. Connections using private endpoints
listed here provide access to Synapse workspace endpoints (SQL, SqlOndemand
and Dev).
Step 2: Control node provides commands to multiple compute nodes (the number
depends on the option we selected during setup - DWU) which will work in parallel to
compute the query. The Compute nodes provide the computational power.
Distributions map to Compute nodes for processing.
Step 3: The Data Movement Service (DMS) is a system-level internal service that
moves data across the nodes as necessary to run queries in parallel and return
accurate results. The number of compute nodes ranges from 1 to 60 and is
determined by the service level for Synapse SQL.
Step 4: A key architectural component of dedicated SQL pools is the decoupled
storage that is segmented into 60 parts. The data is shared by these distributions in
the data layer to optimize the work performance. Distribution is the basic unit of
storage and parallel queries process these distributed data.
When Synapse SQL runs a query, the work is divided into 60 smaller queries that run
in parallel. Each of the 60 smaller queries runs on one of the data distributions. Each
Compute node manages one or more of the 60 distributions. Since there are 60 storage segments and a maximum of 60 MPP compute nodes within the highest-performance configuration of SQL pools, a 1:1 mapping of compute nodes to storage segments is possible for ultra-high workloads.
With decoupled storage and compute, when using a dedicated SQL pool, we
can scale each of these independently.
If you use Apache Spark for Azure Synapse in your data pipeline, for data
preparation, cleansing, or enrichment, you can query external Spark tables
you've created in the process, directly from the serverless SQL pool. (more on
this later)
Serverless SQL pool vs. dedicated SQL pool:
Serverless: only create external data tables if one needs to persist the data; charged based on the amount of data processed (as there's no underlying infrastructure).
Dedicated: charged based on DWUs (Data Warehouse Units).
For more detailed differences check this link. Also for best practices, check this.
Access to recent data (maybe only 2022 data) vs. access to older data going back years (all historical data, in order to perform analysis).
📌 Ideally, try to replace NULL values with some default values in the
dimension tables, as not doing this can give undesired results while using
reporting tools.
A Fact table can have NULL values with the exception of the key columns
which will be used for joining to the dimension table
-- First let's ensure we have the tables defined in the SQL pool
Go to Synapse Studio → Integrate → Copy Data tool → Run Once now → Create
connection (define source settings)→ Select Azure SQL Database → Fill details
→ Select the required tables for copying → Create connection (define target
settings) → Select Azure Synapse → Fill details → Option to select column
mapping (if we want to drop certain columns) → Set the staging account details in
the Settings option → In the advanced option we have option to select the copying
procedure (PolyBase, Copy Command or Bulk insert) → review and deploy the
pipeline
-- The above statement only returns each JSON document as a single string, line by line
-- Next, we can cast to separate columns
SELECT
CAST(JSON_VALUE(jsonContent,'$.Id') AS INT) AS Id,
JSON_VALUE(jsonContent,'$.Correlationid') As Correlationid,
JSON_VALUE(jsonContent,'$.Operationname') AS Operationname,
JSON_VALUE(jsonContent,'$.Status') AS Status,
JSON_VALUE(jsonContent,'$.Eventcategory') AS Eventcategory,
JSON_VALUE(jsonContent,'$.Level') AS Level,
CAST(JSON_VALUE(jsonContent,'$.Time') AS datetimeoffset) AS Time,
JSON_VALUE(jsonContent,'$.Subscription') AS Subscription,
JSON_VALUE(jsonContent,'$.Eventinitiatedby') AS Eventinitiatedby,
JSON_VALUE(jsonContent,'$.Resourcetype') AS Resourcetype,
JSON_VALUE(jsonContent,'$.Resourcegroup') AS Resourcegroup
FROM
OPENROWSET(
BULK 'https://appdatalake7000.dfs.core.windows.net/data/log.json',
FORMAT = 'CSV',
FIELDQUOTE = '0x0b',
FIELDTERMINATOR ='0x0b',
ROWTERMINATOR = '0x0a'
)
WITH (
jsonContent varchar(MAX)
) AS [rows]
Synapse dedicated pools support sharding for all these table types. They provide three different ways to shard the data, as follows:
1) Hash
2) Round-robin
3) Replicated
These methods through which a SQL dedicated pool distributes data among its tables are also called distribution techniques; the terms sharding and distribution are used interchangeably here.
1) Hash-distributed tables
(use this on a fact table with a hash column selected carefully)
Highest query performance for joins and aggregations on large tables
This works quicker if the query aggregation works on the hash column that we
defined.
When choosing the hash column, try to avoid columns having data skew as it would
lead to uneven distribution of rows across the nodes. Also, avoid selecting the date
column.
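As a rough sketch (the table and column list are illustrative, reusing the FactInternetSales name from the skew check below), a hash-distributed fact table is created by specifying the distribution in the WITH clause:

CREATE TABLE dbo.FactInternetSales
(
    ProductKey   INT NOT NULL,
    OrderDateKey INT NOT NULL,
    CustomerKey  INT NOT NULL,
    OrderQty     INT,
    SalesAmount  MONEY
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),  -- hash column chosen for joins/aggregations, not a date column
    CLUSTERED COLUMNSTORE INDEX
);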
A quick way to check for data skew is to use DBCC PDW_SHOWSPACEUSED . The following SQL
code returns the number of table rows that are stored in each of the 60 distributions.
For balanced performance, the rows in your distributed table should be spread evenly
across all the distributions.
-- Find data skew for a distributed table
DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');
2. Replicated tables
(use these for dimension tables that are smaller in size, < 2 GB)
A replicated table caches a full copy of the table on each compute node, so small dimension tables can be joined without any data movement.
3. Round-robin tables
Simplest table to create; delivers fast performance when used as a staging table for loads.
A round-robin distributed table distributes data evenly across the table but without
any further optimization. A distribution is first chosen at random and then buffers of
rows are assigned to distributions sequentially. Joins on round-robin tables require
reshuffling data, which takes additional time as it takes time to move data over from
other nodes and collate all the rows together.
Consider this option if there are no joins performed on the tables or in the case when
we don’t have a clear candidate column for the hash distributed table.
Consider using the round-robin distribution for your table in the following scenarios:
If the table does not share a common join key with other tables
Summary
(Table: Type of Distribution | Best fit for… | Do not use when…)
To balance the parallel processing, select a distribution column or set of columns that:
Has many unique values. The distribution column(s) can have duplicate values.
All rows with the same value are assigned to the same distribution. Since there
are 60 distributions, some distributions can have > 1 unique value while others
may end with zero values.
Does not have NULLs, or has only a few NULLs. For an extreme example, if all values in the distribution column(s) are NULL, all the rows are assigned to the same distribution.
Is not a date column. All data for the same date lands in the same distribution, which clusters records by date. If several users are all filtering on the same date (such as today's date), then only 1 of the 60 distributions does all the processing work.
✅ Indexing
In SQL-based systems, you might be required to access rows using values other than
the primary key. In such cases, the query engine needs to scan all the rows to find the
value we are looking for. Instead, if we can define a secondary index based on
frequently searched column values, we could avoid the complete table scans and
speed up the query. The secondary index tables are calculated separately from the
primary indexes of the table, but this is done by the same SQL engine.
A well-designed indexing strategy can reduce disk I/O operations and consume less
system resources therefore improving query performance, especially when using
filtering, scans, and joins in a query.
However, a clustered columnstore index can't be created on columns of type varchar(max), nvarchar(max), or varbinary(max). A clustered columnstore index is also not ideal for small tables (fewer than 60 million rows) or for transient data.
Code Example
Heap | Best fit for: staging/temporary tables and small tables with small lookups | Do not use when: any lookup scans the full table
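For illustration, a heap has to be requested explicitly at table creation time (clustered columnstore is the default); the table and columns below are hypothetical:

CREATE TABLE dbo.StageProduct
(
    ProductKey  INT,
    ProductName NVARCHAR(200)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP   -- suited to a small staging/temporary table that is loaded and then read in full
);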
/* If you intend to use a snowflake schema in which dimension tables are related to one another, you should include the key for the parent dimension in the definition of the child dimension table. For example, the following SQL code could be used to move the geographical address details from the DimCustomer table to a separate DimGeography dimension table: */

/* The following code example creates a hypothetical fact table named FactSales that is related to multiple dimensions through key columns (date, customer, product, and store) */

/* Staging tables are used as temporary storage for data as it's being loaded into the data warehouse. The following code example creates a staging table for product data that will ultimately be loaded into a dimension table: */

/* In some cases, if the data to be loaded is in files with an appropriate structure, it can be more effective to create external tables that reference the file location. This way, the data can be read directly from the source files instead of being loaded into the relational store. */
✅ Table Partition
Logically splitting data into smaller manageable parts based on some column value
e.g. splitting sales data by different provinces of Canada. Normally, data is
partitioned on the date column.
Code Example
Partition Switching
) )
AS
SELECT *
FROM logdata
WHERE 1=2 -- Since it's always false, this just copies the schema of logdata into logdata_new
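A minimal sketch of the switch itself, assuming logdata and logdata_new share identical schemas and partition boundaries (switching is a metadata-only operation, so it is very fast):

-- Move partition 1 out of the main table into the empty, identically partitioned table
ALTER TABLE dbo.logdata SWITCH PARTITION 1 TO dbo.logdata_new PARTITION 1;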
Over time, it is possible that certain product name changes or maybe a customer
changes phone number. This will lead to the case where we will have to change the
dimension table to reflect these changes. There are various strategies to tackle the
different cases.
• Type 1
E.g. When a customer's email address or phone number changes, the dimension
table updates the customer row with the new values.
• Type 2 (important):
📌 NOTE: Surrogate keys are secondary row identification keys. They are
added in all SCD2 cases because the primary identification key will not
be unique anymore with newly added rows.
/* Logic to implement Type 1 and Type 2 updates can be complex, and there are various techniques you can use. For example, you could use a combination of UPDATE and INSERT statements as shown in the following code example: */

/* As an alternative to using multiple INSERT and UPDATE statements, you can use a single MERGE statement to perform an "upsert" operation to insert new records and update existing ones, as shown in the following example, which loads new product records and applies Type 1 updates to existing products: */
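The MERGE itself isn't shown above; a minimal sketch with hypothetical table and column names follows (note that MERGE in dedicated SQL pools has some restrictions, for example on the target table's distribution type):

MERGE dbo.DimProduct AS tgt
USING dbo.StageProduct AS src
    ON tgt.ProductAltKey = src.ProductKey      -- match on the business (natural) key
WHEN MATCHED THEN
    UPDATE SET tgt.ProductName = src.ProductName,
               tgt.ListPrice   = src.ListPrice
WHEN NOT MATCHED THEN
    INSERT (ProductAltKey, ProductName, ListPrice)
    VALUES (src.ProductKey, src.ProductName, src.ListPrice);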
/* Another way to load a combination of new and updated data into a dimension table is to use a CREATE TABLE AS (CTAS) statement to create a new table that contains the existing rows from the dimension table and the new and updated records from the staging table. After creating the new table, you can delete or rename the current dimension table, and rename the new table to replace it. */
• Type 3:
Here instead of having multiple rows to signify changes, we have multiple columns.
We do have an effective/modified date column to show when the change took
place.
When using the windowing function with SQL pools, we will use the OVER clause.
This clause determines the partitioning and ordering of a rowset before the window
function is applied.
Code Example
SELECT
ROW_NUMBER() OVER(
PARTITION BY [ProductID]
ORDER BY [OrderQty]) AS "Row Number"
,[ProductID]
,[CustomerID]
,[OrderQty]
,[UnitPrice]
FROM [dbo].[SalesFact]
ORDER BY [ProductID]
SELECT
ROW_NUMBER() OVER(
PARTITION BY [ProductID]
ORDER BY [OrderQty]) AS "Row Number"
,[ProductID]
,[CustomerID]
,[OrderQty]
,SUM([OrderQty]) OVER(
PARTITION BY [ProductID]) AS TotalOrderQty
,[UnitPrice]
FROM [dbo].[SalesFact]
ORDER BY [ProductID]
✅ Surrogate Keys
Generally, data for dimension tables can come from multiple sources and if the
primary key column for these tables is the same then we won’t have a way to
distinguish between the different rows. Thus we would want to have surrogate keys
Code Example
-- First let's ensure we have the tables defined in the SQL pool
-- Let's do this for one dimension table
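The code itself isn't shown above; a minimal sketch for a hypothetical DimCustomer table typically uses an IDENTITY column as the surrogate key alongside the source system's business key:

CREATE TABLE dbo.DimCustomer
(
    CustomerKey    INT IDENTITY(1,1) NOT NULL,  -- surrogate key generated by the SQL pool
    CustomerAltKey NVARCHAR(25)      NOT NULL,  -- business key from the source system
    CustomerName   NVARCHAR(100),
    EmailAddress   NVARCHAR(120)
)
WITH
(
    DISTRIBUTION = REPLICATE  -- small dimension table
);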
Other considerations
• Masked columns can still be updated if the user has write permission.
• Exporting data from a source with masked columns results in masked data in the target table.
Random: a random masking function for use on any numeric type, masking the original value with a random value within a specified range.
Example definition syntax: Account_Number bigint MASKED WITH (FUNCTION = 'random([start range], [end range])')
Code Example
-- Azure Example
--DROP TABLE TestDDM
Create table TestDDM
(ID Int,
PersonName varchar (100),
EmailAddress varchar(120),
CreditCardNumber varchar(19),
SocialSecurityNumber varchar(11)
)
/* After this we can go into Azure -> Security -> Dynamic Data masking where we can
provide all the functions. Similar task can be done from SQL query as well */
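As an alternative to the portal, the masks can also be declared directly in the table definition; this is a sketch using the same columns, with mask functions chosen only for illustration:

CREATE TABLE TestDDM_Masked
(
    ID                   INT,
    PersonName           VARCHAR(100) MASKED WITH (FUNCTION = 'partial(1, "xxxx", 1)'),
    EmailAddress         VARCHAR(120) MASKED WITH (FUNCTION = 'email()'),
    CreditCardNumber     VARCHAR(19)  MASKED WITH (FUNCTION = 'partial(0, "XXXX-XXXX-XXXX-", 4)'),
    SocialSecurityNumber VARCHAR(11)  MASKED WITH (FUNCTION = 'default()')
);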
Dedicated SQL pool workload management in Azure Synapse consists of three high-
level concepts:
Workload Classification
Workload Importance
Workload importance influences the order in which a request gets access to
resources. On a busy system, a request with higher importance has first access to
resources. There are five levels of importance: low, below_normal, normal,
above_normal, and high. Requests that don't set importance are assigned the
default level of normal.
Workload Isolation
/* We can create multiple workload groups in order to provision compute resources so that, for example, a user loading data doesn't use the full resource capacity while another user is already performing an analysis job. */
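A minimal sketch of such a setup (the group, classifier, and user names are hypothetical, and the percentages are only examples):

-- Reserve resources for data-loading requests so they can't starve analysis queries
CREATE WORKLOAD GROUP DataLoads
WITH
(
    MIN_PERCENTAGE_RESOURCE = 30,
    CAP_PERCENTAGE_RESOURCE = 60,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 10
);

-- Route requests from the loading user into that group, with higher importance
CREATE WORKLOAD CLASSIFIER LoadClassifier
WITH
(
    WORKLOAD_GROUP = 'DataLoads',
    MEMBERNAME = 'user_load',
    IMPORTANCE = ABOVE_NORMAL
);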
✅ Materialized Views
Views are logical projections of data from multiple tables. A standard view computes its data each time the view is used; there's no data stored on disk. A materialized view pre-computes, stores, and maintains its data in the dedicated SQL pool just like a table.
Comparison (standard view vs. materialized view):
View definition: stored in the Azure data warehouse for both.
Speed to retrieve view data from complex queries: slow for a standard view, fast for a materialized view.
Materialized views result in increased performance since the data within the view can be fetched without having to resolve the underlying query against the base tables. You can also filter and join a materialized view in other queries as if it were a table. In addition, you can define a distribution for the materialized view that is different from that of the table on which it is based. As the data in the underlying base tables changes, the data in the materialized view is automatically updated without user interaction.
There are several restrictions that you must be aware of before defining a materialized
view:
The SELECT list in the materialized view definition needs to meet at least one of
these two criteria:
<distribution_option> ::=
{
DISTRIBUTION = HASH ( distribution_column_name )
| DISTRIBUTION = HASH ( [distribution_column_name [, ...n]] )
| DISTRIBUTION = ROUND_ROBIN
}
<select_statement> ::=
SELECT select_criteria
--Example
-- When MIN/MAX aggregates are used in the SELECT list of a materialized view definition, FOR_APPEND is required
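The example itself isn't shown above; a minimal sketch against the SalesFact table used earlier follows (distribution and aggregates chosen for illustration; no MIN/MAX is used, so FOR_APPEND isn't needed):

CREATE MATERIALIZED VIEW dbo.mvSalesByProduct
WITH (DISTRIBUTION = HASH(ProductID))
AS
SELECT ProductID,
       SUM(OrderQty) AS TotalOrderQty,
       COUNT_BIG(*)  AS RowCnt
FROM dbo.SalesFact
GROUP BY ProductID;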
Row-level security (RLS) enables you to use group membership or execution context to control access not just to columns in a database table, but to individual rows. RLS is implemented by using the CREATE SECURITY POLICY statement. Read more here.
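A minimal sketch of the pattern (schema, table, and predicate logic are hypothetical): an inline table-valued function acts as the filter predicate, and the security policy binds it to the table:

CREATE SCHEMA Security;
GO

-- Returns a row (i.e. allows access) only when the row's SalesRep matches the current user
CREATE FUNCTION Security.fn_securitypredicate(@SalesRep AS NVARCHAR(50))
    RETURNS TABLE
WITH SCHEMABINDING
AS
    RETURN SELECT 1 AS fn_securitypredicate_result
    WHERE @SalesRep = USER_NAME() OR USER_NAME() = 'Manager';
GO

CREATE SECURITY POLICY SalesFilter
ADD FILTER PREDICATE Security.fn_securitypredicate(SalesRep)
ON dbo.Orders
WITH (STATE = ON);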
The query optimizer tracks cardinality and range density (statistics) to determine which data access paths return the fewest rows for speed.
For example, if the optimizer estimates that the date your query is filtering on will
return one row, it will choose one plan. If it estimates that the selected date will return
1 million rows, it will return a different plan.
If you want to learn more about this, click here.
✅ Spark pools
Spark pool clusters are groups of computers that are treated as a single computer and
handle the execution of commands issued from notebooks. The clusters allow the
processing of data to be parallelized across many computers to improve scale and
performance. It consists of a Spark Driver and Worker nodes. Spark pools in Azure
Synapse can use Azure Data Lake Storage Generation 2 as well as BLOB
storage.
The primary use case for Apache Spark for Azure Synapse Analytics is to process big
data workloads that cannot be handled by Azure Synapse SQL, and where you don’t
have an existing Apache Spark implementation.
There are two ways within Synapse to use Spark:
Spark Notebooks for doing Data Science and Engineering use Scala, PySpark,
C#, and SparkSQL
Spark job definitions for running batch Spark jobs using jar files.
Indexing
In Azure, we have technologies that can perform indexing on huge volumes of data.
These indexes can then be used by analytical engines such as Spark to speed up the
queries. One such technology that Azure offers is called Hyperspace.
Hyperspace lets us create indexes on input datasets such as Parquet, CSV, and so
on, which can be used for query optimization. The Hyperspace indexing needs to be
run separately to create an initial index. After that, it can be incrementally updated for
the new data. Once we have the Hyperspace index, any Spark query can leverage the
index, similar to how we use indexes in SQL.
✅ Delta lake
Delta Lake is an open-source storage layer for Spark that enables relational
database capabilities for batch and streaming data. By using Delta Lake, you can
implement a data lakehouse architecture in Spark to support SQL_based data
Relational tables that support querying and data modification. With Delta
Lake, you can store data in tables that support CRUD (create, read, update, and
delete) operations. In other words, you can select, insert, update, and delete rows
of data in the same way you would in a relational database system.
Data versioning and time travel. Because all transactions are logged in the
transaction log, you can track multiple versions of each table row, and even use
the time travel feature to retrieve a previous version of a row in a query.
Support for batch and streaming data. While most relational databases include
tables that store static data, Spark includes native support for streaming data
through the Spark Structured Streaming API. Delta Lake tables can be used as
both sinks (destinations) and sources for streaming data.
The JDBC API opens the connection, applies filters and projections, and Apache Spark reads the data serially. Given that two distributed systems such as Apache Spark and Synapse SQL pools are involved, reading the data serially through JDBC becomes a bottleneck.
Therefore, a new approach is to use both JDBC and PolyBase. First, the JDBC opens
a connection, issues Create External Tables As Select (CETAS) statements, and
sends filters and projections. The filters and projections are then applied to the data
warehouse and exported in parallel using PolyBase. Apache Spark reads the data in
parallel based on the user-provisioned workspace and the default data lake storage.
As a result, you can use the Azure Synapse Apache Spark Pool to Synapse SQL
connector to transfer data between a Data Lake store via Apache Spark and
dedicated SQL Pools efficiently.
When you deploy an Azure Synapse Apache Spark cluster, the Azure Data Lake
Gen2 capability enables you to store Apache Spark SQL Tables within it. If you
use Apache Spark SQL tables, these tables can be queried from a SQL-based
Transact-SQL language without needing to use commands like CREATE
EXTERNAL TABLE. Within Azure Synapse Analytics, these queries integrate
natively with data files that are stored in an Apache Parquet format.
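For example, a table created in a Spark notebook can typically be read from the serverless SQL pool with plain T-SQL because the metadata is shared; the database and table names here are hypothetical:

-- The Spark database appears in the serverless SQL pool under the same name
SELECT TOP 10 *
FROM sparkdb.dbo.products;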
The integration can be helpful in use cases where you perform an ETL process predominantly using SQL but need to call on the computation power of Apache Spark for part of the processing.
Authentication
The authentication between the two systems is made seamless in Azure Synapse
Analytics. The Token Service connects with Azure Active Directory to obtain the
security tokens to be used when accessing the storage account or the data
warehouse in the dedicated SQL pool.
For this reason, there's no need to create credentials or specify them in the connector
API if Azure AD-Auth is configured at the storage account and the dedicated SQL
pool. If not, SQL Authentication can be specified. The only constraint is that this
connector only works in Scala.
Read more
Min number of nodes.
Max number of nodes.
The initial number of nodes will be the minimum. This value defines the initial size of
the instance when it's created. The minimum number of nodes can't be fewer than
three.
You can also modify this in the Azure portal by clicking the auto-scale settings icon.
✅ SYNAPSE LINK
Hybrid Transactional and Analytical Processing enables businesses to perform
analytics over a database system that is seen to provide transactional capabilities
without impacting the performance of the system. This enables organizations to use a
database to fulfill both transactional and analytical needs to support near real-time
analysis of operational data to make decisions about the information that is being
analyzed.
Now, a dataset is a named view of data that simply points or references the data you
want to use in your activities as inputs and outputs. Before you create a dataset, you
must create a linked service to link your data store to the Data Factory or Synapse
Workspace. Linked services are like connection strings, which define the connection
information needed for the service to connect to external resources. Think of it this
way; the dataset represents the structure of the data within the linked data stores, and
the linked service defines the connection to the data source. For example, to copy
data from Blob storage to a SQL Database, you create two linked services: Azure
Storage and Azure SQL Database. Then, create two datasets: an Azure Blob dataset
(which refers to the Azure Storage linked service) and an Azure SQL Table dataset
(which refers to the Azure SQL Database linked service).
Finally, The Integration Runtime (IR) provides the compute infrastructure for
completing a pipeline. We have the same three types of IR: Azure, Self-hosted, and
Azure-SSIS.
Feature comparison (Azure Data Factory vs. Azure Synapse Analytics pipelines):
Integration Runtime
• Using SSIS and SSIS Integration Runtime: ADF ✓ | Synapse ✓ (public preview)
• Support for cross-region Integration Runtime (Data Flows): ADF ✓ | Synapse ✗
• Integration Runtime sharing: ADF ✓ (can be shared across different data factories) | Synapse ✗
Pipelines and Activities
• SSIS Package Activity: ADF ✓ | Synapse ✓ (public preview)
• Support for Power Query Activity (Wrangling Data Flow): ADF ✓ | Synapse ✗
GIT Repository Integration
• GIT Integration: ADF ✓ | Synapse ✓
Monitoring
• Monitoring of Spark jobs for Data Flow: ADF ✗ | Synapse ✓ (leverages the Synapse Spark pools)
✅ Data Flows
Since it has already been covered in ADF, please refer there using this link.
✅ Loading Methods
Analytical systems are constantly balanced between loading and querying workloads.
One of the main design goals in loading data is to manage or minimize the impact on
analytical workloads while loading the data with the highest throughput possible.
If you are using PolyBase, you need to define external tables in your
dedicated SQL pool before loading. PolyBase uses external tables to define
and access the data in Azure Storage. An external table is similar to a
database view. The external table contains the table schema and points to
data that is stored outside the dedicated SQL pool.
PolyBase can't load rows that have more than 1MB of data. When you put
data into the text files in Azure Blob storage or Azure Data Lake Store, they
must have fewer than 1,000,000 bytes of data. This byte limitation is true
regardless of the table schema.
Code Examples
/* 1. To access your Data Lake Storage account, you will need to create a Database Master Key to encrypt your credential secret. You then create a Database Scoped Credential to store your secret. The Master Key is required to encrypt the credential secret (Shared Access Signature) in the next step. */

/* IMP NOTE
External tables are strongly typed.
This means that each row of the data being ingested must satisfy the table schema definition. If a row does not match the schema definition, the row is rejected from the load.
*/
------------------------------------------------------------------
/* Common errors
1. External table 'logdata' is not accessible because the location does not exist or it is used by another process. Here, your Shared Access Signature is the issue. */

-- In the SQL pool, we can use Hadoop drivers to specify the source
WITH (
    LOCATION = '/parquet/*.parquet', -- select all the parquet files from the folder
/*
A common error can come when trying to select the data; here you can get various errors such as MalformedInput.
You need to ensure the column names map correctly and the data types are correct as per the Parquet file definition (data types are embedded).
*/
A mistake that many people make when first exploring dedicated SQL pools is to use the service administrator account for loading data. Instead, it's better to create specific accounts assigned to different resource classes depending on the anticipated task. This optimizes load performance and maintains concurrency as required by managing the resource slots available within the dedicated SQL pool.
-- Never use the admin account for load operations (keep it only for monitoring and admi
n purposes)
-- Create a separate user for load operations
-- This has to be run in the master database as we are adding a login and user
CREATE LOGIN user_load WITH PASSWORD = 'Azure@123';
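The remaining steps aren't shown above; a minimal sketch of what typically follows (run in the dedicated SQL pool database; the resource class is chosen only for illustration):

-- Create a user for the login and assign it a bigger static resource class for loads
CREATE USER user_load FOR LOGIN user_load;
EXEC sp_addrolemember 'staticrc20', 'user_load';
-- Grant the user the table/schema permissions it needs for the load as appropriate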
3. Multiple options like Blob storage, Cosmos DB, Data Lake Gen2 etc are
provided. Select the appropriate one, here is Gen2
5. In order to fetch the data from the external table, we need to provide some
additional access to services which was not required when viewing the data as
admin in the Gen2 storage
6. Go to Access Control (IAM) inside the Gen2 storage account, and add a role assignment with the Storage Blob Data Contributor role
7. Coming back to Synapse Studio, we will be able to see both dedicated SQL
pool data as well as external data
8. Now we can view all data and run SQL queries for any activity.
9. Since we want to load data, we can right-click on any file where we want to
append/ copy the data and click on New SQL script→Bulk Load
10. Fill in the required configuration settings and continuing forward, we will have a
SQL query auto-generated that can be used for copying the data
1. An external data source that points to the abfss path in ADLS Gen2 where the Parquet files are located
2. An external file format for Parquet files
3. An external table that defines the schema for the files, as well as the location, data source, and file format
--Step 1
CREATE EXTERNAL DATA SOURCE log_data
WITH ( LOCATION = 'abfss://[email protected]',
CREDENTIAL = AzureStorageCredential,
TYPE = HADOOP
)
--Step 2
CREATE EXTERNAL FILE FORMAT parquetfile
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
-- Now create a normal table by selecting all of the data from the external table
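The remaining statements aren't shown above; a rough sketch of step 3 and the final load follows (the column list is illustrative only, and it must match the Parquet file definition):

--Step 3: external table over the Parquet files
CREATE EXTERNAL TABLE [logdata_external]
(
    [Id]            INT,
    [Operationname] VARCHAR(200),
    [Level]         VARCHAR(50),
    [Time]          DATETIMEOFFSET
)
WITH
(
    LOCATION    = '/parquet/',
    DATA_SOURCE = log_data,
    FILE_FORMAT = parquetfile
);

-- Load into an internal (distributed) table with CTAS
CREATE TABLE [logdata]
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT * FROM [logdata_external];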
1) Input
Dynamic schema handling is a powerful feature, and key to stream processing. Data
streams often contain data from multiple sources, with multiple event types, each with
a unique schema. To route, filter, and process events on such streams, ASA has to
ingest them all, whatever their schema.
But the capabilities offered by dynamic schema handling come with a potential
downside: unexpected events can flow through the main query logic and break it. As
an example, consider using ROUND on a field of type NVARCHAR(MAX). ASA will implicitly
cast it to float to match the signature of ROUND. Here we expect, or hope, this field
will always contain numeric values. But when we receive an event with the field set
to "NaN", or if the field is entirely missing, then the job may fail.
2) Output
A job can have one or more outputs to continuously write data to.
When you design your Stream Analytics query, refer to the name of the output by using
the INTO clause. You can use a single output per job, or multiple outputs per streaming
job (if you need them) by adding multiple INTO clauses to the query.
WITH Step1 AS (
SELECT *
FROM input
PARTITION BY DeviceId
INTO 10
)
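As a sketch of the multiple-INTO pattern mentioned above (the output and field names are illustrative assumptions), a single job can fan the same input out to several sinks by repeating SELECT ... INTO ... FROM:
SELECT deviceId, reading
INTO alertsOutput          -- e.g., a Power BI or Service Bus sink
FROM input
WHERE reading > 100

SELECT *
INTO archiveOutput         -- e.g., an ADLS Gen2 sink for the raw stream
FROM input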
3) Query
The rich SQL like language support allows you to tackle scenarios such as parsing
complex JSON, filtering values, computing aggregates, performing joins, and even more
advanced use cases such as geospatial analytics and anomaly detection. We can also
extend this SQL language with JavaScript or C# user-defined functions (UDF) and
JavaScript user-defined-aggregates (UDA).
To achieve low latency stream processing, Azure Stream Analytics jobs perform all
processing in-memory. When running out of memory, the streaming job fails. The SU %
utilization metric describes the memory consumption of your workload.
One of the unique capabilities of an Azure Stream Analytics job is performing stateful
processing, such as windowed aggregates, temporal joins, and temporal analytic
functions. Each of these operators keeps state information.
The temporal window concept appears in several Stream Analytics query elements. The
following factors influence the memory used (part of the streaming units metric):
1. Windowed aggregates: the GROUP BY of a temporal window, for example:
SELECT count(*)
FROM input PARTITION BY PartitionId
GROUP BY PartitionId, clusterid, TumblingWindow (minutes, 5)
2. Temporal joins: a JOIN with the DATEDIFF function.
3. Temporal analytic functions: ISFIRST, TOPONE, LAST, and LAG with LIMIT DURATION
✅ Windowing Functions
Windowing functions are operations performed against the data
contained within a temporal or time-boxed window. A window
contains event data along a timeline. Using windowing provides
a way to aggregate events over various time intervals
depending on specific window definitions.
Stream Analytics has native support for windowing functions. There are five kinds of
temporal windows to choose from: Tumbling, Hopping, Sliding, Session, and
Snapshot windows. You use the window functions in the GROUP BY clause of the query
syntax in your Stream Analytics jobs. You can also aggregate events over multiple
windows using the Windows() function.
1) Tumbling window
Tumbling window functions are used to
segment a data stream into distinct
time segments and perform a function
against them, such as in the example
below. The key differentiators of Tumbling windows are that they repeat, do not
overlap, and an event cannot belong to more than one tumbling window.
/*The query averages the engine temperature and speed over a two-second duration by
adding TumblingWindow(Duration(second, 2)) to the query's GROUP BY clause. Then it selects
all telemetry data, including the average values from the previous step, and specifies
the anomalies as new fields.
The query outputs all fields from the anomalies step into the powerBIAlerts output where
aggressivedriving = 1 or enginetempanomaly = 1 or oilanomaly = 1 for reporting. The query
also aggregates the average engine temperature and speed of all vehicles over the
past two minutes, using TumblingWindow(Duration(minute, 2)), and outputs these fields
to the synapse output.*/
WITH Averages AS (
SELECT
AVG(engineTemperature) averageEngineTemperature,
AVG(speed) averageSpeed
FROM
eventhub TIMESTAMP BY [timestamp]
GROUP BY
TumblingWindow(Duration(second, 2))
),
Anomalies AS (
select
t.vin,
t.[timestamp],
t.engineTemperature,
a.averageEngineTemperature,
a.averageSpeed,
t.engineoil,
t.accelerator_pedal_position,
t.brake_pedal_status,
t.transmission_gear_position,
(CASE WHEN a.averageEngineTemperature >= 405 OR a.averageEngineTemperature <= 15
    THEN 1 ELSE 0 END) AS enginetempanomaly,
(CASE WHEN t.engineoil <= 1 THEN 1 ELSE 0 END) AS oilanomaly,
(CASE WHEN (t.transmission_gear_position = 'first' OR
t.transmission_gear_position = 'second' OR
t.transmission_gear_position = 'third') AND
t.brake_pedal_status = 1 AND
t.accelerator_pedal_position >= 90 AND
a.averageSpeed >= 55 THEN 1 ELSE 0 END) AS aggressivedriving
FROM eventhub t TIMESTAMP BY [timestamp]
INNER JOIN Averages a ON DATEDIFF(second, t, a) BETWEEN 0 And 2
),
VehicleAverages AS (
2) Hopping window
Hopping window functions hop forward in
time by a fixed period. It may be easy to
think of them as Tumbling windows that
can overlap. Events can belong to more
than one Hopping window result set.
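For example (a sketch with assumed input and field names), count events per topic every 5 seconds, looking back over the last 10 seconds:
SELECT Topic, COUNT(*) AS TotalCount
FROM input TIMESTAMP BY CreatedAt
GROUP BY Topic, HoppingWindow(second, 10, 5)   -- 10-second window, hopping forward every 5 seconds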
3) Sliding window
Sliding windows, unlike Tumbling or Hopping windows, output events only for points
in time when the content of the window actually changes, in other words, when an
event enters or exits the window.
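A short sketch under assumed names (input, Topic, CreatedAt are illustrative): raise a result whenever a topic appears more than three times within the last 10 seconds.
SELECT Topic, COUNT(*) AS MentionCount
FROM input TIMESTAMP BY CreatedAt
GROUP BY Topic, SlidingWindow(second, 10)
HAVING COUNT(*) > 3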
4) Session window
Session windows cluster together events that arrive at similar times, filtering out
periods of time where there is no data.
The following query measures user session length by creating a SessionWindow over
clickstream data with a timeout size of 5 seconds and a maximum duration size of 10 seconds.
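A sketch of such a query (username and ClickTime are assumptions about the clickstream schema):
SELECT
    username,
    MIN(ClickTime) AS WindowStart,
    MAX(ClickTime) AS WindowEnd,
    DATEDIFF(second, MIN(ClickTime), MAX(ClickTime)) AS DurationInSeconds
FROM input TIMESTAMP BY ClickTime
GROUP BY username, SessionWindow(second, 5, 10)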
A session window begins when the first event occurs. If another event occurs within the
specified timeout from the last ingested event, then the window extends to include the
new event. Otherwise, if no events occur within the timeout, then the window is closed at
the timeout.
If events keep occurring within the specified timeout, the session window will keep
extending until the maximum duration is reached. The maximum duration checking
intervals are set to be the same size as the specified max duration. For example, if the
max duration is 10, then the checks on whether the window exceeds the maximum duration
will happen at t = 0, 10, 20, 30, etc.
When a partition key is provided, the events are grouped together by the key and the
session window is applied to each group independently. This partitioning is useful for
cases where you need different session windows for different users or devices.
-- Output the count of events that occur within 2 minutes of each other
-- with a maximum duration of 60 minutes.
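-- A sketch matching the comment above (EventTime is an assumed column name):
SELECT COUNT(*) AS EventCount
FROM input TIMESTAMP BY EventTime
GROUP BY SessionWindow(minute, 2, 60)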
5) Snapshot window
Snapshot windows group events that have the same timestamp. Unlike other windowing
types, which require a specific window function, you can apply a snapshot window by
adding System.Timestamp() to the GROUP BY clause.
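For example (assumed names again), counting events that share the exact same timestamp:
SELECT Topic, COUNT(*) AS TotalCount
FROM input TIMESTAMP BY CreatedAt
GROUP BY Topic, System.Timestamp()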
SU% Utilization
Percentage of memory that your job utilizes. If this metric is consistently over 80 percent,
the watermark delay is rising, and the number of backlogged events is rising, consider
increasing streaming units (SUs) and/or scaling out with query parallelization.
Runtime Error
The total number of errors related to query processing. Examine the activity or resource
logs and make appropriate changes to the inputs, query, or outputs.
Modern stream processing systems differentiate between event time also referred to as
application time, and arrival time. EVENT TIME is the time generated by the producer of the
event and typically contained in the event data as one of the columns. ARRIVAL TIME is the
time when the event was received by the event ingestion layer, for example, when the
event reaches Event Hubs.
Most applications prefer to use event time as it excludes possible delays associated with
transferring and processing of events. In Stream Analytics, you can use the TIMESTAMP BY
clause to specify which column of the event payload should be treated as the event time.
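A minimal sketch (input and column names are illustrative):
SELECT deviceId, reading
FROM eventhubInput TIMESTAMP BY eventCreatedAt   -- use the producer-supplied timestamp as event time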
Event Hubs is one of three types of message brokers available on Azure. Message
brokers act as intermediaries between event producers, such as mobile phone apps, and
event consumers, like dashboards or data processing pipelines.
Live data processing should be able to ingest high volumes of data, process that data
with sufficient processing power, and generate output in real time that is stored in
high-bandwidth storage.
An entity that reads data from an event hub is called a consumer, or a subscriber.
Each consumer group can independently seek and read data from each partition at its
own pace.
Event publishers are any app or device that can send out events using either HTTPS,
Advanced Message Queuing Protocol (AMQP) 1.0, or Apache Kafka.
Temporal Decoupling
Load Balancing
Event Hubs is better able to handle sudden influxes of traffic than a directly coupled
consumer that needs to spend time processing each message. Because consumers pull data
at their own rate, they avoid being overloaded at any given moment and can process any
backlog during periods of lower traffic.
Partition
A partition is an ordered sequence of events held in an event hub. Partitions can
be used to divide or prioritize work and ensure that certain types of data are physically
stored together for ease of processing and backup.
Auto-Inflate automatically scales the number of Throughput Units assigned to your
Standard Tier Event Hubs Namespace when your traffic exceeds the capacity of the
Throughput Units assigned to it. You can specify a limit to which the Namespace will
automatically scale.
Checkpointing
It is a process by which readers mark or commit their position within a partition event
sequence. Checkpointing is the responsibility of the consumer and occurs on a per-
partition basis within a consumer group.
📌 Pull Model
2. The second step is to create an event hub in that namespace. The following
parameters are required to create an event hub:
Event hub name - Event hub name that is unique within your subscription
Message retention - The number of days (1 to 7 for the standard tier) that
messages will remain available if the data stream needs to be replayed for any
reason. If not defined, this value defaults to 7. For Event
Hubs Premium and Dedicated, the maximum retention period is 90 days.
Incoming/Outgoing Bytes: The number of bytes sent to/received from the Event
Hubs service over a specified period.
Logs
Logs are events that occurred within
the system. They can contain
different kinds of data and may be
structured or free-form text with a
timestamp.
Traces
Changes
Changes are a series of events that occur in your Azure application and resources.