Summary of Data Engineering Profile and Technologies

Cosmos DB
The five APIs are very important; you will see a few questions based on these.
 You will get a scenario, and you will have to choose the best API for that given scenario.
 If you find the word “graph” in a question and the Gremlin API is one of the options, that is the answer.
 If a SQL-like query is given in the question, the SQL API is the answer.


Partition is the next most important topic; you will definitely see a few questions based on this.
 You will get a scenario and will have to choose the right partition key.
 That’s why I have explained partitioning in great detail; please make sure you understand it well.
Consistency level – You will see at least a few questions on this topic.
 If you see “most recent committed version” in a question – Strong consistency is the answer.
 If you see “Lowest Latency” in a question – Eventual consistency is the answer.
Cosmos CLI is also important; you may see 1 or 2 questions.
 You will see a half-filled query to create a Cosmos DB account, and you will have to choose the right option from a drop-down list to complete this query.
Data Lake
 This is not a very important topic for the exam; you may see 1 question.
 If you see “hierarchical namespace” in a question, and one of the options is Data Lake, that is probably the right answer.
Blob Storage
 Always use a general-purpose v2 account.
- This account type incorporates all general-purpose v1 features, including blob storage.
- It delivers the lowest per-gigabyte capacity prices for Azure storage.


Block Blob Storage
 This is a specialized account type used to store block blobs and append blobs.
 Low latency.
 Higher transaction rates.
 Block blob storage accounts only support premium tiers, which are more expensive than general-purpose v2 account types.
Azure Data Lake
 Azure Data Lake is a big data storage solution that allows you to store data of
any type and size.

 Repository for Big Data analytics workloads.

 Azure Data Lake Storage Gen2 has the capabilities of both Azure Blob Storage
and Azure Data Lake Storage Gen1

 Supports hierarchical namespaces.


Cosmos Table API
 Products with a varying number of attributes can be saved.
 Allows you to use OData and Language Integrated Query (LINQ) to query data.
 You can issue LINQ queries with .NET to query data.
 Does not support SQL-like queries from web applications.
 Further study
Cosmos SQL API
 Allows you to use SQL to query data as JavaScript Object Notation (JSON)
documents.

 You can use .NET to query Cosmos DB data that uses the SQL API.

 Supports a schema-less data store.
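As a quick illustration, this is the kind of SQL-like query the SQL API accepts over JSON documents; the container alias c and the properties category, productName, and price are hypothetical, not from the course:

-- Hypothetical query against a Cosmos DB container via the SQL API
SELECT c.id, c.productName, c.price
FROM c
WHERE c.category = 'books' AND c.price < 20
ORDER BY c.price ASC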


Mongo API
 This API does not allow you to use SQL-like queries to access and filter data.
Graph API
 This API does not support SQL-like queries.
 This API uses the Gremlin Graph Traversal Language to query data from a graph
database.
Table Storage
 Azure Table storage uses NoSQL, which allows you to store keys and attributes
in a schema-less data store.

 This is similar to Cosmos DB with the Table API.

 Each entity (row) can store a varying number of attributes (fields). This allows
different vendors to upload products with varying attributes.

 You can also use .NET to query the data.

 Further study
Cosmos DB Partition key
 A good partition key will distribute all the documents evenly across logical partitions.

 Further study:

Cosmos DB CLI

 Use GlobalDocumentDB to provision a Cosmos DB with the SQL API


Cosmos DB consistency level
 Strong: This level is guaranteed to return the most recent committed version of a record.

 Eventual:

o Lowest latency.

o No guarantee that read operations use the latest committed write.

 Session: the same user is guaranteed to read the same value within a single session.

o Even before replication occurs, the user that writes the data can read the same value.

o Users at the same location are not necessarily in the same session.
Automatic failover: This is used to automatically fail over Cosmos DB in disaster recovery scenarios.
Shared Access Signature
 SAS delegates access to blob containers in a storage account with granular
control over how the client accesses your data.

 You can define a SAS token to allow access only to a specific blob container with
a defined expiration.

 Further study:
Shared Key authorization (Access Keys)
 Gives full administrative access to storage accounts, sometimes more access
than necessary.

 Shared keys can be regenerated.

 They do not expire automatically

 Further study:
Azure Managed Disks - These are virtual hard disks intended to be used as part of Virtual Machine (VM) related storage.

The Data Factory integration runtime is an important topic; you should know the difference between the different runtimes and in which scenario each runtime should be used.

The self-hosted runtime is the most important.


Integration Runtime
The integration runtime is the execution environment that provides the compute
infrastructure for Data Factory.

Self-hosted runtime
When you use the Copy activity to copy data between Azure and a private network, you
must use the self-hosted integration runtime.
Azure integration runtime.
This is required when you need to copy data between Azure and public cloud services.
Azure-SSIS integration runtime.
This is required when you want to run existing SSIS packages natively.
Linked Service
A linked service stores the connection information from the source dataset, like user
credentials, server address and database name.
Linked service will be used by the dataset.
Activity
An activity is the task that is executed, like copying data or performing a lookup.
Activities use datasets to read or write data as the result of a pipeline.
Pipeline
A pipeline is a group of activities linked together to form a data pipeline.
Triggers
A tumbling window trigger can define the starting time in the WindowStart setting and the ending time in the WindowEnd setting, defining a time frame to run the data pipeline.

Manual trigger – allows you to manually start pipelines.

Schedule trigger – schedules execution of a pipeline.


Databricks
This is an Apache Spark-based technology that allows you to run code in notebooks.

Code can be written in SQL, Python, Scala, and R.

Notebooks can automatically render results as pie charts and bar charts when you run a cell.
You can override the default language by specifying the language magic command %<language> at the beginning of a cell.
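For example (a minimal sketch; the table name sales_data is hypothetical), a cell in a Python notebook can be switched to SQL like this:

%sql
-- Runs this cell as SQL even though the notebook's default language is Python
SELECT category, COUNT(*) AS order_count
FROM sales_data
GROUP BY category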

Windowing functions dominate this section of the exam.

Also, you should understand the difference between Event Hubs and IoT Hub.

One question may be about reference data, which can be stored in SQL Database or Blob storage.
Stream Analytics is a big data analytics solution that allows you to analyze real-time events simultaneously.
The input data source can be an event hub, an IoT hub, or blob storage; SQL Database can be used for reference input.

The reference input is data that never or rarely changes. Reference data can be saved in Azure SQL Database or Blob storage.
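A minimal Stream Analytics query sketch joining a streaming input to a reference input; the input names (SensorInput, DeviceReference) and columns are assumptions, not from the course:

SELECT s.DeviceId, s.Temperature, r.DeviceName
FROM SensorInput s TIMESTAMP BY EventTime
JOIN DeviceReference r     -- reference input stored in SQL Database or Blob storage
  ON s.DeviceId = r.DeviceId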
Event Hubs
Event Hub is an Azure resource that allows you to stream big data to the cloud.
Event Hub accepts streaming telemetry data from other sources. It is basically a big
data pipeline. It allows you to capture, retain, and replay telemetry data

It accepts streaming data over HTTPS and AMQP.

A Stream Analytics job can read data from Event Hubs and store the transformed data
in a variety of output data sources, including Power BI.
IOT Hub
IoT Hub is an Azure resource that allows you to stream big data to the cloud.

It supports per-device provisioning.

It accepts streaming data over HTTPS, AMQP, and Message Queue Telemetry Transport
(MQTT).

A Stream Analytics job can read data from IoT Hub and store the transformed data in a variety of output data sources, including Power BI.
Windowing function
Tumbling window
Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time
intervals.

Each event is only counted once.

However, they do not check the time duration between events and do not filter out
periods of time when no events are streamed.
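A minimal tumbling-window sketch in Stream Analytics SQL (the input name and columns are hypothetical):

-- Counts events in fixed, non-overlapping 10-second windows
SELECT System.Timestamp() AS WindowEnd, COUNT(*) AS EventCount
FROM InputStream TIMESTAMP BY EventTime
GROUP BY TumblingWindow(second, 10)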
Hopping windows
Hopping windows are a series of fixed-sized and contiguous time intervals. They hop
forward by a specified fixed time. If the hop size is less than a size of the window,
hopping windows overlap, and that is why an event may be part of several windows.

Hopping windows do not check the time duration between events and do not filter out
periods of time when no events are streamed.
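A hopping-window sketch under the same hypothetical input; a 10-second window that hops forward every 5 seconds, so windows overlap:

SELECT System.Timestamp() AS WindowEnd, COUNT(*) AS EventCount
FROM InputStream TIMESTAMP BY EventTime
GROUP BY HoppingWindow(second, 10, 5)   -- window size 10s, hop size 5s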
Sliding windows
Sliding windows are a series of fixed-sized and contiguous time intervals. They produce output only when an event occurs, so you can filter out periods of time where no events are streamed.

However, they may overlap and that is why an event may be included in more than one
window. Sliding windows also do not check the time duration between events.
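A sliding-window sketch (same hypothetical input); output is produced only when events enter or leave the 10-second window:

SELECT System.Timestamp() AS WindowEnd, COUNT(*) AS EventCount
FROM InputStream TIMESTAMP BY EventTime
GROUP BY SlidingWindow(second, 10)
HAVING COUNT(*) > 3   -- optional filter; sliding windows are typically paired with a HAVING condition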
Session windows
Session windows begin when the first event occurs, and they continue to extend, including new events occurring within the set time interval (timeout).

If no further events are detected, then the window will close. The window will also close
if the maximum duration parameter is set for the session window, and then a new
session window may begin.

The session window option will effectively filter out periods of time where no events are
streamed. Each event is only counted once.
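A session-window sketch (hypothetical names); events within a 5-minute timeout are grouped, with a maximum window duration of 10 minutes:

SELECT System.Timestamp() AS WindowEnd, COUNT(*) AS EventCount
FROM InputStream TIMESTAMP BY EventTime
GROUP BY SessionWindow(minute, 5, 10)   -- timeout 5 min, maximum duration 10 min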

Other Concepts
Event Grid
Event Grid is a publish-subscribe platform for events. Event publishers send the events
to Event Grid. Subscribers subscribe to the events they want to handle.

Azure Relay
Azure Relay allows client applications to access on-premises services through Azure.

HDInsight
HDInsight is a streaming technology that allows you to use C#, F#, Java, Python, and
Scala.

It does not allow you to use a SQL-like language.

WebJob
WebJob runs in the context of an Azure App Service app.

It can be invoked on a schedule or by a trigger.

You can use C#, Java, Node.js, PHP, Python to implement WebJobs.

However, you cannot use a SQL-like language.

Function App
A function app is similar to a WebJob in that it can be invoked on a schedule or by a
trigger.

You can use many different languages to create a function in a function app.
However, you cannot use a SQL-like language.
SQL Server monitoring has been removed from the syllabus as of 31st July, but questions are still appearing in the exam. So, please make sure you do not ignore this.

Log Analytics
Log Analytics allows you to write queries to analyze logs in Azure.
SQL Server Monitoring
Query Performance Insight allows you to view database queries that consume the most resources and those that take the longest to run.
It does not suggest when to create or drop an index.
Azure SQL Database – Diagnostic logging options
SQLInsights gathers performance information and provides recommendations.

QueryStoreRuntimeStatistics provides information about CPU usage and query duration.

Basic metrics provides CPU and DTU usage and limits.

QueryStoreWaitStatistics provides information about the resources that caused queries to wait, such as the CPU, logs, or locks.

DatabaseWaitStatistics provides information about the time a database spent waiting.
SQL Database Advisor
It allows you to review recommendations for creating and dropping indexes, fixing schemas, and
parameterizing queries.
Azure Advisor: provides recommendations for availability, security, performance, and cost. It
integrates with SQL Database Advisor to provide recommendations for creating and dropping
indexes.
Query Store - provides statistics on query performance. It helps you identify performance
differences that are caused by query changes. It is disabled by default.
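If you need to turn it on, a minimal sketch (the database name is hypothetical):

ALTER DATABASE [SalesDb] SET QUERY_STORE = ON;
-- Optionally check its state:
SELECT actual_state_desc FROM sys.database_query_store_options;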
SET SHOWPLAN_TEXT ON
This statement allows you to display query execution information without actually executing the
query. This statement is intended for applications that display text.
SET SHOWPLAN_ALL ON
This statement allows you to display query execution information without actually executing the
query. This statement is intended for applications that can display text. It provides additional
columns of information for each row that is output.
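A minimal usage sketch (the filter column StoreKey is hypothetical); the SELECT is not executed, only its estimated plan information is returned:

SET SHOWPLAN_ALL ON;
GO
SELECT * FROM FactStoreSales WHERE StoreKey = 10;
GO
SET SHOWPLAN_ALL OFF;
GO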

sys.dm_pdw_exec_requests
This view returns all queries that are currently running or that were recently running. You can
use the following SQL statement to return the top 10 longest running queries:
SELECT TOP 10 * FROM sys.dm_pdw_exec_requests ORDER BY total_elapsed_time DESC;
LABEL
The LABEL option assigns a comment (label) to the query. For example, you can add a label to a query as follows:

SELECT * FROM FactStoreSales OPTION ( LABEL = 'Q4' );

You can then easily locate the query's execution steps with the following statement:

SELECT * FROM sys.dm_pdw_exec_requests WHERE [label] = 'Q4';


VIEW DATABASE STATE permission.
This permission is required to access Dynamic Management Views (DMVs), which allow you to
investigate query execution in Azure SQL Data Warehouse. The view that contains logins is
sys.dm_pdw_exec_sessions. It actually contains the last 10,000 logins.
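For example, a quick sketch of listing the most recent logins from that view:

SELECT session_id, login_name, login_time, app_name
FROM sys.dm_pdw_exec_sessions
ORDER BY login_time DESC;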

VIEW DEFINITION permission.


This allows the employee to view metadata of an object. For example, the employee can view
table metadata in the sys.objects catalog.

ALTER ANY CONNECTION permission


This allows the employee to manage connections to the database server.

ALTER ANY USER permission


This allows the employee to manage database users.
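As a sketch of how these permissions are granted (the user names are hypothetical):

GRANT VIEW DATABASE STATE TO [reporting_user];   -- query DMVs
GRANT VIEW DEFINITION TO [reporting_user];       -- view object metadata
GRANT ALTER ANY USER TO [admin_user];            -- manage database users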

In this section, you will mostly be asked which metrics you would monitor in a particular given situation.

You should also know which types of logs you would send to Log Analytics.
It is good to be aware of the commonly used metrics and logs in the Diagnostic settings and Metrics pages of Monitoring.
Ganglia
Ganglia provides the default metrics available for Databricks.
How to transfer Databricks logs to Log Analytics/Azure Monitor?
You should use a third-party library to transfer Azure Databricks metrics to Azure
Monitor because it is not supported natively at the time of writing.

You should use Azure Log Analytics workspace as the target destination for uploaded
Azure Databricks metrics in Azure Monitor. Each workspace has its own data repository,
and data sources like Azure Databricks can be configured to store their metrics in a
particular Azure Log Analytics workspace.
Solution for sending application metrics to Azure Monitor?
Dropwizard is a Java library. Spark, which is the cluster engine that is used to run
Databricks, uses a configurable metrics system that is based on the Dropwizard Metrics
Library.

Diagnostic logs
You should configure diagnostic logs to send data to a blob storage account.

By default, Azure Data Factory stores run data for only 45 days. To store the data longer than that, you must configure diagnostic logs. With diagnostic logs, you can choose to store the run data in a blob storage account, an event hub, or a Log Analytics workspace.

You can query using KQL in the Log Analytics workspace tables.
ADFPipelineRun table contains rows for status changes like InProgress and Succeeded.
AzureMetrics contains metrics like PipelineSucceededRuns

Azure Data Factory inbuilt monitoring


Azure Data Factory includes monitoring capabilities for your pipeline runs with
execution metrics and pipeline status. You can define alerts directly in Azure Data
Factory Monitor.

However, Azure Data Factory data retention is limited to 45 days. You need to use Azure
Monitor for longer retention.

Azure Stream Analytics – Metrics to monitor


SU monitoring
You should use the SU % utilization metric. This metric indicates how much of the
provisioned SU % is in use. If this indicator reaches 80%, there is a high probability that
the job will stop.
Input events metric - This metric counts the number of records deserialized from the
input events.
Runtime errors metric - This metric is the total number of errors during query
processing.
Azure Network Watcher
It is a centralized tool for monitoring Azure networking.
SQL Server optimization has been removed from the syllabus as of 31st July 2020, but you can still see questions based on this topic.
There are many questions in this section that will once again check your Azure SQL Data Warehouse distribution knowledge – whether you understand the difference between round-robin, hash, and replicated distributions.
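As a refresher, a minimal sketch of the three distribution options in dedicated SQL pool DDL (table and column names are hypothetical):

CREATE TABLE dbo.FactSales ( SaleId INT NOT NULL, StoreId INT NOT NULL, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = HASH(StoreId), CLUSTERED COLUMNSTORE INDEX );   -- large fact table

CREATE TABLE dbo.DimStore ( StoreId INT NOT NULL, StoreName NVARCHAR(100) )
WITH ( DISTRIBUTION = REPLICATE );                                    -- small dimension table

CREATE TABLE dbo.StagingSales ( SaleId INT NOT NULL, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );                            -- staging / loading table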

You will see one question on TTL, and maybe 1 or 2 questions on the access tier; both topics are very easy, and you should not miss these questions.

Time to live (TTL): Azure Cosmos DB provides the ability to delete items automatically from a container after a certain time period.

You can set time to live at the container level and override the value on a per-item basis.

After you set the TTL at the container or item level, Azure Cosmos DB will automatically remove these items once the time period has elapsed since the time they were last modified.

The time to live value is configured in seconds.


Stream Analytics best practices
You should start with six SUs for queries that do not use PARTITION BY. This is considered a best practice.
You should allocate more SUs than you need. This is another best practice.

You should keep the SU % utilization metric below 80 percent. This allows Stream Analytics to account for usage spikes.
ExpressRoute.
This creates a dedicated link between your on-premises datacenter and Azure. This
improves performance when copying data to Azure.
AD Connect allows you to synchronize user accounts between on-premises AD and
Azure AD.
Access Tiers
The Archive tier is optimized for storing data that is rarely accessed and that is kept for
at least 180 days.

The Cool tier is optimized for storing data that is accessed infrequently and that is kept
for at least 30 days.

The Hot tier is optimized for storing data that is accessed frequently.


Azure SQL Server
A columnstore index stores column values and increases the performance of aggregate queries that use these indexes.
Memory-optimized tables store all data and schema in memory, increasing the performance of queries.
A partitioned view can be used to split large tables across multiple smaller tables.
A non-clustered index is generally used to increase filter performance and to look up rows with specific values.
A heap is a table without a clustered index, with table data stored without any specific order. Every query on a heap must perform a table scan. For small tables, a full table scan can perform better than using a clustered index.
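A few hedged T-SQL sketches of the options above (table, column, and index names are hypothetical):

-- Columnstore index to speed up aggregate queries
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_sales_columnstore
    ON dbo.Sales (StoreId, SaleDate, Amount);

-- Non-clustered index for filtering and looking up rows with specific values
CREATE NONCLUSTERED INDEX ix_sales_storeid ON dbo.Sales (StoreId);

-- Partitioned view that splits a large table across smaller tables
CREATE VIEW dbo.SalesAll AS
SELECT * FROM dbo.Sales2023
UNION ALL
SELECT * FROM dbo.Sales2024;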

Azure SQL Data Warehouse

The fastest loading process is PolyBase.
You can write and run PolyBase T-SQL commands. This is a fully parallel operation and is the fastest option.

You can also use a Copy Activity in Azure Data Factory with the copy method set to PolyBase. This option creates and executes the PolyBase commands automatically. This also offers the fastest performance.
The following options are not parallel operations and are thus slower:
 BCP

 SQL BulkCopy API

 SSIS

 Azure Data Factory using a Copy Activity and the bulk insert option
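A minimal sketch of the PolyBase T-SQL load pattern mentioned above; the external table ext.StoreSalesExternal (pointing at files in Blob storage or Data Lake) and the StoreKey column are assumptions:

CREATE TABLE dbo.FactStoreSales
WITH ( DISTRIBUTION = HASH(StoreKey), CLUSTERED COLUMNSTORE INDEX )
AS
SELECT * FROM ext.StoreSalesExternal;   -- CTAS from an external table runs as a parallel PolyBase load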
