Summary Udemy
If you see the word “graph” in a question and the Gremlin API is one of the options, that
is the answer
That’s why I have explained partitioning in great detail; please make sure you
understand it well
Consistency level – You will see at least a few questions on this topic
If you see “most recent committed version” in a question, strong consistency is
the answer
If you see “hierarchical namespace” in a question and one of the options is Data
Lake, that is probably the right answer.
Blob Storage
Always use a general-purpose v2 account.
For low latency, use a premium block blob storage account. Block blob storage
accounts only support the premium tier, which is more expensive than
general-purpose v2 account types
Azure Data Lake
Azure Data Lake is a big data storage solution that allows you to store data of
any type and size.
Azure Data Lake Storage Gen2 has the capabilities of both Azure Blob Storage
and Azure Data Lake Storage Gen1
Table Storage
Allows you to use OData and Language Integrated Query (LINQ) to query data.
Cosmos SQL API
Allows you to use SQL to query data as JavaScript Object Notation (JSON)
documents.
You can use .NET to query Cosmos DB data that uses the SQL API.
Each entity (row) can store a varying number of attributes (fields). This allows
different vendors to upload products with varying attributes.
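As a sketch, a SQL API query over JSON documents looks like this (the property names and values below are hypothetical):
SELECT c.id, c.productName
FROM c
WHERE c.vendor = 'Contoso' AND c.price < 100 -- c is the container alias; properties are hypothetical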
Cosmos DB Partition key
A good partition key distributes all the documents evenly across logical
partitions.
Further study: Cosmos DB CLI
Eventual:
o Lowest latency
Session: the same user is guaranteed to read the same value within a single
session.
o Even before replication occurs, the user that writes the data can read the same
value.
o Users in the same location are not necessarily in the same session
Automatic failover: this is used to automatically fail over Cosmos DB in disaster
recovery scenarios.
Shared Access Signature
SAS delegates access to blob containers in a storage account with granular
control over how the client accesses your data.
You can define a SAS token to allow access only to a specific blob container with
a defined expiration.
Shared Key authorization (Access Keys)
Gives full administrative access to storage accounts, sometimes more access
than necessary.
Azure Managed Disks - These are virtual hard disks intended to be used as a part of
Virtual Machine (VM) related storage.
The Data Factory integration runtime is an important topic; you should know the
differences between the runtimes and which runtime to use in each scenario.
Self-hosted runtime
When you use the Copy activity to copy data between Azure and a private network, you
must use the self-hosted integration runtime.
Azure integration runtime.
This is required when you need to copy data between Azure and public cloud services.
Azure-SSIS integration runtime.
This is required when you want to run existing SSIS packages natively.
Linked Service
A linked service stores the connection information for the source data store, like user
credentials, the server address, and the database name.
A linked service is used by a dataset.
Activity
An activity is the task that is executed, like copying data or performing a lookup.
Activities use datasets to read or write data as the result of a pipeline.
Pipeline
A pipeline is a group of activities linked together to form a data pipeline.
Triggers
A tumbling window trigger can define the starting time in the WindowStart setting and the
ending time in the WindowEnd setting, defining a time frame to run the data pipeline.
You can have a notebook automatically generate pie charts and bar charts from data
when you run it.
You can override the default language by specifying the language magic
command %<language> at the beginning of a cell.
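For example, a sketch of a SQL cell in a notebook whose default language is Python (the table name is hypothetical):
%sql
SELECT device_id, COUNT(*) AS events -- telemetry is a hypothetical table
FROM telemetry
GROUP BY device_id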
Also, you should understand the difference between Event Hub and IoT Hub.
One question may be on reference data, which can be stored in Azure SQL Database or Blob storage.
Stream Analytics is a big data analytics solution that allows you to analyze real-time
events simultaneously.
The input data source can be an event hub, an IoT hub, blob storage, or Azure SQL Database (for reference data).
The reference input is data that never or rarely changes. Reference data can be saved in
Azure SQL Database or Blob storage
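As a sketch, a Stream Analytics query can join a streaming input with reference data like this (the input, output, and column names are hypothetical):
SELECT s.DeviceId, s.Temperature, r.DeviceName -- enrich the stream with reference data
INTO [PowerBIOutput]
FROM [StreamInput] s
JOIN [ReferenceInput] r
ON s.DeviceId = r.DeviceId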
Event Hubs
Event Hub is an Azure resource that allows you to stream big data to the cloud.
Event Hub accepts streaming telemetry data from other sources. It is basically a big
data pipeline. It allows you to capture, retain, and replay telemetry data
A Stream Analytics job can read data from Event Hubs and store the transformed data
in a variety of output data sources, including Power BI.
IoT Hub
IoT Hub is an Azure resource that allows you to stream big data to the cloud.
It accepts streaming data over HTTPS, AMQP, and Message Queue Telemetry Transport
(MQTT).
A Stream Analytics job can read data from IoT Hub and store the transformed data in
a variety of output data sources, including Power BI.
Windowing function
Tumbling window
Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time
intervals.
However, they do not check the time duration between events and do not filter out
periods of time when no events are streamed.
Hopping windows
Hopping windows are a series of fixed-sized and contiguous time intervals. They hop
forward by a specified fixed time. If the hop size is less than the size of the window,
hopping windows overlap, which is why an event may be part of several windows.
Hopping windows do not check the time duration between events and do not filter out
periods of time when no events are streamed.
Sliding windows
Sliding windows are a series of fixed-sized and contiguous time intervals. They produce
output only when an event occurs, so you can filter out periods of times where no
events are streamed.
However, they may overlap and that is why an event may be included in more than one
window. Sliding windows also do not check the time duration between events.
Session windows
Session windows begin when an event occurs, and they continue to
extend, including new events occurring within the set time interval (timeout).
If no further events are detected, then the window will close. The window will also close
if the maximum duration parameter is set for the session window, and then a new
session window may begin.
The session window option will effectively filter out periods of time where no events are
streamed. Each event is only counted once.
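As a sketch, this is how the four window types appear in Stream Analytics SQL (the input alias and durations are hypothetical):
-- Tumbling: fixed 10-minute, non-overlapping windows
SELECT COUNT(*) AS Events FROM [StreamInput] GROUP BY TumblingWindow(minute, 10)
-- Hopping: 10-minute windows that hop forward every 5 minutes, so they overlap
SELECT COUNT(*) AS Events FROM [StreamInput] GROUP BY HoppingWindow(minute, 10, 5)
-- Sliding: 10-minute windows that produce output only when an event occurs
SELECT COUNT(*) AS Events FROM [StreamInput] GROUP BY SlidingWindow(minute, 10)
-- Session: 5-minute timeout with a 20-minute maximum duration
SELECT COUNT(*) AS Events FROM [StreamInput] GROUP BY SessionWindow(minute, 5, 20)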
Other Concepts
Event Grid
Event Grid is a publish-subscribe platform for events. Event publishers send the events
to Event Grid. Subscribers subscribe to the events they want to handle.
Azure Relay
Azure Relay allows client applications to access on-premises services through Azure.
HDInsight
HDInsight is a managed big data service that supports streaming and allows you to use C#, F#, Java, Python, and
Scala.
WebJob
WebJob runs in the context of an Azure App Service app.
You can use C#, Java, Node.js, PHP, Python to implement WebJobs.
Function App
A function app is similar to a WebJob in that it can be invoked on a schedule or by a
trigger.
You can use many different languages to create a function in a function app.
However, you cannot use a SQL-like language.
SQL Server monitoring was removed from the syllabus on 31st July, but questions still
appear in the exam. So, please make sure you do not ignore this topic.
Log Analytics
Log Analytics allows you to write queries to analyze logs in Azure.
SQL Server Monitoring
Query Performance Insight allows you to view database queries that consume the most
resources and those that take the longest to run.
It does not suggest when to create or drop an index.
Azure SQL Database – Diagnostic logging options
SQLInsights. This gathers performance information and provides recommendations.
QueryStoreRuntimeStatistics. This provides information about CPU usage and query duration.
Basic metrics. This provides CPU and DTU usage and limits.
QueryStoreWaitStatistics. This provides information about the resources that caused queries to
wait, such as the CPU, logs, or locks.
DatabaseWaitStatistics. This provides information about the time a database spent on waiting.
SQL Database Advisor
It allows you to review recommendations for creating and dropping indexes, fixing schemas, and
parameterizing queries.
Azure Advisor: provides recommendations for availability, security, performance, and cost. It
integrates with SQL Database Advisor to provide recommendations for creating and dropping
indexes.
Query Store - provides statistics on query performance. It helps you identify performance
differences that are caused by query changes. It is disabled by default.
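If you need to enable it, a minimal T-SQL sketch (the database name is hypothetical):
ALTER DATABASE MyDatabase SET QUERY_STORE = ON; -- MyDatabase is a hypothetical name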
SET SHOWPLAN_TEXT ON
This statement allows you to display query execution information without actually executing the
query. This statement is intended for applications that display text.
SET SHOWPLAN_ALL ON
This statement allows you to display query execution information without actually executing the
query. This statement is intended for applications that can display text. It provides additional
columns of information for each row that is output.
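A minimal usage sketch for either statement (the query below is hypothetical):
SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM dbo.Orders WHERE OrderId = 42; -- the plan is displayed; the query is not executed
GO
SET SHOWPLAN_TEXT OFF;
GO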
sys.dm_pdw_exec_requests
This view returns all queries that are currently running or that were recently running. You can
use the following SQL statement to return the top 10 longest running queries:
SELECT TOP 10 * FROM sys.dm_pdw_exec_requests ORDER BY total_elapsed_time DESC;
LABEL
The LABEL option assigns a comment to a query. For example, you can add a label to a
query as follows (the table and label names below are hypothetical):
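SELECT * FROM dbo.FactSales OPTION (LABEL = 'My Query Label'); -- hypothetical table and label names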
You can then easily locate the query's execution steps with the following statement:
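SELECT * FROM sys.dm_pdw_exec_requests WHERE [label] = 'My Query Label'; -- matches the hypothetical label above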
In this section, you will mostly be asked which metrics to monitor in a particular
situation, and which types of logs to send to Log Analytics.
It is good to be aware of the commonly used metrics and logs in the Diagnostic settings
and Metrics pages of Monitoring.
Ganglia
The default cluster metrics tool available for Databricks.
How to transfer Databricks logs to Log Analytics/Azure Monitor?
You should use a third-party library to transfer Azure Databricks metrics to Azure
Monitor because it is not supported natively at the time of writing.
You should use Azure Log Analytics workspace as the target destination for uploaded
Azure Databricks metrics in Azure Monitor. Each workspace has its own data repository,
and data sources like Azure Databricks can be configured to store their metrics in a
particular Azure Log Analytics workspace.
Solution for sending application metrics to Azure Monitor?
Dropwizard is a Java library. Spark, which is the cluster engine that is used to run
Databricks, uses a configurable metrics system that is based on the Dropwizard Metrics
Library.
Diagnostic logs
You should configure diagnostic logs to send data to a blob storage account.
By default, Azure Data Factory stores pipeline run data for only 45 days. To store the data longer
than that, you must configure diagnostic logs. With diagnostic logs, you can choose
to store the run data in a blob storage account, an event hub, or a Log Analytics
workspace.
You can query using KQL in the Log Analytics workspace tables.
The ADFPipelineRun table contains rows for status changes like InProgress and Succeeded.
The AzureMetrics table contains metrics like PipelineSucceededRuns.
However, Azure Data Factory data retention is limited to 45 days. You need to use Azure
Monitor for longer retention.
You will see one question on TTL and maybe one or two questions on access tiers; both
topics are very easy, so you should not miss these questions.
Time to live (TTL): Azure Cosmos DB provides the ability to delete items automatically
from a container after a certain time period.
You can set the time to live at the container level and override the value on a per-
item basis.
After you set the TTL at the container or item level, Azure Cosmos DB will
automatically remove these items after that time period has elapsed since they were
last modified.
You should keep the SU (streaming unit) utilization metric below 80 percent. This allows
Stream Analytics to account for usage spikes.
ExpressRoute.
This creates a dedicated link between your on-premises datacenter and Azure. This
improves performance when copying data to Azure.
Azure AD Connect allows you to synchronize user accounts between on-premises
Active Directory and Azure AD.
Access Tiers
The Archive tier is optimized for storing data that is rarely accessed and that is kept for
at least 180 days.
The Cool tier is optimized for storing data that is accessed infrequently and that is kept
for at least 30 days.
You can also use a Copy Activity in Azure Data Factory with the copy method set to
PolyBase. This option creates and executes the PolyBase commands automatically
and offers the fastest performance (see the sketch after the list below).
The following options are not parallel operations and are thus slower:
BCP
SSIS
Azure Data Factory using a Copy Activity and the bulk insert option
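As a sketch, a hand-written PolyBase-style load into a dedicated SQL pool looks like this (all object names are hypothetical, and the external data source and file format are assumed to already exist):
CREATE EXTERNAL TABLE ext.Sales
(
    SaleId INT,
    Amount DECIMAL(18,2)
)
WITH
(
    LOCATION = '/sales/',             -- hypothetical folder in the data lake
    DATA_SOURCE = MyAzureStorage,     -- hypothetical external data source
    FILE_FORMAT = MyParquetFormat     -- hypothetical file format
);
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(SaleId))
AS
SELECT SaleId, Amount FROM ext.Sales; -- CTAS loads the data in parallel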