Intro To Apache Airflow


Apache Airflow

Introduction
Agenda
● What is Airflow
● What is a workflow
● Example of an Airflow workflow
● Background and the world before Airflow
● Purpose
● Terminologies
● Core Components
● Usages
● Demo
What is Airflow
● Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows.
● It allows you to define and manage complex data pipelines as directed acyclic graphs (DAGs) of tasks and to automate the process of creating and updating data pipelines. It provides a rich web-based interface for setting up, monitoring, and managing workflow execution, and an API for triggering and monitoring workflows.
What is a Workflow?

● A sequence of tasks
● Started on a schedule or triggered by an event
● Frequently used to handle big data processing pipelines

Example of an Airflow workflow

1. Download data
2. Send data to processing
3. Monitor processing
4. Generate report
5. Send email (see the DAG sketch below)
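
The five steps above could be expressed as a single Airflow DAG. The following is a hypothetical sketch, not code from the slides: the dag_id, task ids, commands, and email address are assumptions, and exact import paths and the schedule argument name vary slightly between Airflow versions.

# Hypothetical DAG for the five-step workflow above (Airflow 2.x style).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.operators.email import EmailOperator


def _monitor_processing():
    # Placeholder: poll the processing system and fail the task if processing failed.
    print("processing finished")


with DAG(
    dag_id="report_pipeline",           # assumed name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                  # "schedule_interval" on older Airflow versions
    catchup=False,
) as dag:
    download_data = BashOperator(
        task_id="download_data",
        bash_command="curl -o /tmp/data.csv https://example.com/data.csv",  # assumed source
    )
    send_to_processing = BashOperator(
        task_id="send_data_to_processing",
        bash_command="echo 'submit /tmp/data.csv to the processing system'",
    )
    monitor_processing = PythonOperator(
        task_id="monitor_processing",
        python_callable=_monitor_processing,
    )
    generate_report = BashOperator(
        task_id="generate_report",
        bash_command="echo 'build report from processed data'",
    )
    send_email = EmailOperator(
        task_id="send_email",
        to="team@example.com",          # assumed recipient
        subject="Daily report",
        html_content="The report is ready.",
    )

    # Each step runs only after the previous one has succeeded.
    download_data >> send_to_processing >> monitor_processing >> generate_report >> send_email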
Background

A developer wants to run a job on a schedule


● Cron job (Job scheduling)
● Python or bash script (a minimal sketch follows below)

[Diagram: Start → Extract data from data source A to storage B → End]
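
Before Airflow, each such job was typically a standalone script plus a crontab entry. A rough, hypothetical sketch (the file paths and the cron schedule are assumptions):

# extract_a_to_b.py — standalone script run by cron, e.g. with a crontab entry like:
#   0 2 * * * /usr/bin/python3 /opt/jobs/extract_a_to_b.py
import shutil

SOURCE_A = "/mnt/source_a/export.csv"    # assumed location of data source A
STORAGE_B = "/mnt/storage_b/export.csv"  # assumed location of storage B


def main():
    # "Extract data from data source A to storage B" reduced to a copy here;
    # retries, logging, and failure alerting would all have to be added by hand.
    shutil.copy(SOURCE_A, STORAGE_B)
    print("copied", SOURCE_A, "->", STORAGE_B)


if __name__ == "__main__":
    main()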
Background
Business demands more data extractions from various sources

Solution: develop more cron jobs

[Diagram: one cron job per extraction —
Job 1: Start → Extract data from data source A to storage B → End
Job 2: Start → Extract data from data source C to storage D → End
…
Job n: Start → Extract data from data source E to storage F → End]
Challenges with cron jobs
● Hard to scale
● Hard to monitor
● Hard to maintain
● Hard to maintain dependencies
● Hard to manage job failures and timeouts
● Hard to manage deployments
Airflow advantages

Developers can programmatically:


● Author workflows
● Schedule workflows
● Monitor workflows
● Debug
● Scale easily
Airflow Terms
● Task: a single unit of work in a workflow (see the sketch below)

[Diagram: two example pipelines, each Start → Extract data from data source A to storage B → End; each extract step is one task]
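
In code, a task is typically an operator instance defined inside a DAG. A hypothetical sketch of the extract step above as a single task (the dag_id, task_id, and script path are assumptions):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="extract_a_to_b",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # "schedule_interval" on older Airflow versions
    catchup=False,
) as dag:
    # One task = one unit of work: extract data from source A to storage B.
    extract = BashOperator(
        task_id="extract_source_a_to_storage_b",
        bash_command="python /opt/jobs/extract_a_to_b.py",  # assumed script path
    )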
Airflow DAG
● A DAG (Directed Acyclic Graph) is used to define a workflow as a series of
tasks and how they interact with each other.
● Each task in a DAG represents a single operation in your workflow, such as
running a query, sending an email, or uploading a file.
● The relationships between tasks are defined by dependencies, where one task can only run after another task has completed (see the sketch below).
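
Dependencies are declared with Airflow's bit-shift syntax (or set_upstream/set_downstream). A small hypothetical sketch using the example operations from the bullets above; EmptyOperator stands in for real work (it is called DummyOperator on older Airflow versions), and the dag_id and task ids are assumptions:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="dependency_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_query = EmptyOperator(task_id="run_query")
    send_email = EmptyOperator(task_id="send_email")
    upload_file = EmptyOperator(task_id="upload_file")

    # send_email and upload_file each run only after run_query has completed.
    run_query >> [send_email, upload_file]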
Airflow Core components

[Diagram: Webserver and Web UI, Scheduler, Workers, Metadata database, and Task execution logs]
Airflow usage
● Run and automate ETL pipelines (see the sketch after this list)
● Data ingestion pipelines
● Machine learning pipelines
● Predictive data pipelines
● General purpose scheduling
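
As one illustration of the ETL use case, a hypothetical sketch using Airflow's TaskFlow API (the @dag/@task decorators available in Airflow 2.x); the function names, sample data, and schedule are assumptions, and the schedule argument name varies between Airflow versions:

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def simple_etl():
    @task
    def extract():
        # Assumed: pull rows from some source system.
        return [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

    @task
    def transform(rows):
        # Assumed business logic: add a derived column.
        return [{**row, "amount_doubled": row["amount"] * 2} for row in rows]

    @task
    def load(rows):
        # Assumed: write the transformed rows to the target storage.
        print("loading", len(rows), "rows")

    load(transform(extract()))


simple_etl()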
Airflow architecture
● Scheduler: Triggers scheduled workflows and submits tasks to the executor to run
● Executor: Manages how tasks are run
● Worker: Runs the tasks
● Webserver: Serves the user interface
● Metadata database: Stores information about DAGs and tasks
