Data Engineering Roadmap

• Programming Languages
• Python (Recommended)

• Java

• Scala

• Data Exploration Libraries (Python Only)


• Pandas

• NumPy

• Matplotlib

• Operating Systems & Scripting


• Linux/Unix Commands

• Shell Scripting

• Cron Jobs

• Data Structures & Algorithms (Easy-Medium Level Only)


• Arrays

• Strings

• Linked List

• Stack

• Queue

• Tree (Basics)

• Graph (Basics)

• Dynamic Programming

• Searching

• Sorting

• Database Management Systems


• Understand RDBMS and its use cases

• Schema Types
• ER Diagram

• ACID Properties

• Transactions

• Concurrency Control

• Deadlock

• Indexing

• Hashing

• Normalization (Normal Forms)

• Views

• Stored Procedures

• SQL
• Basics of DDL, DML, DCL

• All Types of Joins

• Subqueries

• Group By

• Case-When Statement

• Common Table Expressions (WITH Clause)

• Window Functions

• Pivoting

• Big Data Terminology
• What is Big Data?

• 5 V’s of Big Data

• Distributed Computation

• Distributed Storage

• Vertical vs Horizontal Scaling

• Commodity Hardware

• Clusters

• File formats
a. CSV

b. JSON

c. AVRO
d. Parquet

e. ORC

• Types of Data
a. Structured

b. Unstructured

c. Semi-structured

• Data Warehousing
• OLAP vs OLTP

• Dimension Tables

• Fact Tables

• Star Schema

• Snowflake Schema

• Warehouse Designing Questions

• Slowly Changing Dimensions (SCD)

• Big Data Frameworks
• Apache Hadoop (Architecture Understanding Most Important)
a. HDFS

b. MapReduce (Coding part not needed)

c. YARN

• Apache Hive
• How to load data in different file formats

• Internal Tables

• External Tables

• Querying table data stored in HDFS

• Partitioning

• Bucketing

• Map-Side Join

• Sort-Merge Bucket (SMB) Join

• UDFs in Hive

• SerDe in Hive

• Apache Spark (Most Important)


• Spark Core

• Spark SQL

• Spark Streaming

• Apache Flink (Real-Time Data Processing)

• Apache Sqoop

• Apache NiFi

• Apache Flume

• Schedulers/Workflow Managers
• Apache Airflow

• Apache NiFi

• Azkaban

• NoSQL Databases
• HBase

• Cassandra

• Elasticsearch

• MongoDB

• Message Queues
• Apache Kafka

• Dashboarding Tools


• Tableau

• Power BI

• Grafana

• Kibana

• Big Data Services in Cloud (AWS)


• On-demand Machines
• AWS EC2

• Access Management
• AWS IAM

• For Storing and Accessing Credentials


• AWS Secrets Manager

• Distributed File Storage


• AWS S3

• Transactional Database Services


• AWS RDS

• AWS Athena (serverless SQL queries over data in S3, not a transactional database)

• Data Warehousing Service


• AWS Redshift

• NoSQL Database Services


• AWS DynamoDB

• Serverless
• AWS Lambda

• ETL Services
• AWS Glue

• Scheduler
• AWS CloudWatch (Events, for scheduled rules)

• Distributed Data Computation


• AWS EMR

• Message Queues
• AWS SNS

• AWS SQS

• Real-Time Data Processing


• AWS Kinesis
