150 Data Engineering Interview Questions PDF
33 All Interview Questions
The interview questions are roughly structured like the sections in the "Basic data Engineering Skills" part. This makes it easier to navigate this document. I still need to sort
them accordingly.
SQL DBs
• What is the difference between a clustered index and a non-clustered index? Give
examples.
The Cloud
• What is serverless?
• How do you move from the ingest layer to the consumption layer (in a serverless setup)?
Linux
• What is crontab?
Big Data
Kafka
• What is a topic?
• How do you know if all messages in a topic have been fully consumed?
• What is a producer?
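One common way to approach the "fully consumed" question is consumer lag: for each partition, compare the log end offset with the offset the consumer group has committed. The sketch below uses plain dictionaries as stand-ins for those two values, so `partition_lag`, `fully_consumed`, and the sample offsets are hypothetical names and numbers for illustration, not part of any Kafka client API. It also assumes that a partition without a committed offset counts as consumed from offset 0, which is a simplification.

```python
def partition_lag(end_offsets, committed_offsets):
    """Return per-partition lag: log end offset minus committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a missing commit means nothing has been consumed yet.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


def fully_consumed(end_offsets, committed_offsets):
    """A topic counts as fully consumed when every partition has zero lag."""
    return all(v == 0 for v in partition_lag(end_offsets, committed_offsets).values())


# Hypothetical offsets for a topic with three partitions.
end_offsets = {0: 120, 1: 98, 2: 45}
committed = {0: 120, 1: 95, 2: 45}

lag = partition_lag(end_offsets, committed)
print(lag)
print(fully_consumed(end_offsets, committed))
```

In this example partition 1 still has three unread messages, so the topic is not fully consumed for that group. Note that lag is always relative to one consumer group, and new messages can arrive right after the check.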
Coding
• Explain immutability.
• What are AWS Lambda functions and why would you use them?
NoSQL DBs
• What's a columnstore?
Hadoop
• What is HDFS?
Lambda Architecture
• Can you sync the batch and streaming layers, and if yes, how?
Python
• What is a data warehouse?
APIs (REST)
• What is idempotency?
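To make the idempotency question concrete: an operation is idempotent if repeating the same request leaves the system in the same state as doing it once. The sketch below shows one way to get that behaviour for a create-style call by using a client-supplied idempotency key. `PaymentStore`, `charge`, and the key `"req-123"` are hypothetical names chosen for illustration; a real API would persist the keys instead of keeping them in memory.

```python
class PaymentStore:
    """In-memory stand-in for a service that must tolerate client retries."""

    def __init__(self):
        self._by_key = {}   # idempotency key -> charge id already created
        self.charges = []   # list of (charge_id, amount)

    def charge(self, idempotency_key, amount):
        # A repeated key returns the earlier result instead of charging again.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        charge_id = len(self.charges) + 1
        self.charges.append((charge_id, amount))
        self._by_key[idempotency_key] = charge_id
        return charge_id


store = PaymentStore()
first = store.charge("req-123", 50)
retry = store.charge("req-123", 50)   # e.g. the client timed out and retried
print(first == retry, len(store.charges))
```

The retry returns the same charge id and no second charge is recorded, which is the property an interviewer is usually looking for when asking about safe retries.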
Apache Spark
• What's an RDD?
• What is a dataframe?
• What is a dataset?
• What is Parquet?
• What's Avro?
MapReduce
• What is a combiner?
• What is a container?
Data Pipelines
Airflow
• How to branch?
Data Visualization
• What's a BI tool?
Security/Privacy
• What is Kerberos?
• What is a firewall?
• What's GDPR?
• What's anonymization?
Distributed Systems
• How do clusters reach consensus? (The answer was to use consensus protocols like Paxos
or Raft. Luckily, I didn't have to explain Paxos.)
• What is the CAP theorem? Explain it. (What factors should be considered when
choosing a DB?)
• How do you choose the right storage for different data consumers? It's always a tricky
question.
Apache Flink
• Flink vs Spark?
GitHub
Dev/Ops
• What is continuous deployment?
• What is the difference between CI and CD?
Development / Agile
• What is Scrum?
• What is OKR?