Hadoop and Related Tools
1
Course Plan
Module Titles
Module 1 – Reliable, Scalable Data-Intensive Applications
Module 2 – Storing Data
Current Focus: Module 3 – Hadoop and Related Tools
Module 4 – Just Enough Scala for Spark
Module 5 – Introduction to Spark
Module 6 – Big Data Analytics and Processing Streaming Data
Module 7 – Spark 2
Module 8 – Tools for Moderately-Sized Datasets
Module 9 – Data Pipelines
Module 10 – Working with Distributed Databases
Module 11 – Big Data Architecture
Module 12 – Term Project Presentations (no content)
2
Learning Outcomes for this Module
3
Topics for this Module
4
Module 3 – Section 1
Introduction to Hadoop
5
Hadoop 3 Architecture
Source: Hortonworks
6
Module 3 – Section 2
7
The Hadoop Distributed File System (HDFS)
• Stores very large files as blocks
• Designed for commodity servers
• Designed to scale easily and effectively
• Achieves reliability through replication
• But is a poor choice for:
− Low latency data access
− Storing lots of small files
− Multiple writers
− Modification within files
8
HDFS (cont’d)
9
HDFS Concepts
• Blocks
− Data is stored in large blocks (default 128MB) to reduce seek time
− Files consist of blocks and so can be larger than a single disk and split across many servers (a short code sketch below shows how to inspect a file’s blocks)
• Namenodes
− Master server that manages the filesystem namespace and metadata
− Single point of failure
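The concepts above can be seen from code. Below is a minimal sketch using Hadoop’s Java FileSystem API: it writes a file and then asks the namenode which datanodes hold each of its blocks. The path /data/events.log and the cluster configuration are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS, dfs.blocksize, etc. from the cluster config on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a file; the namenode records its metadata, datanodes store its blocks
        Path path = new Path("/data/events.log");   // hypothetical path
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("one example record\n");
        }

        // Ask the namenode where each block of the file physically lives
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}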
10
HDFS Concepts (cont’d)
• Datanodes
− Slave servers where data is stored
• HDFS Federation
− Can have multiple namenodes, each managing different filesystem subtrees
• HDFS High Availability
− Active-passive configuration with automatic failover (takes about a minute)
− Can alternatively configure a new namenode (takes half an hour or more)
11
Isn’t This Just Grid Computing?
12
Module 3 – Section 3
13
YARN
14
Module 3 – Section 4
MapReduce
15
MapReduce
• A simple parallel programming paradigm
• 2-3 phases
− Map: Split the problem into chunks and tackle them
− (Shuffle): Organize the results into matching groups
− Reduce: Combine the results of each group
• Not all problems can be carved up this way
• Works with a variety of programming languages
• System “looks after” message passing, scheduling on machines, managing assembly of partial results
16
MapReduce (cont’d)
[Figure: MapReduce data flow. Input splits feed map tasks; map output is sorted and copied to the reducers, which merge it and write the output part files.]
17
MapReduce (cont’d)
• Mappers get (key: value) pairs from files and emit (key: value) pairs where the value has been mapped to something else
• Shuffle and Sort: groups the output from above by key, so we now have (key: [value, value, …])
• You never need to deal with the shuffle/sort phase explicitly; it is handled by the framework
• Reducers take the sorted (key: [values]) as input and emit (key: aggregated_value); the word-count sketch below illustrates this
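To make the mapper/reducer contract concrete, here is the classic word count written against Hadoop’s Java MapReduce API. It is a minimal sketch (the job driver and input/output setup are omitted), not a complete application.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: (byte offset, line of text) -> one (word, 1) pair per word
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count)
    // The shuffle/sort between map and reduce is handled by the framework
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}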
18
Module 3 – Section 5
19
Ambari
• Provision a Hadoop cluster
• Manage and monitor the cluster
20
Sentry
21
Ranger
• Vision is to provide comprehensive security across the Hadoop ecosystem
• Today provides fine-grained authorization and auditing for:
− Hadoop
− Hive
− HBase
− Storm
− Knox
• Centralized web application
− Policy administration
− Audit
− Reporting
22
Falcon
23
ZooKeeper
• Centralized coordination service for distributed systems
• Provides centralized services such as maintaining configuration information, naming, distributed synchronization and group services (see the sketch below)
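A minimal sketch of the ZooKeeper Java client publishing and reading a piece of shared configuration. The connect string localhost:2181 and the znode path /app-config are assumptions for illustration.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (hypothetical address) with a 3s session timeout
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Publish a configuration value as a znode that other processes can read
        String path = "/app-config";   // hypothetical znode
        if (zk.exists(path, false) == null) {
            zk.create(path, "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.PERSISTENT);
        }

        // Any client in the cluster can now read (and watch) the same value
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}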
24
Azkaban
• Workflow scheduler
25
Module 3 – Section 6
Hadoop Storage
26
Avro
27
Parquet, RCFiles & ORCFiles
28
Module 3 – Section 7
29
Flume
• Can handle arbitrary types of streaming data
• Can configure the reliability level of the data collection based on how important the data is and how timely it needs to be
• Can have it batch the data up for transfer or move it as soon as it becomes available
• Can do some lightweight transformations that make it handy for ETL
30
Sqoop
• Tool for bulk data transfer between Hadoop and structured datastores such as relational databases
• Imports tables into HDFS, Hive or HBase, and exports results back to the relational database
• Runs imports and exports as parallel map tasks
31
Pig
32
Pig Latin Example
users = LOAD 'users' AS (name, age);
filteredUsers = FILTER users BY age >= 18 AND age <= 25;
pages = LOAD 'pages' AS (user, url);
joinedUserPages = JOIN filteredUsers BY name, pages BY user;
groupedUserPages = GROUP joinedUserPages BY url;
numberOfClicks = FOREACH groupedUserPages GENERATE group, COUNT(joinedUserPages) AS clicks;
sortedNumberOfClicks = ORDER numberOfClicks BY clicks DESC;
top5 = LIMIT sortedNumberOfClicks 5;
STORE top5 INTO 'top5sites';
33
Kafka
34
Module 3 – Section 8
35
HCatalog
• Provides metadata services (and REST interface)
• The metadata is stored in a relational database outside Hadoop
• Makes Hive possible
36
Hive
• Relational-like table store in Hadoop
• Hive table consists of a schema stored in HCatalog plus data stored in HDFS
• Converts HiveQL commands into batch jobs
• Can delete rows and add small batches of inserts but changes are not in-place
• Limited index and subquery capability
• Supports both primitive (e.g. BOOLEAN, INT, FLOAT, VARCHAR, BINARY, etc.) and complex (ARRAY, MAP, STRUCT, UNION) types (see the JDBC sketch below)
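A minimal sketch of issuing HiveQL from Java over JDBC, showing a table that mixes primitive and complex column types. The HiveServer2 address (localhost:10000) and the table and column names are assumptions for illustration; it requires the Hive JDBC driver on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveTypesDemo {
    public static void main(String[] args) throws Exception {
        // Requires the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) on the classpath
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

            // Schema goes to the metastore (HCatalog); the rows live as files in HDFS
            stmt.execute("CREATE TABLE IF NOT EXISTS visits ("
                    + "user_name STRING, age INT, urls ARRAY<STRING>)");

            // Each HiveQL query is compiled into a batch job on the cluster
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT age, COUNT(*) AS n FROM visits GROUP BY age")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("age") + ": " + rs.getLong("n"));
                }
            }
        }
    }
}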
37
HBase
• NoSQL database
• Supports very large (potentially sparse) tables
• Latency in milliseconds
• Based on Google BigTable
• Stores data in column families
• Excels at key-based access to a single cell or range of cells (a code sketch follows below)
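A minimal sketch of key-based access with the HBase Java client API: one Put and one Get against a single cell. The table name, column family and row key are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user#42", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("user#42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back by key, the access pattern HBase excels at
            Get get = new Get(Bytes.toBytes("user#42"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}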
38
Drill
40
Module 3 – Section 9
Resources
41
Resources
• White, Tom. Hadoop: The Definitive Guide, 4th Edition. O’Reilly, 2015.
• Grover et al. Hadoop Application Architectures. O’Reilly, 2015.
• Sammer, Eric. Hadoop Operations. O’Reilly, 2012.
• Kunigk & George. Hadoop in the Enterprise: Architecture. O’Reilly, 2018.
• Holmes, Alex. Hadoop in Practice. Manning, 2014.
• Capriolo & Wampler. Programming Hive. O’Reilly, 2012.
42
Resources (cont’d)
• Free online Pig book:
• Google paper on MapReduce:
43
Module 3 – Section 10
Homework
44
Assignment #2: Setup
45
Assignment #2: Basics setup
46
Assignment #2: SSH Connect
47
Assignment #2: Ambari password change
48
Assignment #2: Open ports
49
Assignment #2: Full tutorial
50
Assignment #2: Exercises
51
Assignment #2: Questions
• Note: you DO need to hand in answers to the questions below. Instructions are on Quercus.
1. Describe a business situation in which Hadoop would be a better choice for storing the data than a relational database and explain why (give at least three reasons)
2. Describe a business situation in which a relational database would be the better option and explain why (give at least three reasons)
3. Describe a business situation where MongoDB would be a better choice than either, and again give at least three reasons
4. What is the point of having and using Pig if we have Hive and so can use SQL (again, give three advantages)?
52
Module 3 – Section 11
Next Class
53
Next Class
54
Follow us on social
facebook.com/uoftscs
@uoftscs
linkedin.com/company/university-of-toronto-school-of-continuing-studies
@uoftscs
55
Any questions?
56
Thank You
Thank you for choosing the University of Toronto
School of Continuing Studies
57