Hadoop and Related Tools


3252 - Big Data Management Systems and Tools


Module 3: Hadoop and Related Tools

1
Course Plan
Module Titles
Module 1 – Reliable, Scalable Data-Intensive Applications
Module 2 – Storing Data
Current Focus: Module 3 – Hadoop and Related Tools
Module 4 – Just Enough Scala for Spark
Module 5 – Introduction to Spark
Module 6 – Big Data Analytics and Processing Streaming Data
Module 7 – Spark 2
Module 8 – Tools for Moderately-Sized Datasets
Module 9 – Data Pipelines
Module 10 – Working with Distributed Databases
Module 11 – Big Data Architecture
Module 12 – Term Project Presentations (no content)

2
Learning Outcomes for this Module

• Describe the use and features of Hadoop
• Explain the operation of MapReduce
• Describe the various tools in the Hadoop ecosystem
• Build hands-on skills with Hadoop

3
Topics for this Module

• 3.1 Introduction to Hadoop
• 3.2 Hadoop Distributed File System (HDFS)
• 3.3 Yet Another Resource Negotiator (YARN)
• 3.4 MapReduce
• 3.5 Hadoop cluster management tools
• 3.6 Hadoop storage
• 3.7 Getting data into and out of Hadoop
• 3.8 Hadoop-related databases and analytics
• 3.9 Resources
• 3.10 Homework
• 3.11 Next Class

4
Module 3 – Section 1

Introduction to Hadoop

5
Hadoop 3 Architecture

[Architecture diagram not reproduced] Source: Hortonworks

6
Module 3 – Section 2

Hadoop Distributed File System (HDFS)

7
The Hadoop Distributed File System (HDFS)
• Stores very large files as blocks
• Designed for commodity servers
• Designed to scale easily and effectively
• Achieves reliability through replication
• But is a poor choice for:
− Low latency data access
− Storing lots of small files
− Multiple writers
− Modification within files

8
HDFS (cont’d)

• By default creates three replicas of each block:
− First on the same node as the client
− Second on a different rack
− Third on the same rack as the second, but on a different node
• Transparently checksums all data
• Corrupt replicas are automatically replaced from good
copies
• Variety of data compression algorithms
• Transparent end-to-end encryption
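
For a concrete feel, here is a minimal sketch of talking to HDFS from Python with the third-party hdfs package (a WebHDFS client). The namenode address, user, and paths are illustrative assumptions, not part of the course setup.

# A minimal sketch using the third-party Python "hdfs" package (WebHDFS client).
# The namenode address, user, and paths are illustrative placeholders.
from hdfs import InsecureClient

client = InsecureClient('http://namenode.example.com:9870', user='hdfs')

# Write a small file; HDFS transparently splits large files into blocks
client.write('/user/hdfs/demo/greeting.txt', data=b'Hello, HDFS!', overwrite=True)

# Read it back
with client.read('/user/hdfs/demo/greeting.txt') as reader:
    print(reader.read())

# File status reports the replication factor (three by default) and block size
status = client.status('/user/hdfs/demo/greeting.txt')
print(status['replication'], status['blockSize'])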

9
HDFS Concepts

• Blocks
− Data is stored in large blocks (default 128MB) to reduce seek
time
− Files consist of blocks and so can be larger than a single disk
and split across many servers
• Namenodes
− Master server that manages the filesystem namespace and
metadata
− Single point of failure unless High Availability is configured (see below)

10
HDFS Concepts (cont’d)

• Datanodes
− Slave servers where data is stored
• HDFS Federation
− Can have multiple namenodes, each managing different
filesystem subtrees
• HDFS High Availability
− Active-passive configuration with automatic failover (takes
about a minute)
− Can alternatively configure a new namenode (takes half an
hour or more)

11
Isn’t This Just Grid Computing?

• Grid computing: CPU-intensive. Hadoop: data-intensive (and possibly also CPU-intensive).
• Grid computing: possibly highly distributed. Hadoop: distributed, but data and CPU kept close together to conserve bandwidth.
• Grid computing: low-level coding. Hadoop: high-level coding.
• Grid computing: typically unaware of network topology and passes work to any available processor. Hadoop: network-topology aware and attempts to give related work to physically close processors.
• Grid computing: data is sent to the processors. Hadoop: the program is sent to the processors to work on data held locally.

12
Module 3 – Section 3

Yet Another Resource Negotiator (YARN)

13
YARN

• Yet Another Resource Negotiator
• Two long-running daemons:
− ResourceManager: allocates cluster resources and schedules applications
− NodeManager: runs on each worker node to launch and monitor containers
• A per-application ApplicationMaster negotiates containers from the ResourceManager for the application's tasks
• Apache Myriad allows Hadoop and Mesos to share the
same physical infrastructure

14
Module 3 – Section 4

MapReduce

15
MapReduce
• A simple parallel programming paradigm
• 2-3 phases
− Map: Split the problem into chunks and tackle them
− (Shuffle): Organize the results into matching groups
− Reduce: Combine the results of each group
• Not all problems can be carved up this way
• Works with a variety of programming languages
• System “looks after” message passing, scheduling on
machines, managing assembly of partial results

16
MapReduce (cont’d)

[Diagram: MapReduce data flow. Input splits feed map tasks; map output is sorted, copied to the reducers, and merged; each reducer writes one output partition (part 0, part 1, ...).]

17
MapReduce (cont’d)
• Mappers read (key, value) pairs from the input and emit new (key, value) pairs in which the value has been mapped to something else
• Shuffle and sort: the framework groups mapper output by key, so we now have (key, [value, value, ...])
• You never deal with the shuffle/sort phase explicitly; the framework handles it
• Reducers take the grouped (key, [values]) pairs as input and emit (key, aggregated_value)
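
To make the phases concrete, here is a small, self-contained Python sketch of word count, the classic MapReduce example: the map step emits (word, 1) pairs, the shuffle groups them by key, and the reduce step sums each group. A real Hadoop job would run this across many machines via the Java API or Hadoop Streaming; the phases are the same.

# Word count as a miniature map / shuffle / reduce, in plain Python.
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: emit a (key, value) pair -- here, (word, 1) -- for every word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort: group values by key, giving (key, [value, value, ...])
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each group's values into one result per key
counts = {key: sum(values) for key, values in groups.items()}

print(counts)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}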

18
Module 3 – Section 5

Hadoop Cluster Management Tools

19
Ambari
• Provision a Hadoop cluster
• Manage and monitor the cluster

20
Sentry

• A system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster
• Originally called Cloudera Access
• Works out of the box with Apache Hive and Cloudera Impala

21
Ranger
• Vision is to provide comprehensive security across the
Hadoop ecosystem
• Today provides fine-grained authorization and auditing for:
− Hadoop
− Hive
− HBase
− Storm
− Knox
• Centralized web application
− Policy administration
− Audit
− Reporting

22
Falcon

• Data processing and management platform for Hadoop to:
− Simplify moving data
− Provide process orchestration (coordination), data discovery and lifecycle management

23
ZooKeeper
• Centralized coordination service for distributed systems
• Provides centralized services such as maintaining
configuration information, naming, distributed
synchronization and group services
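
As an illustration, the third-party kazoo Python client can store a piece of shared configuration in ZooKeeper that every node in a cluster reads consistently. The ZooKeeper address and znode paths below are illustrative placeholders.

# A minimal sketch using the third-party "kazoo" Python client.
# The ZooKeeper address and znode paths are illustrative placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk.example.com:2181')
zk.start()

# Store a piece of shared configuration as a znode
zk.ensure_path('/app/config')
if not zk.exists('/app/config/db_url'):
    zk.create('/app/config/db_url', b'jdbc:mysql://db:3306/sales')

# Any process in the cluster can now read the same value
value, stat = zk.get('/app/config/db_url')
print(value.decode())

zk.stop()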

24
Azkaban

• Batch workflow job scheduler, created at LinkedIn to run Hadoop jobs

25
Module 3 – Section 6

Hadoop Storage

26
Avro

• Language-neutral (C, C++, Java, Python, Ruby) system for storing objects (“data serialization”)
• Supports primitive (e.g. int, string) and complex (e.g. array, enum) types
• Fast and interoperable
• Designed to work well with HDFS and MapReduce
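
As a sketch, the third-party fastavro library shows the idea in Python: records are written against an explicit schema and can be read back by any language with an Avro implementation. The schema and file name are illustrative.

# A minimal sketch using the third-party "fastavro" library.
# The schema and file name are illustrative placeholders.
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]

# Serialize records to an Avro container file
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Deserialize; a Java or Ruby program could read the same file
with open("users.avro", "rb") as inp:
    for record in reader(inp):
        print(record)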

27
Parquet, RCFiles & ORCFiles

• Columnar data storage formats: data is laid out by column rather than by row, improving compression and letting queries read only the columns they need
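
For a feel of what columnar layout buys, here is a sketch using the third-party pyarrow library to write a Parquet file and read back only one column, which is exactly the access pattern these formats optimize. File and column names are illustrative.

# A minimal sketch of columnar storage using the third-party "pyarrow" library.
# File and column names are illustrative placeholders.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user": ["ada", "grace", "ada"],
    "url": ["/home", "/search", "/cart"],
    "ms": [12, 45, 7],
})
pq.write_table(table, "clicks.parquet")

# A columnar layout lets a query read just the columns it needs
urls_only = pq.read_table("clicks.parquet", columns=["url"])
print(urls_only)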

28
Module 3 – Section 7

Getting Data Into and Out of Hadoop

29
Flume
• Can handle arbitrary types of streaming data
• Can configure the reliability level of the data collection
based on how important the data is and how timely it needs
to be
• Can have it batch the data up for transfer or move it as soon
as it becomes available
• Can do some lightweight transformations that make it handy
for ETL
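
To show what configuring Flume looks like, below is the canonical single-agent example from the Flume user guide (properties format): a netcat source feeding a logger sink through an in-memory channel. The names a1, r1, c1 and k1 are arbitrary labels.

# One Flume agent (a1): netcat source -> memory channel -> logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = logger

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1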

30
Sqoop
• Tool for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases
• Imports tables (or entire databases) into HDFS, Hive, or HBase
• Can also export data from Hadoop back to a relational database
• Runs transfers as parallel map tasks (see the example below)
31
Pig

• Dataflow/workflow language developed at Yahoo!
• Shell is called Grunt
• Steps are added to a logical plan built by the interpreter, which is not executed until a DUMP or STORE command
• Like Hive queries, Pig scripts are converted into MapReduce jobs
• Good for ETL and dealing with unstructured data

32
Pig Latin Example
-- Find the top 5 most-visited pages among users aged 18 to 25
users = LOAD 'users' AS (name, age);
filteredUsers = FILTER users BY age >= 18 AND age <= 25;
pages = LOAD 'pages' AS (user, url);
joinedUserPages = JOIN filteredUsers BY name, pages BY user;
groupedUserPages = GROUP joinedUserPages BY url;
numberOfClicks = FOREACH groupedUserPages GENERATE group AS url, COUNT(joinedUserPages) AS clicks;
sortedNumberOfClicks = ORDER numberOfClicks BY clicks DESC;
top5 = LIMIT sortedNumberOfClicks 5;
STORE top5 INTO 'top5sites';

33
Kafka

• Kafka is a distributed publish-subscribe messaging system
• Messages are published by producers, categorized into topics, and pulled for processing by consumers
• Consumers subscribe to topics they want to read messages
from
• Kafka maintains a partitioned log for each topic
• Each partition is an ordered subset of the messages
published to a topic
• Partitions of the same topic can be spread across many
nodes in a cluster and are usually replicated for reliability
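
A minimal sketch with the third-party kafka-python package shows the publish-subscribe flow; the broker address and topic name are illustrative placeholders.

# A minimal sketch using the third-party "kafka-python" package.
# Broker address and topic name are illustrative placeholders.
from kafka import KafkaProducer, KafkaConsumer

# A producer publishes messages to a topic
producer = KafkaProducer(bootstrap_servers='broker.example.com:9092')
producer.send('page-views', b'user=ada url=/home')
producer.flush()

# A consumer subscribes to the topic and pulls messages from its partitions
consumer = KafkaConsumer(
    'page-views',
    bootstrap_servers='broker.example.com:9092',
    auto_offset_reset='earliest',
)
for message in consumer:
    print(message.partition, message.offset, message.value)
    break  # read just one message in this sketch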

34
Module 3 – Section 8

Hadoop-Related Databases and Analytics

35
HCatalog
• Provides metadata services (and REST interface)
• The metadata itself is stored in a relational database outside Hadoop
• Makes Hive possible

36
Hive
• Relational-like table store in Hadoop
• Hive table consists of a schema stored in HCatalog plus
data stored in HDFS
• Converts HiveQL commands into batch jobs
• Can delete rows and add small batches of inserts but
changes are not in-place
• Limited index and subquery capability
• Supports both primitive (e.g. BOOLEAN, INT, FLOAT, VARCHAR, BINARY) and complex (ARRAY, MAP, STRUCT, UNION) types
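
As an illustration, HiveQL can be issued from Python through the third-party PyHive package; the host, port, user, and table below are placeholders. Hive compiles the query into batch jobs behind the scenes.

# A minimal sketch using the third-party PyHive package.
# Host, port, user, and table name are illustrative placeholders.
from pyhive import hive

conn = hive.Connection(host='hive.example.com', port=10000, username='analyst')
cursor = conn.cursor()

# HiveQL looks like SQL; Hive turns it into batch jobs
cursor.execute("""
    SELECT url, COUNT(*) AS clicks
    FROM page_views
    GROUP BY url
    ORDER BY clicks DESC
    LIMIT 5
""")
for url, clicks in cursor.fetchall():
    print(url, clicks)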

37
HBase

• NoSQL database
• Supports very large (potentially sparse) tables
• Latency in milliseconds
• Based on Google BigTable
• Stores data in column families
• Excels at key-based access to a single cell or range of cells
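
A sketch with the third-party happybase package (an HBase Thrift client) shows the key-based access pattern; the host, table, and column-family names are illustrative placeholders.

# A minimal sketch using the third-party "happybase" package (HBase Thrift client).
# Host, table, and column-family names are illustrative placeholders.
import happybase

connection = happybase.Connection('hbase.example.com')
table = connection.table('page_views')

# Write one cell: row key + column family:qualifier -> value
table.put(b'user:ada', {b'visits:/home': b'2024-01-15'})

# Key-based reads are where HBase excels
print(table.row(b'user:ada'))

# Range scan over a contiguous block of row keys
for key, data in table.scan(row_start=b'user:', row_stop=b'user;'):
    print(key, data)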

38
Drill

• A fast analytical engine for querying non-relational datastores using SQL
• Query language originally based on Google BigQuery's SQL dialect (different from Hive's)
• Columnar execution engine
• Data-driven compilation and recompilation at execution
time
• Specialized memory management that reduces memory
requirements and eliminates garbage collections
• Locality-aware execution that reduces network traffic
when Drill is co-located with the datastore
• Advanced cost-based optimizer that pushes processing
into the datastore when possible
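
One hedged way to try Drill from Python is its REST API, sketched below with the requests library; the Drill host and queried file path are placeholders.

# A minimal sketch querying Drill through its REST API with "requests".
# The Drill host and the queried file path are illustrative placeholders.
import requests

response = requests.post(
    'http://drill.example.com:8047/query.json',
    json={
        'queryType': 'SQL',
        # Drill can query raw files in place, with no upfront schema loading
        'query': "SELECT url, COUNT(*) AS clicks "
                 "FROM dfs.`/data/clicks.json` GROUP BY url",
    },
)
for row in response.json()['rows']:
    print(row)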
39
Mahout

• Provides a variety of machine learning algorithms that run on Hadoop (or other back ends)
• Requires Java programming skills to use
• Primarily used for:
− Recommendations (aka collaborative filtering)
− Clustering, e.g. grouping documents by topic
− Classification
− Frequent itemset mining: finding items that often appear together, say in queries or shopping baskets

40
Module 3 – Section 9

Resources

41
Resources
• White, Tom. Hadoop: The Definitive Guide, 4th Edition. O’Reilly, 2015.
• Grover et al. Hadoop Application Architectures. O’Reilly, 2015.
• Sammer, Eric. Hadoop Operations. O’Reilly, 2012.
• Kunigk & George. Hadoop in the Enterprise: Architecture. O’Reilly, 2018.
• Holmes, Alex. Hadoop in Practice. Manning, 2014.
• Capriolo & Wampler. Programming Hive. O’Reilly, 2012.

42
Resources (cont’d)
• Free online Pig book:
• Google paper on MapReduce:

43
Module 3 – Section 10

Homework

44
Assignment #2: Setup

• Sign up for an Azure free trial: portal.azure.com
• Go to Marketplace (button at the bottom of the dashboard).
• Enter “hdp” in Search and choose the Hortonworks Sandbox with HDP 2.5 that appears.
• Click Create at the bottom of the page.

45
Assignment #2: Basics setup

46
Assignment #2: SSH Connect

47
Assignment #2: Ambari password change

48
Assignment #2: Open ports

49
Assignment #2: Full tutorial

• Installing Hortonworks Data Platform 2.5 on Microsoft Azure: Tutorial
• It also has a full list of the ports you need to open.

50
Assignment #2: Exercises

• Note: you do NOT need to hand in the exercises below.
• Read the tutorial:
– “Learning the Ropes of the HDP Sandbox”
• Complete the following tutorials:
– “How to Process data with Apache Hive”
– “How to Process data with Apache Pig”

51
Assignment #2: Questions
• Note: you DO need to hand in answers to the questions below. Instructions on Quercus.
1. Describe a business situation in which Hadoop would be a better choice for storing the data than a relational database, and explain why (give at least three reasons).
2. Describe a business situation in which a relational database would be the better option, and explain why (give at least three reasons).
3. Describe a business situation where MongoDB would be a better choice than either, and again give at least three reasons.
4. What is the point of having and using Pig if we have Hive and so can use SQL? (Again, give three advantages.)
52
Module 3 – Section 11

Next Class

53
Next Class

• Next class will be Just Enough Scala to get ready for working with Spark
• In preparation for next class: Read Chapter 1 of Programming in Scala:

54
Follow us on social

Join the conversation with us online:

facebook.com/uoftscs

@uoftscs

linkedin.com/company/university-of-toronto-school-of-continuing-studies

@uoftscs

55
Any questions?

56
Thank You
Thank you for choosing the University of Toronto
School of Continuing Studies

57
