Cloudera Apache Hadoop 101
Who We Are
2011 Cloudera, Inc. All Rights Reserved.
Users of Cloudera
Industries: Financial, Web, Telecom, Media, Retail & Consumer
What is Apache Hadoop?
What Makes Hadoop Different?
Ability to scale out to petabytes in size using commodity hardware
Processing (MapReduce) jobs are sent to the data, rather than shipping the data to where it is processed
Hadoop doesn't impose a single data format, so it can easily handle structured, semi-structured and unstructured data
Manages fault tolerance and data replication automatically
Why the Need for Hadoop?
Chart: gigabytes of data created (in billions); value shown: 10,000
Hadoop Use Cases
Use Case Application Industry Application Use Case
DATA PROCESSING
Network Analytics Telco Mediation
Hadoop in the Enterprise
Diagram labels: Management Tools, IDEs, BI / Analytics, Enterprise Reporting, Customers, Enterprise Data Warehouse, Web Application
Data sources: Logs, Files, Web Data, Relational Databases
What is CDH?
Cloudera's Commitment to the Open Source Community
Component Cloudera Committers Cloudera Founder 2011 Commits
Common 6 Yes #1
HDFS 6 Yes #2
MapReduce 5 Yes #1
HBase 2 No #2
Zookeeper 1 Yes #2
Oozie 1 Yes #1
Pig 0 No #3
Hive 1 No #2
Sqoop 2 Yes #1
Flume 3 Yes #1
Hue 3 Yes #1
Snappy 2 No #1
Bigtop 8 Yes #1
Avro 4 Yes #1
Whirr 2 Yes #1
Components of CDH
Cloudera Enterprise
User Interface: HUE
Languages / Compilers: APACHE PIG, APACHE HIVE
Fast Read/Write Access
Data Integration
Hadoop Distributed File System
HDFS
Block Size = 64MB
Replication Factor = 3
Cost is $400-$500/TB
Diagram: a file's blocks (numbered 1 to 5) replicated across the nodes of the cluster
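To make the block size and replication figures concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the slide's 64 MB block size and replication factor of 3; the function name and the 1 GB example file are hypothetical and used only for illustration.

```python
import math

BLOCK_SIZE_MB = 64       # block size from the slide
REPLICATION_FACTOR = 3   # replication factor from the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION_FACTOR
    # A partially filled last block only occupies the data it holds,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_mb


if __name__ == "__main__":
    # Hypothetical 1 GB (1024 MB) file.
    print(hdfs_footprint(1024))
```

Under these assumptions, every megabyte of user data consumes three megabytes of raw disk, which is worth keeping in mind when reading the per-terabyte cost figure.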
Components of Hadoop
Networking
Map
Diagram: (key 1, values) -> Map Task -> (key 2, int. values) -> Shuffle Phase -> Reduce Task -> Final (key, values)
Reduce
After the map phase is over, all the intermediate values for a given output key are combined together into a list.
Diagram: (key 1, values) -> Map Task -> (key 2, int. values) -> Shuffle Phase -> Reduce Task -> Final (key, values)
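The map, shuffle and reduce steps described above can be sketched in a few lines of plain Python. This is a single-process illustration of the idea using word count as the job, not Hadoop's API; all function names are hypothetical.

```python
from collections import defaultdict


def map_task(line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: combine all intermediate values for each key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: collapse the list of values for one key into a final pair."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    intermediate = [pair for line in lines for pair in map_task(line)]
    grouped = shuffle(intermediate)
    return dict(reduce_task(key, values) for key, values in grouped.items())
```

In a real cluster the map and reduce tasks run in parallel on many nodes and the shuffle moves data across the network; the sketch only shows how the data changes shape between the phases.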
MapReduce Execution
Sqoop
SQL to Hadoop
Tool to import data from, and export data to, any JDBC-supported database
Transfers data between Hadoop and external databases or an EDW
High-performance connectors for some RDBMSs
Developed at Cloudera
Flume
Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
Suited for gathering logs from multiple systems and inserting them into HDFS as they are generated
Design goals: Reliability, Scalability, Manageability, Extensibility
Developed at Cloudera
Flume: high-level architecture
Master sends configuration to all Agents
Configurable levels of reliability: guarantee delivery in event of failure
Deployable, centrally administered
Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment
Diagram: Agents, Processors and the Master; processing steps shown include compress, batch and encrypt
HBase
Column-family store, based on the design of Google Bigtable
Provides interactive access to information
Holds extremely large datasets (multi-TB)
Constrained access model
(key, value) lookup
Limited transactions (only one row)
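The constrained access model above can be pictured with a toy in-memory table. This is a hedged sketch of the idea only (key-based lookup, column families, and writes that are atomic within a single row); the class and method names are hypothetical and are not the HBase client API.

```python
import threading


class TinyTable:
    """Toy in-memory stand-in for a column-family table (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}  # row key -> {family -> {qualifier -> value}}
        self._lock = threading.Lock()

    def put(self, row_key, cells):
        """Apply several cell writes to ONE row as a single atomic step.

        cells maps (family, qualifier) -> value.
        """
        # Validate first so a bad family never leaves a half-written row.
        for family, _qualifier in cells:
            if family not in self.families:
                raise KeyError("unknown column family: " + family)
        with self._lock:
            row = self.rows.setdefault(row_key, {})
            for (family, qualifier), value in cells.items():
                row.setdefault(family, {})[qualifier] = value

    def get(self, row_key, family, qualifier):
        """Key-based lookup of a single cell; returns None if absent."""
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)
```

There is deliberately no method that updates two rows together, which mirrors the "only one row" transaction limit on the slide.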
Hive
SQL-based data warehousing application
Language is SQL-like
Supports SELECT, JOIN, GROUP BY, etc.
Features for analyzing very large data sets
Partition columns, Sampling, Buckets
Example:
-- The name of the second table was lost in the source; kjv is assumed here.
SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;
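For readers less familiar with SQL, the example query above joins two word-frequency tables (aliased s and k) on the word column and keeps rows where the s-side frequency is at least 5. The same logic can be sketched in plain Python; each table is modeled as a hypothetical dict mapping word to frequency, and the function name is an assumption for illustration.

```python
def join_word_freqs(s, k, min_freq=5):
    """Inner-join two {word: freq} mappings on word.

    Keeps only words present in both mappings whose frequency in s is
    at least min_freq, and returns (word, s_freq, k_freq) tuples.
    """
    # Sorting is only for a deterministic result in this sketch; the
    # query itself does not ask for any particular order.
    return [
        (word, freq, k[word])
        for word, freq in sorted(s.items())
        if freq >= min_freq and word in k
    ]
```

Hive compiles a query like this into MapReduce jobs, so the few lines of SQL replace a hand-written join in Java.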
Pig
Data-flow oriented language: Pig Latin
Datatypes include sets, associative arrays, tuples
High-level language for routing data; allows easy integration of Java for complex tasks
Example:
emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 100000;
srtd = ORDER rich BY salary DESC;
STORE srtd INTO 'rich_people.txt';
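The Pig Latin example above is a small data flow: load records, filter by salary, sort in descending order, store the result. As a rough comparison, the filter and sort steps can be sketched in plain Python; the function name and the (id, name, salary) tuple layout are assumptions for illustration, and loading and storing files is left out.

```python
def rich_people(rows, threshold=100000):
    """Mimic FILTER ... BY salary > threshold, then ORDER ... BY salary DESC.

    rows is an iterable of (id, name, salary) tuples.
    """
    rich = [row for row in rows if row[2] > threshold]
    return sorted(rich, key=lambda row: row[2], reverse=True)
```

The difference is that Pig runs each step as MapReduce work across the cluster, while this sketch runs in one process and only suits data that fits in memory.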
Oozie
Oozie is a workflow/coordination service to manage data processing
Zookeeper
Zookeeper is a distributed consensus engine
Provides well-defined concurrent access semantics:
Leader election
Service discovery
Distributed locking / mutual exclusion
Message board / mailboxes
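Leader election, the first item in the list above, is commonly built on the idea that every participant registers under an increasing sequence number and the lowest live number is the leader. The sketch below simulates that idea in memory; it is not a ZooKeeper client, and the class and method names are hypothetical.

```python
import itertools


class ElectionSim:
    """In-memory simulation of sequence-based leader election.

    It only mimics the rule "lowest live sequence number leads";
    there is no networking or real coordination service here.
    """

    def __init__(self):
        self._next_seq = itertools.count()
        self._members = {}  # sequence number -> participant name

    def join(self, name):
        """Register a participant and return its sequence number."""
        seq = next(self._next_seq)
        self._members[seq] = name
        return seq

    def leave(self, seq):
        """Model a participant's session ending (crash or shutdown)."""
        self._members.pop(seq, None)

    def leader(self):
        """Return the name holding the lowest sequence number, or None."""
        if not self._members:
            return None
        return self._members[min(self._members)]
```

When the current leader leaves, the participant with the next-lowest number takes over without any further voting, which is what makes the scheme simple to reason about.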
Pipes and Streaming
FUSE - DFS
Hadoop Security
Authentication is secured by Kerberos v5 and integrated with LDAP
Hadoop servers can verify that users and groups are who they say they are
Job Control includes Access Control Lists, which means jobs can specify who can view logs, counters and configurations, and who can modify a job
Tasks now run as the user who launched the job
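The job-level Access Control Lists described above separate who may view a job from who may modify it. A minimal sketch of that check in Python follows; the class, its fields, and the rule that the job owner always has access are assumptions of this sketch, not Hadoop's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class JobAcl:
    """Hypothetical model of per-job view and modify access lists."""

    owner: str
    viewers: set = field(default_factory=set)    # may view logs, counters, configuration
    modifiers: set = field(default_factory=set)  # may modify the job

    def can_view(self, user):
        # Assumption of this sketch: the owner can always view.
        return user == self.owner or user in self.viewers

    def can_modify(self, user):
        # Assumption of this sketch: the owner can always modify.
        return user == self.owner or user in self.modifiers
```

For example, `JobAcl("alice", viewers={"bob"})` lets bob read the job's logs and counters without being able to change the job.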
Cloudera Enterprise
Cloudera Enterprise makes open source Hadoop enterprise-easy
Simplify and Accelerate Hadoop Deployment
Reduce Adoption Costs and Risks
Lower the Cost of Administration
Increase the Transparency and Control of Hadoop
Leverage the Experience of Our Experts
CLOUDERA ENTERPRISE COMPONENTS
Cloudera Manager: End-to-End Management Application for Apache Hadoop
Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
EFFECTIVENESS: Ensuring You Get Value From Your Hadoop Deployment
EFFICIENCY: Enabling You to Affordably Run Hadoop in Production
Cloudera Manager
Cloudera Enterprise
Feature: Benefit
Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
Cloudera University
Class: Description
Developer Training & Certification (4 Days): Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
System Administrator Training & Certification (3 Days): Hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster
HBase Training (2 Days): Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices
Analyzing Data with Hive and Pig (2 Days): Hive and Pig training is designed for people who have a basic understanding of how Apache Hadoop works and want to utilize these languages for analysis of their data
Essentials for Managers (1 Day): Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as "when is Hadoop appropriate?", "what are people using Hadoop for?" and "what do I need to know about choosing Hadoop?"
Cloudera Consulting Services
Put Our Expertise To Work For You.
Service: Description
Use Case Discovery: Assess the appropriateness and value of Hadoop for your organization
New Hadoop Deployment: Set up and configure high performance, production-ready Hadoop clusters
Proof of Concept: Verify the prototype functionality and project feasibility for a new Hadoop cluster
Production Pilot: Deploy your first production-level project using Hadoop
Process and Team Development: Define the requirements and processes for creating a new Hadoop team
Hadoop Deployment Certification: Perform periodic health checks to certify and tune up existing Hadoop clusters
Journey of the Cloudera Customer
Cloudera in Production
Cloudera Services: Consulting Services, Cloudera University
Cloudera Enterprise: Cloudera Management Suite, Cloudera Support
Cloudera's Distribution Including Apache Hadoop (CDH) & SCM Express
Connected systems: Management Tools, IDEs, BI / Analytics, Enterprise Reporting, Web Application, Enterprise Data Warehouse, Operational Rules Engines
Data sources: Logs, Files, Web Data, Relational Databases
Get Hadoop
Cloudera helps you profit from all your data.
facebook.com/cloudera
Cloudera Manager
Automated Deployment: Installs the complete Hadoop stack in minutes. The simple, wizard-based
Centralized Management: Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface
Service & Configuration Management: Set server roles, configure services and manage security across the cluster. Gracefully start, stop and restart services as needed
Audit Trails: Maintains a complete record of configuration changes for SOX compliance
Scans Hadoop logs for irregularities and warns you before they impact the cluster
Cloudera Manager
Support Integration: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
Event Management: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
Operational Reports: Visualize current and historical disk usage by user, group and directory
Host Level Monitoring: View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles
Two Editions: FREE EDITION ENTERPRISE EDITION**
Host-Level Monitoring
Configuration Management
Manage HDFS, MapReduce, HBase, Hue, Oozie & Zookeeper
Audit Trails
Start/Stop/Restart Services
Service Monitoring
Proactive Health Checks
Intelligent Log Management
View Service Health and Performance
Get Host-Level Snapshots
Monitor and Diagnose Cluster Workloads
Gather, View and Search Hadoop Logs
Track Events From Across the Cluster
Run Reports on System Performance & Usage
New in Cloudera Manager 3.7
1. Proactive Health Checks: Monitors dozens of service performance metrics and alerts you
2. Intelligent Log Management: Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
4. Support Integration: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
5. Event Management: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
6. Alerts: Generates email alerts when certain events occur
7. Audit Trails: Maintains a complete record of configuration changes for SOX compliance
8. Operational Reporting: Visualize current and historical disk usage by user, group and directory and track MapReduce activity on the cluster by job or user
Cloudera Support
Feature: Benefit
Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
Proactive Notification of New Developments and Events: Stay up to speed with what's going on in the Apache Hadoop community
Cloudera Enterprise