Hadoop 101 & Why Cloudera?: Wahyu Budiman
1 https://issues.apache.org/jira/browse/NUTCH-193
Telecom
Healthcare & Life Sciences
Media & Technology
Retail & CP
Public Sector
Customer 360 Customer 360 Connected: Car, Customer Loyalty Border Control
Fraud / Cyber Churn prediction Plane, Equipment Ship to Store Risk / Intelligence
Compliance Network Agile Supply Chain Agile Supply Chain Tax Optimization
Spend Analytics Optimization Predictive Next Best Offer Cybersecurity
Operational Data Data Monetization Maintenance Connected Store Fraud prevention
Store EDW Augmentation IoT Data enabled Completed baskets Intelligence
Market Data Media Streaming “Smart Services” IoT – Stores Patient care
Algo Trading Active Archive Clinical trials Active archiving Citizen 360
Active Archive Predictive Diagnostics Smart Vessel Patient Records
maintenance SAP active archive
Process-centric businesses use:
• Structured data mainly
• Internal data only
• “Important” data only
• Siloed data sources
Information-centric businesses use all data:
• Multi-structured, internal & external data of all types
©2014 Cloudera, Inc. All rights reserved.
Data Lake, Data Cleansing, BI
• Cloudera Navigator, Cloudera Manager
• Hue, Oozie
• Pig, Hive
• MapReduce, Spark, Impala (SQL), Search (Solr), HBase
• Workload (YARN)
• Hadoop Distributed File System (HDFS): Avro, Parquet, Encryption on Disk
• Sqoop, Solr indexing, Flume, Kafka, Spark Streaming
• RDBMS; Streaming Event Data (sensors, logs, devices)
Unlimited Storage
Infrastructure
60% of the Fortune 100 have attended live Cloudera training
40,000 people on Hadoop since 2009
Hadoop delivers:
• One place for unlimited data
• Unified, multi-framework data access
Process (Batch, Stream), Discover (SQL, Search), Model (Analytics, ML), Serve (NoSQL)
Security, Governance, Administration
BATCH PROCESSING: MAPREDUCE
ANALYTIC SQL: IMPALA
SEARCH ENGINE: SOLR
MACHINE LEARNING: SPARK
STREAM PROCESSING: SPARK STREAMING
3RD PARTY APPS
WORKLOAD MANAGEMENT: YARN
FILESYSTEM: HDFS
ONLINE NOSQL: HBASE
DATA MANAGEMENT: CLOUDERA NAVIGATOR
SYSTEM MANAGEMENT: CLOUDERA MANAGER
UNIFIED, ELASTIC, RESILIENT, SECURE
SENTRY
Flexible, Cost-Effective, Managed, Architecture, Secure and Governed
Cloudera Manager
Focus on the solution, not the cluster, with the only complete, zero-downtime administration tool for Apache Hadoop.
Unique Capabilities:
• Unified configuration, management, and monitoring across all services
• Online installation and upgrades
• Direct connection to Cloudera Support
• 3rd Party Extensibility
The Only Portable Cloud Experience for Hadoop
Maximize flexibility in Hadoop deployment architectures.
Cloudera Director
The first portable, self-service solution for deploying and managing enterprise-grade Hadoop in the Cloud.
Unique Capabilities:
• Dynamic cluster lifecycle management
• Cloud blueprints
• Multi-cluster health visibility
• Usage reporting for billing models
Cloudera Navigator
Minimize risk and maintain compliance with the only native end-to-end data governance solution for Apache Hadoop.
Unique Capabilities:
• Auditing
• Lineage
• Metadata Tagging and Discovery
• Lifecycle Management
Navigator Optimizer
Instantly understand data warehouse and Hadoop cluster usage, and drive optimizations to reduce cost and improve performance.
Unique Capabilities:
• Schema and workload profiling
• Data model discovery
• Optimization guidance
• Optimization automation (future)
The Only Comprehensively Secure Hadoop Platform
Meet compliance requirements and reduce risk exposure from storing sensitive data.