Hadoop Illuminated
by Mark Kerzner and Sujee Maniyam
Dedication
To the open source community
This book on GitHub [https://github.com/hadoop-illuminated/hadoop-book]
Companion project on GitHub [https://github.com/hadoop-illuminated/HI-labs]
Acknowledgements
From Mark
I would like to express gratitude to my editors, co-authors, colleagues, and bosses who shared the thorny
path to working clusters, in the hope of making it less thorny for those who follow. Seriously, folks,
Hadoop is hard, and Big Data is tough, and there are many related products and skills that you need to
master. Therefore, have fun, provide your feedback [http://groups.google.com/group/hadoop-illuminated],
and I hope you will find the book entertaining.
"The author's opinions do not necessarily coincide with his point of view." - Victor Pelevin, "Generation
P" [http://lib.udm.ru/lib/PELEWIN/pokolenie_engl.txt]
From Sujee
To the kind souls who helped me along the way
Copyright 2013 Hadoop illuminated LLC. All Rights Reserved.
Table of Contents
1. Who is this book for?
    1.1. About "Hadoop illuminated"
2. About Authors
3. Why do I Need Hadoop?
    3.1. Hadoop provides storage for Big Data at reasonable cost
    3.2. Hadoop allows you to capture new or more data
    3.3. With Hadoop, you can store data longer
    3.4. Hadoop provides scalable analytics
    3.5. Hadoop provides rich analytics
4. Big Data
    4.1. What is Big Data?
    4.2. Human Generated Data and Machine Generated Data
    4.3. Where does Big Data come from
    4.4. Examples of Big Data in the Real World
    4.5. Challenges of Big Data
    4.6. How Hadoop solves the Big Data problem
5. Soft Introduction to Hadoop
    5.1. MapReduce or Hadoop?
    5.2. Why Hadoop?
    5.3. Meet the Hadoop Zoo
    5.4. Hadoop alternatives
    5.5. Alternatives for distributed massive computations
    5.6. Arguments for Hadoop
    5.7. Say "Hi!" to Hadoop
    5.8. Chapter Summary
6. Hadoop for Executives
7. Hadoop for Developers
8. Hadoop Distributed File System (HDFS) -- Introduction
    8.1. HDFS Concepts
    8.2. HDFS Architecture
9. Introduction To MapReduce
    9.1. How I failed at designing distributed processing
    9.2. How MapReduce does it
    9.3. How MapReduce really does it
    9.4. Who invented this?
    9.5. The benefits of MapReduce programming
10. Hadoop Use Cases and Case Studies
    10.1. Politics
    10.2. Data Storage
    10.3. Financial Services
    10.4. Health Care
    10.5. Human Sciences
    10.6. Telecoms
    10.7. Travel
    10.8. Energy
    10.9. Logistics
11. Hadoop Distributions
    11.1. Why distributions?
    11.2. Overview of Hadoop Distributions
    11.3. Hadoop in the Cloud
12. Big Data Ecosystem
13. Business Intelligence Tools For Hadoop and Big Data
14. Hardware and Software for Hadoop
15. Hadoop Challenges
16. Publicly Available Big Data Sets
List of Tables
5.1. Comparison of Big Data
7.1.
11.1. Hadoop Distributions
12.1. Tools for Getting Data into HDFS
12.2. Hadoop Compute Frameworks
12.3. Querying Data in HDFS
12.4. Real time access to data
12.5. Databases for Big Data
12.6. Hadoop in the Cloud
12.7. Work flow Tools
12.8. Serialization Frameworks
12.9. Tools for Monitoring Hadoop
12.10. Applications that run on top of Hadoop
12.11. Distributed Coordination
12.12. Data Analytics on Hadoop
12.13. Distributed Message Processing
12.14. Business Intelligence (BI) Tools
12.15. Stream Processing Tools
12.16. YARN based frameworks
12.17. Miscellaneous Stuff
13.1. BI Tools Comparison: Data Access and Management
13.2. BI Tools Comparison: Analytics
13.3. BI Tools Comparison: Visualizing
13.4. BI Tools Comparison: Connectivity
13.5. BI Tools Comparison: Misc
14.1. Hardware Specs
About Authors
Mark contributes to a number of Hadoop-based projects, and his open source projects can be found
on GitHub [http://github.com/markkerzner]. He writes about Hadoop and other technologies in his blog
[http://shmsoft.blogspot.com/].
Mark does Hadoop training for individuals and corporations; his classes are hands-on and draw heavily
on his industry experience.
Links:
LinkedIn [https://www.linkedin.com/in/markkerzner] || GitHub [https://github.com/markkerzner] || Personal blog [http://shmsoft.blogspot.com/] || Twitter [https://twitter.com/markkerzner]
Contributors
We would like to thank the following people who helped us make this book a reality.
Ben Burford : github.com/benburford [https://github.com/benburford]
Plus, not many databases can cope with storing billions of rows of data.
This model, however, doesn't quite work for Big Data, because copying so much data out to a compute
cluster might be too time consuming or impossible. So what is the answer?
One solution is to process Big Data 'in place' -- as in a storage cluster doubling as a compute cluster.
The short answer is that it simplifies dealing with Big Data. This answer immediately resonates with
people; it is clear and succinct, but it is not complete. The Hadoop framework has built-in power and
flexibility to do what you could not do before. In fact, Cloudera presentations at the latest O'Reilly Strata
conference mentioned that MapReduce was initially used at Google and Facebook not primarily for its
scalability, but for what it allowed you to do with the data.
In 2010, the average size of Cloudera's customers' clusters was 30 machines. In 2011 it was 70. When
people start using Hadoop, they do it for many reasons, all concentrated around the new ways of dealing
with the data. What gives them the security to go ahead is the knowledge that Hadoop solutions are massively scalable, as has been proved by Hadoop running in the world's largest computer centers and at the
largest companies.
As you will discover, the Hadoop framework organizes the data and the computations, and then runs your
code. At times, it makes sense to run your solution, expressed in a MapReduce paradigm, even on a single
machine.
But of course, Hadoop really shines when you have not one, but rather tens, hundreds, or thousands of
computers. If your data or computations are significant enough (and whose aren't these days?), then you
need more than one machine to do the number crunching. If you try to organize the work yourself, you
will soon discover that you have to coordinate the work of many computers, handle failures, retries, and
collect the results together, and so on. Enter Hadoop to solve all these problems for you. Now that you
have a hammer, everything becomes a nail: people will often reformulate their problem in MapReduce
terms, rather than create a new custom computation platform.
No less important than Hadoop itself are its many friends. The Hadoop Distributed File System (HDFS)
provides unlimited file space available from any Hadoop node. HBase is a high-performance unlimited-size
database working on top of Hadoop. If you need the power of familiar SQL over your large data
sets, Hive provides you with an answer. While Hadoop can be used by programmers and taught to students
as an introduction to Big Data, its companion projects (including ZooKeeper, about which we will hear
later on) will make projects possible and simplify them by providing tried-and-proven frameworks for
every aspect of dealing with large data sets.
As you learn the concepts, and perfect your skills with the techniques described in this book, you will
discover that there are many cases where Hadoop storage, Hadoop computation, or Hadoop's friends can
help you. Let's look at some of these situations.
Do you find yourself often cleaning out the limited hard drives in your company? Do you need to transfer
data from one drive to another, as a backup? Many people are so used to this necessity that they consider
it an unpleasant but unavoidable part of life. The Hadoop Distributed File System, HDFS, grows by adding
servers. To you it looks like one hard drive. It is self-replicating (you set the replication factor) and thus
provides redundancy as a software alternative to RAID.
Do your computations take an unacceptably long time? Are you forced to give up on projects because
you don't know how to easily distribute the computations between multiple computers? MapReduce
helps you solve these problems. What if you don't have the hardware to run the cluster? Amazon EC2
can run MapReduce jobs for you, and you pay only for the time that it runs; the cluster is automatically
formed for you and then disbanded.
But say you are lucky, and instead of maintaining legacy software, you are charged with building new,
progressive software for your company's work flow. Of course, you want to have unlimited storage,
solving this problem once and for all, so as to concentrate on what's really important. The answer is:
you can mount HDFS as a FUSE file system, and you have your unlimited storage. In our case studies
we look at the successful use of HDFS as grid storage for the Large Hadron Collider.
Imagine you have multiple clients using your online resources, computations, or data. Each single use
is saved in a log, and you need to generate a summary of resource usage for each client by day or by
hour. From this you will do your invoices, so it IS important. But the data set is large. You can write
a quick MapReduce job for that. Better yet, you can use Hive, a data warehouse infrastructure built on
top of Hadoop, with its ETL capabilities, to generate your invoices in no time. We'll talk about Hive
later, but we hope that you already see that you can use Hadoop and friends for fun and profit.
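As a small taste of what that looks like, here is a sketch of querying such a usage log through Hive's JDBC interface (one common way to talk to Hive via HiveServer2). The server address, table name, and column names are invented for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class UsageReport {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, table and columns below are hypothetical.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");

        Statement stmt = conn.createStatement();
        // Summarize resource usage per client per day; Hive turns this SQL
        // into MapReduce jobs behind the scenes.
        ResultSet rs = stmt.executeQuery(
                "SELECT client_id, to_date(ts) AS day, SUM(units) AS usage "
                + "FROM usage_log GROUP BY client_id, to_date(ts)");
        while (rs.next()) {
            System.out.printf("%s %s %d%n",
                    rs.getString(1), rs.getString(2), rs.getLong(3));
        }
        conn.close();
    }
}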
Once you start thinking without the usual limitations, you can improve on what you already do and come
up with new and useful projects. In fact, this book partially came about by asking people how they used
Hadoop in their work. You, the reader, are invited to submit your applications that became possible with
Hadoop, and I will put them into the Case Studies (with attribution, of course :).
ZooKeeper
Every zoo has a zoo keeper, and the Hadoop zoo is no exception. When all the Hadoop animals want to
do something together, it is the ZooKeeper who helps them do it. They all know him and listen and obey
his commands. Thus, the ZooKeeper is a centralized service for maintaining configuration information,
naming, providing distributed synchronization, and providing group services.
ZooKeeper is also fault-tolerant. In your development environment, you can put ZooKeeper on one
node, but in production you usually run it on an odd number of servers, such as 3 or 5.
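For a flavor of how simple the client side is, here is a minimal sketch using the ZooKeeper Java API to publish and read back a piece of configuration. The ensemble address, the znode path, and the config payload are all illustrative.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) three-server ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> { });

        // Publish a piece of configuration as a znode...
        if (zk.exists("/app/config", false) == null) {
            zk.create("/app/config", "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // ...and any node in the cluster can read it back.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}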
Granted, our concentration here is Hadoop, and we may not do justice to all the other
approaches. But we will try.
The specifics of your project and of your point of view will be different. Therefore, the table below is mainly to
encourage the reader to do a similar evaluation for his own needs.

Table 5.1. Comparison of Big Data

HBase
    Pros: Key-based NoSQL, active user community, Cloudera support
Cassandra
    Pros: Key-based NoSQL, active user community, Amazon's Dynamo on EC2
Vertica
    Pros: Closed-source, SQL-standard, easy to use, visualization tools, complex queries
    Cons: Vendor lock-in, price, RDBMS/BI may not fit every application
CloudTran
    Pros: Closed-source, optimized for online transaction processing
HyperTable
    Notes: To be kept in mind as a possible alternative
allows existing applications to scale without a significant rewrite. By now we have gotten pretty far away
from Hadoop, which proves that we have achieved our goal: to give the reader a quick overview of various
alternatives for building large distributed systems. Success in whichever way you choose to go!
I am not pulling your leg: it is that simple. That is the essence of a MapReduce job. Hadoop uses a cluster
of computers (nodes), where each node reads words in parallel with all others (Map), then the nodes collect
the results (Reduce) and write them back. Notice that there is a sort step, which is essential to the solution
and is provided for you, regardless of the size of the data. It may take place all in memory, or it may spill
to disk. If any of the computers go bad, their tasks are assigned to the remaining healthy ones.
How does this dialog look in the code?

Geek Talk

Hadoop: How can I help?

You: I need to count the words in a document.

Hadoop: I will read the words and give them to you, one at a time; can you count that?

Hadoop

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {

You: Yes, I will assign a count of 1 to each word and give them back to you.

You

    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

You have done more than you promised: you can process multiple words on the same line, if Hadoop
chooses to give them to you. This follows the principles of defensive programming. Then you immediately
realize that each input line can be as long as it wants. In fact, Hadoop is optimized to have the best overall
throughput on large data sets. Therefore, each input can be a complete document, and you are counting
word frequencies in documents. If the documents come from the Web, for example, you already have the
scale of computations needed for such tasks.
Hadoop: Very good. I will sort them, and will give them back to you in groups, grouping the same words
together. Can you count that?

You: Yes, I will go through them and give you back each word and its count.

You

public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  context.write(key, new IntWritable(sum));
}

Hadoop: I will record each word with its count, and we'll be done.

This is how the Mapper emits the map:

context.write(key, value);
Now the framework sorts your maps. Sorting is an interesting process and it occurs a lot in computer processing, so we will talk in more detail about it in the next chapter. Having sorted the maps, the framework
gives them back to you in groups. You supply the code which tells it how to process each group. This is
the Reducer. The key that you gave to the framework is the same key that it returns to your Reducer. In
our case, it was a word found in a document.
In the code, this is how we went through a group of values:
Going through values in the reducer
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
While the key was now the word, the value was the count, which, as you may remember, we set to 1
for each word. These counts are summed up.
Finally, you return the reduced result to the framework, and it outputs results to a file.

Reducer emits the final map

context.write(key, new IntWritable(sum));
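For completeness, here is a minimal driver that wires the two together, following the stock Hadoop WordCount example. We assume the Mapper and Reducer shown above are named TokenizerMapper and IntSumReducer, as in that example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);  // the Mapper above
        job.setCombinerClass(IntSumReducer.class);  // optional local pre-reduce
        job.setReducerClass(IntSumReducer.class);   // the Reducer above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}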
What is Hadoop?
Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed
storage and distributed processing for very large data sets.
See Chapter 15, Hadoop Challenges, for the complete list.
Table 7.1.

Job Type | Job functions | Skills
Hadoop Developer | |
Hadoop Admin | |
Data Scientist | Data mining and figuring out hidden knowledge in data | Math, data mining algorithms
Business Analyst | Analyzes data! |
We want our system to be cost-effective, so we are not going to use these 'expensive' machines. Instead
we will opt to use commodity hardware. By that we don't mean cheapo desktop class machines. We will
use performant server class machines -- but these will be commodity servers that you can order from any
of the vendors (Dell, HP, etc.).

So what do these server machines look like? See the Chapter 14, Hardware and Software for
Hadoop guide.
Let's go over some principles of HDFS. First let's consider the parallels between 'our design' and the actual
HDFS design.
Data is replicated

So how does Hadoop keep data safe and resilient in case of node failure? Simple: it keeps multiple copies
of the data around the cluster.

To understand how replication works, let's look at the following scenario. Data segment #2 is replicated 3
times, on data nodes A, B and D. Let's say data node A fails. The data is still accessible from nodes B and D.
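You can watch this machinery from the client side. The sketch below asks the NameNode which hosts hold each block of a file; the file path is made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreMyBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Ask the NameNode where each block of a (hypothetical) file lives.
        FileStatus status = fs.getFileStatus(new Path("/data/segment2.dat"));
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block is typically reported on 3 hosts -- the replicas.
            System.out.println(String.join(",", block.getHosts()));
        }
        fs.close();
    }
}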
MapReduce is stable

Recall that in my system I gave the responsibility for selecting the next piece of work to the workers. This
created two kinds of problems. When a worker crashed, nobody knew about it. Of course, the worker would
mark the work as "done" after it was completed, but when it crashed, there was nobody to do this for it, so
the work kept hanging. You needed watchers over watchers, and so on. Another problem would be created when
two overzealous workers wanted the same portion. There was a need to somehow coordinate this effort.
My solution was a flag in the database, but then this database was becoming the real-time coordinator for
multiple processes, and it is not very good at that. You can imagine multiple scenarios when this would fail.
By contrast, in MapReduce the JobTracker doles out the work. There is no contention: it takes the next
split and assigns it to the next available TaskTracker. If a TaskTracker crashes, it stops sending heartbeats
to the JobTracker.
Or say you don't want to drop the inputs with the same key, but rather you want to analyze them. For
example, you may be reading financial records, checking account transactions, and you may want to group
all transactions for the same account together, so that you can calculate the balance. Again, MapReduce
has already done this for you: all records for the same account will be grouped together (the sorting being
done by the Hadoop framework), and all you have to do is calculate the total.
We can now introduce the concept of Mappers and Reducers: Mappers process, or understand, the input,
and express this understanding by assigning a key, and potentially extracting or changing the value (such
as converting the currency). The Reducer receives all values with the same key, and can loop through
them to perform any operation required.

The Mapper's pair of key and value, usually denoted as <key, value>, is called a map. The Mapper that
does the above calculation is said to "emit the map", and the Reducer then loops through all the maps
and summarizes the results in some way.
A practical example
To explain the MapReduce calculation better, let me give you a practical example. In one ancient book
I read that to pass through the chambers of Heaven, you had to give a seal to the angel who was
guarding each door. The seal was to be created in your own thought: you concentrate on the mystical name
until it becomes a real physical seal. You then hand this seal to the angel, and he lets you in.

To calculate the meaning of the seal, the angel had to take the numerical value assigned to each letter
and add them up.

This then becomes a perfect analogy for the MapReduce calculation. Your value is the mystical name,
represented by the seal. It may be a simple string or a multi-dimensional object. From this value, you
calculate the key, or the sum total of each letter. This is the Mapper.
With this <key, value> combination you begin your travels. All the people who have the same key are
collected by the same angel in a certain chamber. This is your Reducer. The computation is parallel (each
person does his own travels) and scalable (there is no limit on the number of people traveling at the same
time). This is the MapReduce calculation. If one angel fails, another can take its place, because angels too
are defined by their names. Thus you have fault tolerance.
Another example

According to Michael Lynch, the founder of Autonomy, Barack Obama's Big Data won the US election
[http://www.computerworld.com/s/article/9233587/Barack_Obama_39_s_Big_Data_won_the_US_election].
This concludes our introduction to the MapReduce part of Hadoop, where our goal was to explain how
it works, without a single line of code.
The speed of light in a vacuum used to be about 35 mph. Then Jeff Dean spent a weekend optimizing
physics.
You can read the complete article by the two Google engineers, entitled MapReduce: Simplified Data Processing on Large Clusters [http://research.google.com/archive/mapreduce.html] and decide for yourself.
10.1. Politics
2012 US Presidential Election
How Big Data helped Obama win re-election
[http://www.computerworld.com/s/article/9233587/Barack_Obama_39_s_Big_Data_won_the_US_election]
by Michael Lynch, the founder of Autonomy [http://www.autonomy.com/] (cached copy [cached_reports/
Barack_Obama_Big_Data_won_the_US_election__Computerworld.pdf])
Problem: The previous solution using Teradata and IBM Netezza was time consuming and complex, and
the data mart approach didn't provide the data completeness required for determining overall data quality.

Solution: A Cloudera + Datameer platform allows analyzing trillions of records, which currently results in
approximately one terabyte per month of reports. The results are reported through a data quality dashboard.

Hadoop Vendor: Cloudera + Datameer

Cluster/Data size: 20+ nodes; 1TB of data / month

Links:

Cloudera case study [http://www.cloudera.com/content/dam/cloudera/Resources/PDF/connect_case_study_datameer_banking_financial.pdf]
(cached copy [cached_reports/connect_case_study_datameer_banking_financial.pdf]) (Published Nov 2012)
10.6. Telecoms

China Telecom Guangdong

Problem: Storing billions of mobile call records and providing real time access to the call records and
billing information to customers. Traditional storage/database systems couldn't scale to the loads or
provide a cost-effective solution.

Solution: HBase is used to store billions of rows of call record details. 30TB of data is added monthly.

Hadoop vendor: Intel

Hadoop cluster size: 100+ nodes

Links:

China Telecom Guangdong [http://gd.10086.cn/]

Intel case study [http://hadoop.intel.com/pdfs/IntelChinaMobileCaseStudy.pdf] (cached copy
[cached_reports/IntelChinaMobileCaseStudy.pdf]) (Published Feb 2013)

Intel APAC presentation [http://www.slideshare.net/IntelAPAC/apac-big-data-dc-strategy-update-foridh-launch-rk]
Nokia

Nokia collects and analyzes vast amounts of data from mobile phones.

Problem:
(1) Dealing with 100TB of structured data and 500TB+ of semi-structured data
(2) 10s of PB across Nokia, 1TB / day

Solution: An HDFS data warehouse allows storing all the semi/multi-structured data and offers processing
data at petabyte scale.

Hadoop Vendor: Cloudera

Cluster/Data size:
(1) 500TB of data
(2) 10s of PB across Nokia, 1TB / day

Links:

(1) Cloudera case study [http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Cloudera_Nokia_Case_Study_Hadoop.pdf]
(cached copy [cached_reports/Cloudera_Nokia_Case_Study_Hadoop.pdf]) (Published Apr 2012)

(2) Strata NY 2012 presentation slides [http://cdn.oreillystatic.com/en/assets/1/event/85/Big%20Data%20Analytics%20Platform%20at%20Nokia%20%E2%80%93%20Selecting%20the%20Right%20Tool%20for%20the%20Right%20Workload%20Presentation.pdf]
(cached copy [cached_reports/Nokia_Bigdata.pdf])

Strata NY 2012 presentation [http://strataconf.com/stratany2012/public/schedule/detail/26880]
10.7. Travel
Orbitz
Problem: Orbitz generates tremendous amounts of log data. The raw logs are only stored for a few days
because of costly data warehousing. Orbitz needed an effective way to store and process this data, plus
they needed to improve their hotel rankings.
Solution: A Hadoop cluster provided a very cost effective way to store vast amounts of raw logs. Data
is cleaned and analyzed and machine learning algorithms are run.
Hadoop Vendor: ?
Cluster/Data size: ?
Links:
Orbitz presentation [http://www.slideshare.net/jseidman/windy-citydb-final-4635799] (Published 2010)
Datanami article [http://www.datanami.com/datanami/2012-04-26/six_superscale_hadoop_deployments.html]
10.8. Energy
Seismic Data at Chevron
Problem: Chevron analyzes vast amounts of seismic data to find potential oil reserves.
Solution: Hadoop offers the storage capacity and processing power to analyze this data.
Hadoop Vendor: IBM Big Insights
Cluster/Data size: ?
Links:
Presentation [http://almaden.ibm.com/colloquium/resources/Managing%20More%20Bits%20Than%20Barrels%20Breuning.PDF]
(cached copy [cached_reports/IBM_Chevron.pdf]) (Published June 2012)
OPower
OPower works with utility companies to provide engaging, relevant, and personalized content about home
energy use to millions of households.
Problem: Collecting and analyzing massive amounts of data and deriving insights into customers' energy
usage.
Solution: Hadoop provides a single store for all the massive data, and machine learning algorithms are
run on the data.
Hadoop Vendor: ?
Cluster/Data size: ?
Links:
Presentation [http://cdn.oreillystatic.com/en/assets/1/event/85/Data%20Science%20with%20Hadoop%20at%20Opower%20Presentation.pdf]
(cached copy [cached_reports/Opower.pdf]) (Published Oct 2012)
Strata NY 2012 [http://strataconf.com/stratany2012/public/schedule/detail/25736]
Strata 2013 [http://strataconf.com/strata2013/public/schedule/detail/27158]
OPower.com [http://www.opower.com]
10.9. Logistics
Trucking data @ US Xpress
US Xpress, one of the largest trucking companies in the US, is using Hadoop to store sensor data from their
trucks. The intelligence they mine out of this saves them $6 million / year in fuel cost alone.

Problem: Collecting and storing 100s of data points from thousands of trucks, plus lots of geo data.

Solution: Hadoop allows storing enormous amounts of sensor data. Also, Hadoop allows querying / joining
this data with other data sets.
Hadoop Vendor: ?
Cluster/Data size: ?
Links:
- Tested
- Performance patches
- Distros have predictable product release road maps. This ensures they keep up with developments
  and bug fixes.
- A lot of distros come with support, which can be very valuable for a production-critical cluster.
Table 11.1. Hadoop Distributions

Distro | Remarks | Free / Premium
Apache (hadoop.apache.org [http://hadoop.apache.org]) | Oldest distro | Free
Cloudera (www.cloudera.com [http://www.cloudera.com]) | Very polished; comes with good tools to install and manage a Hadoop cluster | Free / Premium
HortonWorks (www.hortonworks.com [http://www.hortonworks.com]) | Newer distro |
MapR (www.mapr.com [http://www.mapr.com]) | Encryption support | Premium
Pivotal HD (gopivotal.com [http://gopivotal.com/pivotal-products/pivotal-data-fabric/pivotal-hd]) | |
WANdisco (www.wandisco.com [http://www.wandisco.com]) | Newer distro; Hadoop version 2; comes with tools to manage and administer a cluster | Premium
11.3. Hadoop in the Cloud
Elephants can really fly in the clouds! Most cloud providers offer Hadoop.
SkyTap Cloud

SkyTap offers deployable Hadoop templates
Links:
Skytap announcement [http://www.skytap.com/news-events/press-releases/skytap-introduces-cloudera-hadoop] || How to [http://blog.cloudera.com/blog/2013/01/how-to-deploy-a-cdh-cluster-in-the-skytap-cloud/]
Table 12.1. Tools for Getting Data into HDFS

Flume [http://flume.apache.org/]
Scribe [https://github.com/facebook/scribe]
Chukwa [http://incubator.apache.org/chukwa/]
Sqoop [http://sqoop.apache.org]
Kafka [http://kafka.apache.org/]

Table 12.2. Hadoop Compute Frameworks

YARN [http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html]
Weave [https://github.com/continuuity/weave]
Cloudera SDK

Table 12.3. Querying Data in HDFS

Pig [http://pig.apache.org/]
Hive [http://hive.apache.org]

Table 12.4. Real time access to data

HBase [http://hbase.apache.org/] : A NoSQL data store built on top of Hadoop. Provides real time access to data.
Accumulo [http://accumulo.apache.org/]
Impala [http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/]
Phoenix [http://phoenix-hbase.blogspot.com/]
Spire [http://drawntoscale.com/why-spire/]

Table 12.5. Databases for Big Data

HBase [http://hbase.apache.org/]
Cassandra [http://cassandra.apache.org/]
Redis [http://redis.io/]
Voldemort [http://www.project-voldemort.com/voldemort/]

Table 12.6. Hadoop in the Cloud

Whirr [http://whirr.apache.org]

Table 12.7. Work flow Tools

Oozie [http://oozie.apache.org/]
Cascading [http://www.cascading.org/] : Application framework for Java developers to develop robust Data Analytics and Data Management applications on Apache Hadoop.
Scalding [https://github.com/twitter/scalding]
Lipstick [https://github.com/Netflix/Lipstick]

Table 12.8. Serialization Frameworks

Avro [http://avro.apache.org/]
Trevni [https://github.com/cutting/trevni]
Protobuf [https://code.google.com/p/protobuf/]

Table 12.9. Tools for Monitoring Hadoop

Hue [http://cloudera.github.com/hue/] : Developed by Cloudera.
Ganglia [http://ganglia.sourceforge.net/]
Nagios [http://www.nagios.org/] : IT infrastructure monitoring.

12.10. Applications

Table 12.10. Applications that run on top of Hadoop

Mahout [http://mahout.apache.org/]
Giraph [http://incubator.apache.org/giraph/]

Table 12.11. Distributed Coordination

Zookeeper [http://zookeeper.apache.org/]

Table 12.12. Data Analytics on Hadoop

R language [http://www.r-project.org/]
RHIPE [http://www.datadr.org/]

Table 12.13. Distributed Message Processing

Akka [http://akka.io/]
RabbitMQ [http://www.rabbitmq.com/]

Table 12.14. Business Intelligence (BI) Tools

Datameer [http://www.datameer.com/]
Tableau [http://www.tableausoftware.com/]
Pentaho [http://www.pentaho.com/]
SiSense [http://www.sisense.com/]
SumoLogic [http://www.sumologic.com/]

Table 12.15. Stream Processing Tools

Storm [http://storm-project.net/]
Apache S4 [http://incubator.apache.org/s4/]
Samza [http://samza.incubator.apache.org/]
Malhar [https://github.com/DataTorrent/Malhar]

Table 12.16. YARN based frameworks

Samza [http://samza.incubator.apache.org/]
Spark [http://spark-project.org/]
Malhar [https://github.com/DataTorrent/Malhar]
Giraph [http://incubator.apache.org/giraph/]
Storm [http://storm-project.net/]

12.17. Miscellaneous

Table 12.17. Miscellaneous Stuff

Spark [http://spark-project.org/]
The tables compare three BI tools for Hadoop: Datameer [http://www.datameer.com], Tableau
[http://www.tableausoftware.com/], and Pentaho [http://www.pentaho.com/].

Table 13.1. BI Tools Comparison: Data Access and Management

Feature | Datameer | Tableau | Pentaho
Access raw data on Hadoop | Y | Y | Y

Table 13.2. BI Tools Comparison: Analytics

Feature | Datameer | Tableau | Pentaho
Pre-built analytics | Y | Y | Y

Table 13.3. BI Tools Comparison: Visualizing

Feature | Datameer | Tableau | Pentaho
Share with others | Y | | Y
Local rendering | Y | | Y

Table 13.4. BI Tools Comparison: Connectivity

Feature | Datameer | Tableau | Pentaho
Netezza | | |
SAP HANA | | |
Amazon RedShift | Y | |

Table 13.5. BI Tools Comparison: Misc

Feature | Datameer | Tableau | Pentaho
Security (LDAP, Active Directory, Kerberos) | Y | Y | Y

Feature notes:

Data validation: Can validate that data conforms to certain limits; can do cleansing and deduping.

Share with others: Can share the results with others within or outside the organization easily
(think of sharing a document on Dropbox or Google Drive).

Local rendering: You can slice and dice data locally on a computer or tablet. This uses the CPU
power of the device and doesn't need a round-trip to a 'server' to process results. This can speed
up ad-hoc data exploration.

Analytics apps marketplace: The platform allows customers to buy third party analytics apps, like
the Apple App Store.
Table 14.1. Hardware Specs

Spec | Medium | High End
CPU | 8 physical cores | 12 physical cores
Memory | 16 GB | 48 GB
Disk | 4 disks x 1TB = 4 TB | 12 disks x 3TB = 36 TB
Network | 1 Gbit Ethernet | 10 Gbit Ethernet or Infiniband
So the high end machines have more memory. Plus, newer machines are packed with a lot more disks (e.g.
36 TB) -- high storage capacity.
Examples of Hadoop servers

HP ProLiant DL380 [http://h10010.www1.hp.com/wwpc/us/en/sm/WF06a/15351-15351-3328412-241644-241475-4091412.html?dnr=1]

Dell C2100 series [http://www.dell.com/us/enterprise/p/poweredge-c2100/pd]

Supermicro Hadoop series [http://www.supermicro.com/products/nfo/files/hadoop/supermicro-hadoop.pdf]
So what does a large Hadoop cluster look like? Here is a picture of Yahoo's Hadoop cluster.
14.2. Software
Operating System
Hadoop runs well on Linux. The operating systems of choice are:
RedHat Enterprise Linux (RHEL)
This is a well tested Linux distro that is geared for Enterprise. Comes with RedHat support.

CentOS
Source compatible distro with RHEL. Free. Very popular for running Hadoop. Use a later version
(version 6.x).

Ubuntu
The Server edition of Ubuntu is a good fit -- not the Desktop edition. Long Term Support (LTS)
releases are recommended, because they continue to be updated for at least 2 years.
Java

Hadoop is written in Java. The recommended Java version is the Oracle JDK 1.6 release, at minimum
update 31 (1.6.0_31).

So what about OpenJDK? At this point the Sun JDK is the 'official' supported JDK. You can still run
Hadoop on OpenJDK (it runs reasonably well), but you are on your own for support :-)
How to get experience working with large data sets
http://www.philwhln.com/how-to-get-experience-working-with-large-datasets

Quora
www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-public [http://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-public] -- very good collection of links

Datamob
http://datamob.org/datasets

KDNuggets
http://www.kdnuggets.com/datasets/

Research Pipeline
http://www.researchpipeline.com/mediawiki/index.php?title=Main_Page

Google Public Data Directory
http://www.google.com/publicdata/directory

Delicious 1
https://delicious.com/pskomoroch/dataset

Delicious 2
https://delicious.com/judell/publicdata?networkaddconfirm=judell
Amazon
Data transfer is 'free' within the Amazon ecosystem (within the same zone).
AWS data sets [http://aws.amazon.com/publicdatasets/]
InfoChimps
InfoChimps has a data marketplace with a wide variety of data sets.
InfoChimps market place [http://www.infochimps.com/marketplace]
Comprehensive Knowledge Archive Network (CKAN)
An open source data portal platform. Data sets available on datahub.io [http://datahub.io/] from ckan.org [http://ckan.org/].
Stanford network data collection
http://snap.stanford.edu/data/index.html
Open Flights
Crowdsourced flight data: http://openflights.org/
Wikipedia data
wikipedia data [http://en.wikipedia.org/wiki/Wikipedia:Database_download]
OpenStreetMap.org
OpenStreetMap is a free worldwide map, created by its users. The geo and map data is available
for download.
planet.openstreetmap.org [http://planet.openstreetmap.org/]
Natural Earth Data
http://www.naturalearthdata.com/downloads/
Geocomm
http://data.geocomm.com/drg/index.html
Geonames data
http://www.geonames.org/
US GIS Data
Available from http://libremap.org/
Google N-gram data
google ngram [http://storage.googleapis.com/books/ngrams/books/datasetsv2.html]
Freebase data
Variety of data available from http://www.freebase.com/
UCI KDD data
http://kdd.ics.uci.edu/
European Parliament proceedings
proceedings [http://www.statmt.org/europarl/] from Statistical Machine Translation [http://www.statmt.org]

US government data
data.gov [http://www.data.gov/]

UK government data
data.gov.uk [http://data.gov.uk/]
US Patent and Trademark Office Filings
www.google.com/googlebooks/uspto.html [http://www.google.com/googlebooks/uspto.html]

National Institutes of Health
http://projectreporter.nih.gov/reporter.cfm

Aid information
http://www.aidinfo.org/data

UN Data
http://data.un.org/Explorer.aspx