Hadoop Blueprints
About this ebook
- Solve real-world business problems using Hadoop and other Big Data technologies
- Build efficient data lakes in Hadoop, and develop systems for various business cases like improving marketing campaigns, fraud detection, and more
- Power packed with six case studies to get you going with Hadoop for Business Intelligence
If you are interested in building efficient business solutions using Hadoop, this is the book for you. This book assumes that you have basic knowledge of Hadoop, Java, and any scripting language.
Hadoop Blueprints - Anurag Shrivastava
Table of Contents
Hadoop Blueprints
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Hadoop and Big Data
The beginning of the big data problem
Limitations of RDBMS systems
Scaling out a database on Google
Parallel processing of large datasets
Building open source Hadoop
Enterprise Hadoop
Social media and mobile channels
Data storage cost reduction
Enterprise software vendors
Pure Play Hadoop vendors
Cloud Hadoop vendors
The design of the Hadoop system
The Hadoop Distributed File System (HDFS)
Data organization in HDFS
HDFS file management commands
NameNode and DataNodes
Metadata store in NameNode
Preventing a single point of failure with Hadoop HA
Checkpointing process
Data Store on a DataNode
Handshakes and heartbeats
MapReduce
The execution model of MapReduce Version 1
Apache YARN
Building a MapReduce Version 2 program
Problem statement
Solution workflow
Getting the dataset
Studying the dataset
Cleaning the dataset
Loading the dataset on the HDFS
Starting with a MapReduce program
Installing Eclipse
Creating a project in Eclipse
Coding and building a MapReduce program
Run the MapReduce program locally
Examine the result
Run the MapReduce program on Hadoop
Further processing of results
Hadoop platform tools
Data ingestion tools
Data access tools
Monitoring tools
Data governance tools
Big data use cases
Creating a 360-degree view of a customer
Fraud detection systems for banks
Marketing campaign planning
Churn detection in telecom
Analyzing sensor data
Building a data lake
The architecture of Hadoop-based systems
Lambda architecture
Summary
2. A 360-Degree View of the Customer
Capturing business information
Collecting data from data sources
Creating a data processing approach
Presenting the results
Setting up the technology stack
Tools used
Installing Hortonworks Sandbox
Creating user accounts
Exploring HUE
Exploring MySQL and the Hive command line
Exploring Sqoop at the command line
Test driving Hive and Sqoop
Querying data using Hive
Importing data in Hive using Sqoop
Engineering the solution
Datasets
Loading customer master data into Hadoop
Loading web logs into Hadoop
Loading tweets into Hadoop
Creating the 360-degree view
Exporting data from Hadoop
Presenting the view
Building a web application
Installing Node.js
Coding the web application in Node.js
Summary
3. Building a Fraud Detection System
Understanding the business problem
Selecting and cleansing the dataset
Finding relevant fields
Machine learning for fraud detection
Clustering as an unsupervised machine learning method
Designing the high-level architecture
Introducing Apache Spark
Apache Spark architecture
Resilient Distributed Datasets
Transformation functions
Actions
Test driving Apache Spark
Calculating the yearly average stock prices using Spark
Apache Spark 2.X
Understanding MLlib
Test driving K-means using MLlib
Creating our fraud detection model
Building our K-means clustering model
Processing the data
Putting the fraud detection model to use
Generating a data stream
Processing the data stream using Spark streaming
Putting the model to use
Scaling the solution
Summary
4. Marketing Campaign Planning
Creating the solution outline
Supervised learning
Tree-structure models for classification
Finding the right dataset
Setting up the solution architecture
Coupon scan at POS
Join and transform
Train the classification model
Scoring
Mail merge
Building the machine learning model
Introducing BigML
Model building steps
Sign up as a user on the BigML site
Upload the data file
Creating the dataset
Building the classification model
Downloading the classification model
Running the Model on Hadoop
Creating the target list
Post campaign activities
Summary
5. Churn Detection
A business case for churn detection
Creating the solution outline
Building a predictive model using Hadoop
Bayes' Theorem
Playing with the Bayesian predictor
Running a Node.js-based Bayesian predictor
Understanding the predictor code
Limitations of our solution
Building a churn predictor using Hadoop
Synthetic data generation tools
Preparing a synthetic historical churn dataset
The processing approach
Running the MapReduce program
Understanding the frequency counter code
Putting the model to use
Integrating the churn predictor
Summary
6. Analyze Sensor Data Using Hadoop
A business case for sensor data analytics
Creating the solution outline
Technology stack
Kafka
Flume
HDFS
Hive
OpenTSDB
HBase
Grafana
Batch data analytics
Loading streams of sensor data from Kafka topics to HDFS
Using Hive to perform analytics on inserted data
Data visualization in MS Excel
Stream data analytics
Loading streams of sensor data
Data visualization using Grafana
Summary
7. Building a Data Lake
Data lake building blocks
Ingestion tier
Storage tier
Insights tier
Ops facilities
Limitations of open source Hadoop ecosystem tools
Hadoop security
HDFS permissions model
Fine-grained permissions with HDFS ACLs
Apache Ranger
Installing Apache Ranger
Test driving Apache Ranger
Define services and access policies
Examine the audit logs
Viewing users and groups in Ranger
Data Lake security with Apache Ranger
Apache Flume
Understanding the Design of Flume
Installing Apache Flume
Running Apache Flume
Apache Zeppelin
Installation of Apache Zeppelin
Test driving Zeppelin
Exploring data visualization features of Zeppelin
Define the gold price movement table in Hive
Load gold price history into the table
Run a select query
Plot price change per month
Running the paragraph
Zeppelin in Data Lake
Technology stack for Data Lake
Data Lake business requirements
Understanding the business requirements
Understanding the IT systems and security
Designing the data pipeline
Building the data pipeline
Setting up the access control
Synchronizing the users and groups in Ranger
Setting up data access policies in Ranger
Restricting the access in Zeppelin
Testing our data pipeline
Scheduling the data loading
Refining the business requirements
Implementing the new requirements
Loading the stock holding data in Data Lake
Restricting the access to stock holding data in Data Lake
Testing the Loaded Data with Zeppelin
Adding stock feed in the Data Lake
Fetching data from Yahoo Service
Configuring Flume
Running Flume as Stock Feeder to Data Lake
Transforming the data in Data Lake
Growing Data Lake
Summary
8. Future Directions
Hadoop solutions team
The role of the data engineer
Data science for non-experts
From the data science model to business value
Hadoop on Cloud
Deploying Hadoop on cloud servers
Using Hadoop as a service
NoSQL databases
Types of NoSQL databases
Common observations about NoSQL databases
In-memory databases
Apache Ignite as an in-memory database
Apache Ignite as a Hadoop accelerator
Apache Spark versus Apache Ignite
Summary
Hadoop Blueprints
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2016
Production reference: 1270916
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78398-030-7
www.packtpub.com
Credits
About the Authors
Anurag Shrivastava is an entrepreneur, blogger, and manager living in Almere, near Amsterdam in the Netherlands. He started his IT journey by writing a small poker program on a mainframe computer 30 years ago, and he fell in love with software technology. In his 24-year career in IT, he has worked for companies of various sizes, ranging from Internet start-ups to large system integrators in Europe.
Anurag kick-started the Agile software movement in North India when he set up the Indian business unit for the Dutch software consulting company Xebia. He led the growth of Xebia India as the managing director of the company for over 6 years and made the company a well-known name in the Agile consulting space in India. He also started the Agile NCR Conference in the New Delhi Capital Region, which has become a heavily visited annual event on Agile best practices.
Anurag became active in the big data space when he joined ING Bank in Amsterdam as the manager of the customer intelligence department, where he set up the bank's first Hadoop cluster and implemented several transformative technologies, such as Netezza and R, in his department. He is now active in payment technology and APIs, using technologies such as Node.js and MongoDB.
Anurag loves to cycle on the reclaimed island of Flevoland in the Netherlands. He also likes listening to Hindi film music.
I would like to thank my wife, Anjana, and daughter, Anika, for putting up with my late-night writing sessions and skipping of weekend breaks. I also would like to thank my parents and teachers for their guidance in life.
I would like to express my gratitude to my colleagues at Xebia, and to Daan Teunissen, from whom I learned the value of technical writing and who inspired me to work on this book project. I would like to thank all the mentors I have had over the years. I would like to express thanks and gratitude to Amir Arooni, my boss at ING Bank, who gave me the time and opportunity to work on big data and, later, on this book. I also thank the Packt team and my coauthor, Tanmay, who provided help and guidance throughout the whole process.
Tanmay Deshpande is a Hadoop and big data evangelist. He's interested in a wide range of technologies, such as Apache Spark, Hadoop, Hive, Pig, NoSQL databases, Mahout, Sqoop, Java, and cloud computing. He has vast experience in application development in various domains, such as finance, telecoms, manufacturing, security, and retail. He enjoys solving machine learning problems and spends his time reading anything he can get his hands on. He has a great interest in open source technologies and promotes them through his lectures. He has been invited to various computer science colleges to conduct brainstorming sessions with students on the latest technologies. Through his innovative thinking and dynamic leadership, he has successfully completed various projects. Tanmay is currently working with Schlumberger as the lead big data developer. Before Schlumberger, Tanmay worked with Lumiata, Symantec, and Infosys.
Tanmay is the author of Hadoop Real-World Solutions Cookbook, Second Edition, DynamoDB Cookbook, and Mastering DynamoDB, all by Packt Publishing.
I would like to thank my family and the Almighty for supporting me throughout all my adventures.
About the Reviewers
Dedunu Dhananjaya is a senior software engineer in personalized learning and analytics at Pearson. He is interested in data science and analytics. Prior to Pearson, Dedunu worked at Zaizi, LIRNEasia, and WSO2. He is currently reading for his master's in applied statistics at the University of Colombo.
Wissem El Khlifi is the first Oracle ACE from Spain and an Oracle Certified Professional DBA with over 12 years of IT experience.
He earned his computer science engineering degree from FST Tunisia and holds master's degrees in computer science and in big data science analytics and management from UPC Barcelona. His areas of interest are Linux system administration, high-availability Oracle databases, NoSQL database management for big data, and big data analysis.
His career has included the following roles: Oracle and Java analyst/programmer, Oracle DBA, architect, team leader, and big data scientist. He currently works as a senior database and applications engineer for Schneider Electric/APC. He writes numerous articles on his website, http://www.oracle-class.com, and his Twitter handle is @orawiss.
Randal Scott King is the managing partner of Brilliant Data, a consulting firm specializing in data analytics. In his years of consulting, Scott has amassed an impressive list of clientele, from mid-market leaders to Fortune 500 household names. In addition to Hadoop Blueprints, he has also served as technical reviewer for other Packt Publishing books on big data and has authored the instructional videos Learning Hadoop 2 and Mastering Hadoop. Scott lives just outside Atlanta, GA, with his children. You can visit his blog at http://www.randalscottking.com.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Preface
This book covers the application of Hadoop and its ecosystem of tools to solve business problems. Hadoop has fast emerged as the leading big data platform and finds application in many industries where massive datasets have to be stored and analyzed. Hadoop lowers the cost of investment in storage and supports the generation of new business insights that were previously out of reach because of the massive data volumes and computing capacity required to process them. This book covers several business cases and builds a solution to a business problem in each. Every solution covered in this book has been built using Hadoop, HDFS, and tools from the Hadoop ecosystem.
What this book covers
Chapter 1, Hadoop and Big Data, traces how Hadoop, since its beginnings in the previous decade, has played a pivotal role in making several Internet businesses successful with big data. This chapter covers a brief history of Hadoop and the story of its evolution. It covers the Hadoop architecture and the MapReduce data processing framework, introduces basic Hadoop programming in Java, and provides a detailed overview of the business cases covered in the following chapters of this book. This chapter builds the foundation for understanding the rest of the book.
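The chapter itself develops its MapReduce program in Java. Purely as a sketch of the mapper/reducer contract the chapter teaches, here is a minimal word-count pair in Python that could run under Hadoop Streaming; the file names mapper.py and reducer.py are our own illustration, not taken from the book:

#!/usr/bin/env python
# mapper.py -- emit a (word, 1) pair for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- sum the counts per word; Hadoop sorts the mapper output
# by key before the reduce step, so all lines for one word arrive together
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
        continue
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))
    current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Hadoop Streaming pipes each input split through the mapper, shuffles and sorts the intermediate pairs by key, and pipes them through the reducer, which is exactly the execution model the chapter explains for the Java API.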
Chapter 2, A 360-Degree View of the Customer, covers building a 360-degree view of the customer. A good 360-degree view requires the integration of data from various sources. The data sources are database management systems storing master data and transactional data. Other data sources might include data captured from social media feeds. In this chapter, we will be integrating data from CRM systems, web logs, and Twitter feeds to build the 360-degree view and present it using a simple web interface. We will learn about Apache Sqoop and Apache Hive in the process of building our solution.
Chapter 3, Building a Fraud Detection System, covers the building of a real-time fraud detection system. This system predicts whether a financial transaction could be fraudulent by applying a clustering algorithm on a stream of transactions. We will learn about the architecture of the system and the coding steps involved in building the system. We will learn about Apache Spark in the process of building our solution.
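As a taste of what this chapter builds, the following sketch clusters transaction feature vectors with K-means in the pyspark shell, where sc, the SparkContext, already exists; the HDFS path, the CSV format, and the choice of five clusters are illustrative assumptions, not taken from the book:

from pyspark.mllib.clustering import KMeans
from numpy import array

# Parse each CSV line into a numeric feature vector
data = sc.textFile("hdfs:///data/transactions.csv")
features = data.map(lambda line: array([float(x) for x in line.split(",")]))

# Fit a K-means model; k and maxIterations are illustrative choices
model = KMeans.train(features, k=5, maxIterations=20)

# A transaction far from its nearest cluster centre is a fraud candidate
def distance_to_centre(point):
    centre = model.clusterCenters[model.predict(point)]
    return float(((point - centre) ** 2).sum() ** 0.5)

print(features.map(distance_to_centre).take(5))

The same idea carries over to the streaming case: each incoming transaction is scored against the trained cluster centres, and unusually distant points are flagged for review.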
Chapter 4, Marketing Campaign Planning, shows how to build a system that can improve the effectiveness of marketing campaigns. This is a batch analytics system that uses historical campaign-response data to predict who is going to respond to a marketing folder (a printed promotional mailer). We will see how to build a predictive model and use it to predict who is going to respond to which folder in our marketing campaign. We will learn about BigML in the process of building our solution.
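To make the scoring step concrete: a downloaded tree model is, in essence, a nested set of feature tests ending in a predicted label, and scoring a customer is a walk from the root to a leaf. The sketch below is hypothetical; the field names, thresholds, and dictionary layout are our own illustration, not BigML's actual export format:

# A hypothetical decision tree: each inner node tests one field against
# a threshold; each leaf carries the predicted label
tree = {
    "field": "past_purchases", "threshold": 3,
    "below": {"label": "no-response"},
    "above": {
        "field": "coupon_discount", "threshold": 0.15,
        "below": {"label": "no-response"},
        "above": {"label": "response"},
    },
}

def score(node, customer):
    # Walk from the root to a leaf, following the branch each test selects
    while "label" not in node:
        branch = "below" if customer[node["field"]] < node["threshold"] else "above"
        node = node[branch]
    return node["label"]

print(score(tree, {"past_purchases": 5, "coupon_discount": 0.20}))  # response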
Chapter 5, Churn Detection, explains how to use Hadoop to predict which customers are likely to move over to another company. We will cover the business case of a mobile telecom provider who would like to detect the customers who are likely to churn. These customers can then be given special incentives to encourage them to stay with the same provider. We will apply Bayes' Theorem to calculate the likelihood of churn. The model for churn detection will be built using Hadoop. We will learn about writing MapReduce programs in Java in the process of building our solution.
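To make the arithmetic concrete, Bayes' Theorem states that P(churn | evidence) = P(evidence | churn) x P(churn) / P(evidence). A tiny Python sketch with made-up numbers, not taken from the book:

# Bayes' Theorem with illustrative numbers: the probability that a
# customer churns given that they have filed a complaint
p_churn = 0.10                  # prior: 10% of all customers churn
p_complaint_given_churn = 0.60  # 60% of churners filed a complaint
p_complaint = 0.20              # 20% of all customers filed a complaint

p_churn_given_complaint = p_complaint_given_churn * p_churn / p_complaint
print(p_churn_given_complaint)  # 0.30 -- a complaint triples the churn risk

The chapter's frequency-counting MapReduce program estimates such probabilities from the historical churn dataset instead of hand-picked numbers.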
Chapter 6, Analyze Sensor Data Using Hadoop, is about how to build a system to analyze sensor data. Nowadays, sensors are considered an important source of big data. We will learn how Hadoop and big-data technologies can be helpful in the Internet of Things (IoT) domain. IoT is a network of connected devices that generate data through sensors. We will build a system to monitor the quality of the environment, such as humidity and temperature, in a factory. We will introduce Apache Kafka, Grafana, and OpenTSDB tools in the process of building the solution.
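To give a flavour of the ingestion side, the sketch below publishes simulated temperature and humidity readings to a Kafka topic; it assumes the third-party kafka-python client and a broker on localhost, and the topic name and message format are illustrative, not taken from the chapter:

import json
import random
import time

from kafka import KafkaProducer  # assumes the kafka-python package

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit one simulated sensor reading per second
while True:
    reading = {
        "sensor_id": "factory-floor-1",
        "timestamp": int(time.time()),
        "temperature": round(random.uniform(18.0, 30.0), 1),
        "humidity": round(random.uniform(35.0, 60.0), 1),
    }
    producer.send("sensor-readings", reading)
    time.sleep(1)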
Chapter 7, Building a Data Lake, takes you through building a data lake using Hadoop and several other tools to import data in a data lake and provide secure access to the data. Data lakes are a popular business case for Hadoop. In a data lake, we store data from multiple sources to build a single source of data for the enterprise and build a security layer around it. We will learn about Apache Ranger, Apache Flume, and Apache Zeppelin in the process of building our solution.
Chapter 8, Future Directions, covers four separate topics that are relevant to Hadoop-based projects: building a Hadoop solutions team, Hadoop on the cloud, NoSQL databases, and in-memory databases. Unlike the other chapters, this chapter does not include any coding examples. These four topics are covered in essay form so that you can explore them further.
What you need for this book
Code and data samples have been provided for every chapter. We have used Hadoop version 2.7.x in this book. All the coding samples have been developed and tested on the stock (Apache Software Foundation) versions of Hadoop and the other tools, which you can download from the Apache Software Foundation website. In Chapter 2, A 360-Degree View of the Customer, we have used Hortonworks Data Platform (HDP) 2.3. HDP 2.3 bundles Hadoop and several other ecosystem tools in a convenient virtual machine image that can run on VirtualBox or VMware. You can download this virtual machine image from the Hortonworks website at http://hortonworks.com/downloads/#data-platform. Due to the fast-evolving nature of Hadoop and its ecosystem of tools, you might find that newer versions are available than those used in this book. The specific versions of the tools needed for the examples are mentioned in the chapters where they are first introduced.
Who this book is for
This book is intended for software developers, architects, and engineering managers who are evaluating Hadoop as a technology for building business solutions with big data. This book explains how the tools in the Hadoop ecosystem can be combined to create useful solutions, and it is therefore particularly valuable for readers who want to understand how various technologies can be integrated without first mastering any single tool in depth.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "You can also run the transmodel.py program using pyspark, the Spark Python command-line shell."
A block of code is set as follows:
#!/bin/bash
# Emit two fixed fields and a random number once per second
while true
do
    echo 1 2 $RANDOM
    sleep 1
done
Any command-line input or output is written as follows:
>>> from pyspark.mllib.clustering import KMeans, KMeansModel
>>> from numpy import array
New terms and important words are shown in bold.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.