Hadoop Blueprints

About this ebook

About This Book
  • Solve real-world business problems using Hadoop and other Big Data technologies
  • Build efficient data lakes in Hadoop, and develop systems for various business cases like improving marketing campaigns, fraud detection, and more
  • Power packed with six case studies to get you going with Hadoop for Business Intelligence
Who This Book Is For

If you are interested in building efficient business solutions using Hadoop, this is the book for you. This book assumes that you have basic knowledge of Hadoop, Java, and any scripting language.

Language: English
Release date: Sep 30, 2016
ISBN: 9781783980314


    Book preview

    Hadoop Blueprints - Anurag Shrivastava

    Table of Contents

    Hadoop Blueprints

    Credits

    About the Authors

    About the Reviewers

    www.PacktPub.com

    Why subscribe?

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Errata

    Piracy

    Questions

    1. Hadoop and Big Data

    The beginning of the big data problem

    Limitations of RDBMS systems

    Scaling out a database on Google

    Parallel processing of large datasets

    Building open source Hadoop

    Enterprise Hadoop

    Social media and mobile channels

    Data storage cost reduction

    Enterprise software vendors

    Pure Play Hadoop vendors

    Cloud Hadoop vendors

    The design of the Hadoop system

    The Hadoop Distributed File System (HDFS)

    Data organization in HDFS

    HDFS file management commands

    NameNode and DataNodes

    Metadata store in NameNode

    Preventing a single point of failure with Hadoop HA

    Checkpointing process

    Data Store on a DataNode

    Handshakes and heartbeats

    MapReduce

    The execution model of MapReduce Version 1

    Apache YARN

    Building a MapReduce Version 2 program

    Problem statement

    Solution workflow

    Getting the dataset

    Studying the dataset

    Cleaning the dataset

    Loading the dataset on the HDFS

    Starting with a MapReduce program

    Installing Eclipse

    Creating a project in Eclipse

    Coding and building a MapReduce program

    Run the MapReduce program locally

    Examine the result

    Run the MapReduce program on Hadoop

    Further processing of results

    Hadoop platform tools

    Data ingestion tools

    Data access tools

    Monitoring tools

    Data governance tools

    Big data use cases

    Creating a 360-degree view of a customer

    Fraud detection systems for banks

    Marketing campaign planning

    Churn detection in telecom

    Analyzing sensor data

    Building a data lake

    The architecture of Hadoop-based systems

    Lambda architecture

    Summary

    2. A 360-Degree View of the Customer

    Capturing business information

    Collecting data from data sources

    Creating a data processing approach

    Presenting the results

    Setting up the technology stack

    Tools used

    Installing Hortonworks Sandbox

    Creating user accounts

    Exploring HUE

    Exploring MySQL and the Hive command line

    Exploring Sqoop at the command line

    Test driving Hive and Sqoop

    Querying data using Hive

    Importing data in Hive using Sqoop

    Engineering the solution

    Datasets

    Loading customer master data into Hadoop

    Loading web logs into Hadoop

    Loading tweets into Hadoop

    Creating the 360-degree view

    Exporting data from Hadoop

    Presenting the view

    Building a web application

    Installing Node.js

    Coding the web application in Node.js

    Summary

    3. Building a Fraud Detection System

    Understanding the business problem

    Selecting and cleansing the dataset

    Finding relevant fields

    Machine learning for fraud detection

    Clustering as an unsupervised machine learning method

    Designing the high-level architecture

    Introducing Apache Spark

    Apache Spark architecture

    Resilient Distributed Datasets

    Transformation functions

    Actions

    Test driving Apache Spark

    Calculating the yearly average stock prices using Spark

    Apache Spark 2.X

    Understanding MLlib

    Test driving K-means using MLlib

    Creating our fraud detection model

    Building our K-means clustering model

    Processing the data

    Putting the fraud detection model to use

    Generating a data stream

    Processing the data stream using Spark streaming

    Putting the model to use

    Scaling the solution

    Summary

    4. Marketing Campaign Planning

    Creating the solution outline

    Supervised learning

    Tree-structure models for classification

    Finding the right dataset

    Setting up the solution architecture

    Coupon scan at POS

    Join and transform

    Train the classification model

    Scoring

    Mail merge

    Building the machine learning model

    Introducing BigML

    Model building steps

    Sign up as a user on the BigML site

    Upload the data file

    Creating the dataset

    Building the classification model

    Downloading the classification model

    Running the Model on Hadoop

    Creating the target list

    Post campaign activities

    Summary

    5. Churn Detection

    A business case for churn detection

    Creating the solution outline

    Building a predictive model using Hadoop

    Bayes' Theorem

    Playing with the Bayesian predictor

    Running a Node.js-based Bayesian predictor

    Understanding the predictor code

    Limitations of our solution

    Building a churn predictor using Hadoop

    Synthetic data generation tools

    Preparing a synthetic historical churn dataset

    The processing approach

    Running the MapReduce program

    Understanding the frequency counter code

    Putting the model to use

    Integrating the churn predictor

    Summary

    6. Analyze Sensor Data Using Hadoop

    A business case for sensor data analytics

    Creating the solution outline

    Technology stack

    Kafka

    Flume

    HDFS

    Hive

    OpenTSDB

    HBase

    Grafana

    Batch data analytics

    Loading streams of sensor data from Kafka topics to HDFS

    Using Hive to perform analytics on inserted data

    Data visualization in MS Excel

    Stream data analytics

    Loading streams of sensor data

    Data visualization using Grafana

    Summary

    7. Building a Data Lake

    Data lake building blocks

    Ingestion tier

    Storage tier

    Insights tier

    Ops facilities

    Limitations of open source Hadoop ecosystem tools

    Hadoop security

    HDFS permissions model

    Fine-grained permissions with HDFS ACLs

    Apache Ranger

    Installing Apache Ranger

    Test driving Apache Ranger

    Define services and access policies

    Examine the audit logs

    Viewing users and groups in Ranger

    Data Lake security with Apache Ranger

    Apache Flume

    Understanding the Design of Flume

    Installing Apache Flume

    Running Apache Flume

    Apache Zeppelin

    Installation of Apache Zeppelin

    Test driving Zeppelin

    Exploring data visualization features of Zeppelin

    Define the gold price movement table in Hive

    Load gold price history in the table

    Run a select query

    Plot price change per month

    Running the paragraph

    Zeppelin in Data Lake

    Technology stack for Data Lake

    Data Lake business requirements

    Understanding the business requirements

    Understanding the IT systems and security

    Designing the data pipeline

    Building the data pipeline

    Setting up the access control

    Synchronizing the users and groups in Ranger

    Setting up data access policies in Ranger

    Restricting the access in Zeppelin

    Testing our data pipeline

    Scheduling the data loading

    Refining the business requirements

    Implementing the new requirements

    Loading the stock holding data in Data Lake

    Restricting the access to stock holding data in Data Lake

    Testing the loaded data with Zeppelin

    Adding stock feed in the Data Lake

    Fetching data from Yahoo Service

    Configuring Flume

    Running Flume as Stock Feeder to Data Lake

    Transforming the data in Data Lake

    Growing Data Lake

    Summary

    8. Future Directions

    Hadoop solutions team

    The role of the data engineer

    Data science for non-experts

    From the data science model to business value

    Hadoop on Cloud

    Deploying Hadoop on cloud servers

    Using Hadoop as a service

    NoSQL databases

    Types of NoSQL databases

    Common observations about NoSQL databases

    In-memory databases

    Apache Ignite as an in-memory database

    Apache Ignite as a Hadoop accelerator

    Apache Spark versus Apache Ignite

    Summary

    Hadoop Blueprints

    Copyright © 2016 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: September 2016

    Production reference: 1270916

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham 

    B3 2PB, UK.

    ISBN 978-1-78398-030-7

    www.packtpub.com

    Credits

    About the Authors

    Anurag Shrivastava is an entrepreneur, blogger, and manager living in Almere, near Amsterdam, in the Netherlands. He started his IT journey by writing a small poker program on a mainframe computer 30 years ago, and he fell in love with software technology. In his 24-year career in IT, he has worked for companies of various sizes, ranging from Internet start-ups to large system integrators in Europe.

    Anurag kick-started the Agile software movement in North India when he set up the Indian business unit for the Dutch software consulting company Xebia. He led the growth of Xebia India as the managing director of the company for over 6 years and made the company a well-known name in the Agile consulting space in India. He also started the Agile NCR Conference in the Delhi National Capital Region, which has become a heavily visited annual event on Agile best practices.

    Anurag became active in the big data space when he joined ING Bank in Amsterdam as the manager of the customer intelligence department, where he set up their first Hadoop cluster and implemented several transformative technologies, such as Netezza and R, in his department. He is now active in payment technology and APIs, using technologies such as Node.js and MongoDB.

    Anurag loves to cycle on the reclaimed island of Flevoland in the Netherlands. He also likes listening to Hindi film music.

    I would like to thank my wife, Anjana, and daughter, Anika, for putting up with my late-night writing sessions and skipped weekend breaks. I would also like to thank my parents and teachers for their guidance in life.

    I would like to express my gratitude to my colleagues at Xebia, and to Daan Teunissen, from whom I learned the value of technical writing and who inspired me to take on this book project. I would like to thank all the mentors that I've had over the years. I would like to express thanks and gratitude to Amir Arooni, my boss at ING Bank, who gave me the time and opportunity to work on big data and, later, on this book. I also thank the Packt team and my coauthor, Tanmay, who provided help and guidance throughout the whole process.

    Tanmay Deshpande is a Hadoop and big data evangelist. He's interested in a wide range of technologies, such as Apache Spark, Hadoop, Hive, Pig, NoSQL databases, Mahout, Sqoop, Java, and cloud computing. He has vast experience in application development in various domains, such as finance, telecoms, manufacturing, security, and retail. He enjoys solving machine learning problems and spends his time reading anything he can get his hands on. He has a great interest in open source technologies and promotes them through his lectures. He has been invited to various computer science colleges to conduct brainstorming sessions with students on the latest technologies. Through his innovative thinking and dynamic leadership, he has successfully completed various projects. Tanmay is currently working with Schlumberger as the lead big data developer. Before Schlumberger, Tanmay worked with Lumiata, Symantec, and Infosys.

    Tanmay is the author of Hadoop Real-World Solutions Cookbook, Second Edition; DynamoDB Cookbook; and Mastering DynamoDB, all published by Packt Publishing.

    I would like to thank my family and the Almighty for supporting me throughout all my adventures.

    About the Reviewers

    Dedunu Dhananjaya is a senior software engineer in personalized learning and analytics at Pearson. He is interested in data science and analytics. Prior to Pearson, Dedunu worked at Zaizi, LIRNEasia, and WSO2. Currently, he is pursuing a master's degree in applied statistics at the University of Colombo.

    Wissem El Khlifi is the first Oracle ACE from Spain and an Oracle Certified Professional DBA with over 12 years of IT experience.

    He earned his computer science engineering degree from FST Tunisia and holds a master's in computer science, as well as one in big data science analytics and management, from UPC Barcelona. His areas of interest are Linux system administration, high-availability Oracle databases, NoSQL database management for big data, and big data analysis.

    His career has included the following roles: Oracle and Java analyst/programmer, Oracle DBA, architect, team leader, and big data scientist. He currently works as a senior database and applications engineer for Schneider Electric/APC. He writes numerous articles on his website, http://www.oracle-class.com, and his Twitter handle is @orawiss.

    Randal Scott King is the managing partner of Brilliant Data, a consulting firm specializing in data analytics. In his years of consulting, Scott has amassed an impressive list of clientele, from mid-market leaders to Fortune 500 household names. In addition to Hadoop Blueprints, he has also served as technical reviewer for other Packt Publishing books on big data and has authored the instructional videos Learning Hadoop 2 and Mastering Hadoop. Scott lives just outside Atlanta, GA, with his children. You can visit his blog at http://www.randalscottking.com.

    www.PacktPub.com

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www.packtpub.com/mapt

    Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

    Why subscribe?

    • Fully searchable across every book published by Packt
    • Copy and paste, print, and bookmark content
    • On demand and accessible via a web browser

    Preface

    This book covers the application of Hadoop and its ecosystem of tools to solving business problems. Hadoop has fast emerged as the leading big data platform and finds applications in many industries where massive datasets have to be stored and analyzed. Hadoop lowers the cost of investment in storage and supports the generation of new business insights that were not possible earlier because of the massive volumes and computing capacity required to process such information. This book covers several business cases, and each solution has been built using Hadoop, HDFS, and tools from the Hadoop ecosystem.

    What this book covers

    Chapter 1, Hadoop and Big Data, traces how Hadoop, from its beginnings in the previous decade, has played a pivotal role in making several Internet businesses successful with big data. This chapter covers a brief history and the story of the evolution of Hadoop. It covers the Hadoop architecture and the MapReduce data processing framework. It introduces basic Hadoop programming in Java and provides a detailed overview of the business cases covered in the following chapters of this book. This chapter builds the foundation for understanding the rest of the book.
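
    The book's introductory program is written in Java, but the shape of a MapReduce job is easy to see in any language. Purely as an illustration (this is not the chapter's program), here is a minimal word-count pair of scripts in the Hadoop Streaming style, where the mapper turns input lines into tab-separated key-value pairs and the reducer aggregates the values that Hadoop has grouped and sorted by key:

    #!/usr/bin/env python
    # mapper.py -- illustrative Hadoop Streaming mapper (not the book's Java code).
    # Emits one tab-separated (word, 1) pair for every word read on stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- illustrative reducer. Hadoop Streaming delivers mapper output
    # sorted by key, so all counts for a given word arrive consecutively.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

    Such scripts can be sanity-checked locally with cat input.txt | ./mapper.py | sort | ./reducer.py before they are submitted to a cluster.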

    Chapter 2, A 360-Degree View of the Customer, covers building a 360-degree view of the customer. A good 360-degree view requires the integration of data from various sources. The data sources are database management systems storing master data and transactional data. Other data sources might include data captured from social media feeds. In this chapter, we will be integrating data from CRM systems, web logs, and Twitter feeds to build the 360-degree view and present it using a simple web interface. We will learn about Apache Sqoop and Apache Hive in the process of building our solution.
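
    As a flavor of the ingestion step, a single Sqoop command can copy a relational table into Hive; the connection string, credentials, and table names below are placeholders for this sketch, not the values used in the chapter:

    sqoop import \
        --connect jdbc:mysql://localhost/crm \
        --username hue --password hue \
        --table customers \
        --hive-import --hive-table customers

    Behind the scenes, Sqoop generates a MapReduce job, so the import is parallelized across the cluster without any hand-written code.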

    Chapter 3, Building a Fraud Detection System, covers the building of a real-time fraud detection system. This system predicts whether a financial transaction could be fraudulent by applying a clustering algorithm on a stream of transactions. We will learn about the architecture of the system and the coding steps involved in building the system. We will learn about Apache Spark in the process of building our solution.
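
    The pyspark imports shown later in the Conventions section hint at the approach; as a minimal sketch (the feature vectors and k=2 are illustrative assumptions, not the book's model), Spark's MLlib can train a K-means model in a few lines inside the pyspark shell, where sc is the ready-made SparkContext:

    # Illustrative K-means with the Spark MLlib RDD API; all values are made up.
    from pyspark.mllib.clustering import KMeans
    from numpy import array

    transactions = sc.parallelize([
        array([100.0, 1.0]),    # e.g. amount plus an encoded time feature (assumed)
        array([102.0, 2.0]),
        array([9800.0, 3.0]),   # an unusually large transaction
    ])
    model = KMeans.train(transactions, k=2, maxIterations=10)
    print(model.predict(array([101.0, 1.5])))    # cluster id for a new transaction

    A transaction that lands far from every learned cluster center is then a natural candidate for a fraud alert.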

    Chapter 4, Marketing Campaign Planning, shows how to build a system that can improve the effectiveness of marketing campaigns. This system is a batch analytics system that uses historical campaign-response data to predict who is going to respond to a marketing folder (a mailed brochure). We will see how we can build a predictive model and use it to predict who is going to respond to which folder in our marketing campaign. We will learn about BigML in the process of building our solution.

    Chapter 5, Churn Detection, explains how to use Hadoop to predict which customers are likely to move over to another company. We will cover the business case of a mobile telecom provider who would like to detect the customers who are likely to churn. These customers are given special incentives so that they can stay with the same provider. We will apply Bayes' Theorem to calculate the likelihood of churn. The model for churn detection will be built using Hadoop. We will learn about writing MapReduce programs in Java in the process of building our solution.
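
    As a reminder of the arithmetic behind the model, Bayes' Theorem states that P(churn | evidence) = P(evidence | churn) × P(churn) / P(evidence). A tiny sketch with invented probabilities (not the book's dataset) shows how one piece of evidence updates the churn likelihood:

    # Bayes' Theorem with made-up numbers, purely illustrative.
    p_churn = 0.10                  # prior: 10% of customers churn (assumed)
    p_evidence_given_churn = 0.60   # churners often report dropped calls (assumed)
    p_evidence_given_stay = 0.20    # some loyal customers do too (assumed)

    # Total probability of seeing the evidence across both groups.
    p_evidence = (p_evidence_given_churn * p_churn
                  + p_evidence_given_stay * (1 - p_churn))

    posterior = p_evidence_given_churn * p_churn / p_evidence
    print(posterior)    # 0.25: the evidence lifts the churn likelihood from 10% to 25%

    The chapter's MapReduce frequency counter exists to compute such conditional frequencies at scale from the historical dataset.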

    Chapter 6, Analyze Sensor Data Using Hadoop, is about how to build a system to analyze sensor data. Nowadays, sensors are considered an important source of big data. We will learn how Hadoop and big-data technologies can be helpful in the Internet of Things (IoT) domain. IoT is a network of connected devices that generate data through sensors. We will build a system to monitor the quality of the environment, such as humidity and temperature, in a factory. We will introduce Apache Kafka, Grafana, and OpenTSDB tools in the process of building the solution.
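
    To give an idea of how a reading enters such a pipeline, a sensor sample can be pushed onto a Kafka topic straight from the shell; the broker address, topic name, and metric layout below are assumptions for this sketch:

    echo "factory.temperature 1475229200 21.7 room=assembly" | \
        kafka-console-producer.sh --broker-list localhost:9092 --topic sensor-data

    The metric line mimics the metric-timestamp-value-tags layout that OpenTSDB expects for time series points.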

    Chapter 7, Building a Data Lake, takes you through building a data lake using Hadoop and several other tools to import data into the lake and provide secure access to it. Data lakes are a popular business case for Hadoop. In a data lake, we store data from multiple sources to build a single source of data for the enterprise and build a security layer around it. We will learn about Apache Ranger, Apache Flume, and Apache Zeppelin in the process of building our solution.
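
    As a taste of the ingestion tier, a Flume agent is assembled declaratively in a properties file by naming a source, a channel, and a sink. A minimal sketch (the agent name, command, and HDFS path are illustrative, not the chapter's configuration) that tails a feed file into the lake might read:

    # Illustrative Flume agent: tail a local file into HDFS.
    agent.sources = src1
    agent.channels = ch1
    agent.sinks = sink1

    agent.sources.src1.type = exec
    agent.sources.src1.command = tail -F /var/log/feed.log
    agent.sources.src1.channels = ch1

    agent.channels.ch1.type = memory

    agent.sinks.sink1.type = hdfs
    agent.sinks.sink1.hdfs.path = /datalake/raw/feed
    agent.sinks.sink1.channel = ch1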

    Chapter 8, Future Directions, covers four separate topics that are relevant to Hadoop-based projects: building a Hadoop solutions team, Hadoop on the cloud, NoSQL databases, and in-memory databases. Unlike the other chapters, this chapter does not include any coding examples. These four topics are covered in essay form so that you can explore them further.

    What you need for this book

    Code and data samples have been provided for every chapter. We have used Hadoop version 2.7.x in this book. All the coding samples have been developed and tested on the stock (Apache Software Foundation) version of Hadoop and other tools. You can download these tools from the Apache Software Foundation website. In Chapter 2, A 360-Degree View of the Customer, we have used Hortonworks Data Platform (HDP) 2.3. HDP 2.3 bundles Hadoop and several other ecosystem tools in a convenient virtual machine image that can run on VirtualBox or VMware. You can download this virtual machine image from the Hortonworks website at http://hortonworks.com/downloads/#data-platform. Due to the fast-evolving nature of Hadoop and its ecosystem of tools, you might find that newer versions are available than the ones used in this book. The specific versions of the tools needed for the examples are mentioned in the chapters where they are first introduced.
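
    Once Hadoop is installed, a quick way to confirm that you are on a 2.7.x build before trying the examples is to run the following command, which prints the release number and build details:

    hadoop version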

    Who this book is for

    This book is intended for software developers, architects, and engineering managers who are evaluating Hadoop as a technology for building business solutions with big data. This book explains how the tools in the Hadoop ecosystem can be combined to create a useful solution, and it is therefore particularly useful for readers who would like to understand how various technologies can be integrated without having to learn any single tool in depth.

    Conventions

    In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

    Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "You can also run the transmodel.py program using the Python command-line interpreter pyspark."

    A block of code is set as follows:

    #!/bin/bash
    while [ true ]
    do
        echo 1 2 $RANDOM
        sleep 1
    done

    Any command-line input or output is written as follows:

    >>> from pyspark.mllib.clustering import KMeans, KMeansModel
    >>> from numpy import array

    New terms and important words are shown in bold.

    Note

    Warnings or important notes appear in a box like this.

    Tip

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

    To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
