Hadoop Blueprints
About this ebook
- Solve real-world business problems using Hadoop and other Big Data technologies
- Build efficient data lakes in Hadoop, and develop systems for various business cases like improving marketing campaigns, fraud detection, and more
- Power packed with six case studies to get you going with Hadoop for Business Intelligence
If you are interested in building efficient business solutions using Hadoop, this is the book for you. This book assumes that you have basic knowledge of Hadoop, Java, and any scripting language.
Hadoop Blueprints - Anurag Shrivastava
Table of Contents
Hadoop Blueprints
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Hadoop and Big Data
The beginning of the big data problem
Limitations of RDBMS systems
Scaling out a database on Google
Parallel processing of large datasets
Building open source Hadoop
Enterprise Hadoop
Social media and mobile channels
Data storage cost reduction
Enterprise software vendors
Pure Play Hadoop vendors
Cloud Hadoop vendors
The design of the Hadoop system
The Hadoop Distributed File System (HDFS)
Data organization in HDFS
HDFS file management commands
NameNode and DataNodes
Metadata store in NameNode
Preventing a single point of failure with Hadoop HA
Checkpointing process
Data Store on a DataNode
Handshakes and heartbeats
MapReduce
The execution model of MapReduce Version 1
Apache YARN
Building a MapReduce Version 2 program
Problem statement
Solution workflow
Getting the dataset
Studying the dataset
Cleaning the dataset
Loading the dataset on the HDFS
Starting with a MapReduce program
Installing Eclipse
Creating a project in Eclipse
Coding and building a MapReduce program
Run the MapReduce program locally
Examine the result
Run the MapReduce program on Hadoop
Further processing of results
Hadoop platform tools
Data ingestion tools
Data access tools
Monitoring tools
Data governance tools
Big data use cases
Creating a 360-degree view of a customer
Fraud detection systems for banks
Marketing campaign planning
Churn detection in telecom
Analyzing sensor data
Building a data lake
The architecture of Hadoop-based systems
Lambda architecture
Summary
2. A 360-Degree View of the Customer
Capturing business information
Collecting data from data sources
Creating a data processing approach
Presenting the results
Setting up the technology stack
Tools used
Installing Hortonworks Sandbox
Creating user accounts
Exploring HUE
Exploring MySQL and the Hive command line
Exploring Sqoop at the command line
Test driving Hive and Sqoop
Querying data using Hive
Importing data in Hive using Sqoop
Engineering the solution
Datasets
Loading customer master data into Hadoop
Loading web logs into Hadoop
Loading tweets into Hadoop
Creating the 360-degree view
Exporting data from Hadoop
Presenting the view
Building a web application
Installing Node.js
Coding the web application in Node.js
Summary
3. Building a Fraud Detection System
Understanding the business problem
Selecting and cleansing the dataset
Finding relevant fields
Machine learning for fraud detection
Clustering as an unsupervised machine learning method
Designing the high-level architecture
Introducing Apache Spark
Apache Spark architecture
Resilient Distributed Datasets
Transformation functions
Actions
Test driving Apache Spark
Calculating the yearly average stock prices using Spark
Apache Spark 2.X
Understanding MLlib
Test driving K-means using MLlib
Creating our fraud detection model
Building our K-means clustering model
Processing the data
Putting the fraud detection model to use
Generating a data stream
Processing the data stream using Spark streaming
Putting the model to use
Scaling the solution
Summary
4. Marketing Campaign Planning
Creating the solution outline
Supervised learning
Tree-structure models for classification
Finding the right dataset
Setting up the solution architecture
Coupon scan at POS
Join and transform
Train the classification model
Scoring
Mail merge
Building the machine learning model
Introducing BigML
Model building steps
Sign up as a user on the BigML site
Upload the data file
Creating the dataset
Building the classification model
Downloading the classification model
Running the Model on Hadoop
Creating the target list
Post campaign activities
Summary
5. Churn Detection
A business case for churn detection
Creating the solution outline
Building a predictive model using Hadoop
Bayes' Theorem
Playing with the Bayesian predictor
Running a Node.js-based Bayesian predictor
Understanding the predictor code
Limitations of our solution
Building a churn predictor using Hadoop
Synthetic data generation tools
Preparing a synthetic historical churn dataset
The processing approach
Running the MapReduce program
Understanding the frequency counter code
Putting the model to use
Integrating the churn predictor
Summary
6. Analyze Sensor Data Using Hadoop
A business case for sensor data analytics
Creating the solution outline
Technology stack
Kafka
Flume
HDFS
Hive
OpenTSDB
HBase
Grafana
Batch data analytics
Loading streams of sensor data from Kafka topics to HDFS
Using Hive to perform analytics on inserted data
Data visualization in MS Excel
Stream data analytics
Loading streams of sensor data
Data visualization using Grafana
Summary
7. Building a Data Lake
Data lake building blocks
Ingestion tier
Storage tier
Insights tier
Ops facilities
Limitations of open source Hadoop ecosystem tools
Hadoop security
HDFS permissions model
Fine-grained permissions with HDFS ACLs
Apache Ranger
Installing Apache Ranger
Test driving Apache Ranger
Define services and access policies
Examine the audit logs
Viewing users and groups in Ranger
Data Lake security with Apache Ranger
Apache Flume
Understanding the Design of Flume
Installing Apache Flume
Running Apache Flume
Apache Zeppelin
Installation of Apache Zeppelin
Test driving Zeppelin
Exploring data visualization features of Zeppelin
Define the gold price movement table in Hive
Load gold price history into the table
Run a select query
Plot price change per month
Running the paragraph
Zeppelin in Data Lake
Technology stack for Data Lake
Data Lake business requirements
Understanding the business requirements
Understanding the IT systems and security
Designing the data pipeline
Building the data pipeline
Setting up the access control
Synchronizing the users and groups in Ranger
Setting up data access policies in Ranger
Restricting the access in Zeppelin
Testing our data pipeline
Scheduling the data loading
Refining the business requirements
Implementing the new requirements
Loading the stock holding data in Data Lake
Restricting the access to stock holding data in Data Lake
Testing the Loaded Data with Zeppelin
Adding stock feed in the Data Lake
Fetching data from Yahoo Service
Configuring Flume
Running Flume as Stock Feeder to Data Lake
Transforming the data in Data Lake
Growing Data Lake
Summary
8. Future Directions
Hadoop solutions team
The role of the data engineer
Data science for non-experts
From the data science model to business value
Hadoop on Cloud
Deploying Hadoop on cloud servers
Using Hadoop as a service
NoSQL databases
Types of NoSQL databases
Common observations about NoSQL databases
In-memory databases
Apache Ignite as an in-memory database
Apache Ignite as a Hadoop accelerator
Apache Spark versus Apache Ignite
Summary
Hadoop Blueprints
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2016
Production reference: 1270916
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78398-030-7
www.packtpub.com
Credits
About the Authors
Anurag Shrivastava is an entrepreneur, blogger, and manager living in Almere, near Amsterdam in the Netherlands. He started his IT journey by writing a small poker program on a mainframe computer 30 years ago, and he fell in love with software technology. In his 24-year career in IT, he has worked for companies of various sizes, ranging from Internet start-ups to large system integrators in Europe.
Anurag kick-started the Agile software movement in North India when he set up the Indian business unit for the Dutch software consulting company Xebia. He led the growth of Xebia India as the managing director of the company for over 6 years and made the company a well-known name in the Agile consulting space in India. He also started the Agile NCR Conference in the New Delhi Capital Region, which has become a heavily visited annual event on Agile best practices.
Anurag became active in the big data space when he joined ING Bank in Amsterdam as the manager of the customer intelligence department, where he set up the bank's first Hadoop cluster and implemented several transformative technologies, such as Netezza and R, in his department. He is now active in payment technology and APIs, using technologies such as Node.js and MongoDB.
Anurag loves to cycle on the reclaimed island of Flevoland in the Netherlands. He also likes listening to Hindi film music.
I would like to thank my wife, Anjana, and daughter, Anika, for putting up with my late-night writing sessions and skipping of weekend breaks. I also would like to thank my parents and teachers for their guidance in life.
I would like to express my gratitude to my colleagues at Xebia, and to Daan Teunissen, from whom I learned the value of technical writing and who inspired me to work on this book project. I would like to thank all the mentors I have had over the years. I would like to express thanks and gratitude to Amir Arooni, my boss at ING Bank, who gave me the time and opportunity to work on big data and, later, on this book. I also thank the Packt team and my coauthor, Tanmay, who provided help and guidance throughout the whole process.
Tanmay Deshpande is a Hadoop and big data evangelist. He's interested in a wide range of technologies, such as Apache Spark, Hadoop, Hive, Pig, NoSQL databases, Mahout, Sqoop, Java, and cloud computing. He has vast experience in application development in various domains, such as finance, telecoms, manufacturing, security, and retail. He enjoys solving machine learning problems and spends his time reading anything he can get his hands on. He has a great interest in open source technologies and promotes them through his lectures. He has been invited to various computer science colleges to conduct brainstorming sessions with students on the latest technologies. Through his innovative thinking and dynamic leadership, he has successfully completed various projects. Tanmay is currently working with Schlumberger as the lead big data developer. Before Schlumberger, Tanmay worked with Lumiata, Symantec, and Infosys.
Tanmay is the author of Hadoop Real-World Solutions Cookbook, Second Edition, DynamoDB Cookbook, and Mastering DynamoDB, all by Packt Publishing.
I would like to thank my family and the Almighty for supporting me throughout all my adventures.
About the Reviewers
Dedunu Dhananjaya is a senior software engineer in personalized learning and analytics at Pearson. He is interested in data science and analytics. Prior to Pearson, Dedunu worked at Zaizi, LIRNEasia, and WSO2. He is currently reading for his master's in applied statistics at the University of Colombo.
Wissem El Khlifi is the first Oracle ACE from Spain and an Oracle Certified Professional DBA with over 12 years of IT experience.
He earned his computer science engineering degree from FST Tunisia and holds master's degrees in computer science and in big data science analytics and management from UPC Barcelona. His areas of interest are Linux system administration, high-availability Oracle databases, NoSQL database management for big data, and big data analysis.
His career has included the following roles: Oracle and Java analyst/programmer, Oracle DBA, architect, team leader, and big data scientist. He currently works as a senior database and applications engineer for Schneider Electric/APC. He writes numerous articles on his website, http://www.oracle-class.com, and his Twitter handle is @orawiss.
Randal Scott King is the managing partner of Brilliant Data, a consulting firm specializing in data analytics. In his years of consulting, Scott has amassed an impressive list of clientele, from mid-market leaders to Fortune 500 household names. In addition to Hadoop Blueprints, he has also served as technical reviewer for other Packt Publishing books on big data and has authored the instructional videos Learning Hadoop 2 and Mastering Hadoop. Scott lives just outside Atlanta, GA, with his children. You can visit his blog at http://www.randalscottking.com.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Preface
This book covers the application of Hadoop and its ecosystem of tools to solve business problems. Hadoop has fast emerged as the leading big data platform and finds application in many industries where massive datasets have to be stored and analyzed. Hadoop lowers the cost of investment in storage and supports the generation of new business insights that were previously out of reach because of the massive data volumes and computing capacity required to process them. This book covers several business cases and builds a solution to a business problem in each. Every solution covered in this book has been built using Hadoop, HDFS, and tools from the Hadoop ecosystem.
What this book covers
Chapter 1, Hadoop and Big Data, traces how Hadoop, since its beginnings in the previous decade, has played a pivotal role in making several Internet businesses successful with big data. This chapter covers a brief history of Hadoop and the story of its evolution. It covers the Hadoop architecture and the MapReduce data processing framework, introduces basic Hadoop programming in Java, and provides a detailed overview of the business cases covered in the following chapters of this book. This chapter builds the foundation for understanding the rest of the book.
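The chapter itself develops its MapReduce program in Java. Purely as a sketch of the mapper/reducer contract the chapter teaches, here is a minimal word-count pair in Python that could run under Hadoop Streaming; the file names mapper.py and reducer.py are our own illustration, not taken from the book:

#!/usr/bin/env python
# mapper.py -- emit a (word, 1) pair for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- sum the counts per word; Hadoop sorts the mapper output
# by key before the reduce step, so all lines for one word arrive together
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
        continue
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))
    current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Hadoop Streaming pipes each input split through the mapper, shuffles and sorts the intermediate pairs by key, and pipes them through the reducer, which is exactly the execution model the chapter explains for the Java API.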
Chapter 2, A 360-Degree View of the Customer, covers building a 360-degree view of the customer. A good 360-degree view requires the integration of data from various sources. The data sources are database management systems storing master data and transactional data. Other data sources might include data captured from social media feeds. In this chapter, we will be integrating data from CRM systems, web logs, and Twitter feeds to build the 360-degree view and present it using a simple web interface. We will learn about Apache Sqoop and Apache Hive in the process of building our solution.
Chapter 3, Building a Fraud Detection System, covers the building of a real-time fraud detection system. This system predicts whether a financial transaction could be fraudulent by applying a clustering algorithm on a stream of transactions. We will learn about the architecture of the system and the coding steps involved in building the system. We will learn about Apache Spark in the process of building our solution.
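As a taste of what this chapter builds, the following sketch clusters transaction feature vectors with K-means in the pyspark shell, where sc, the SparkContext, already exists; the HDFS path, the CSV format, and the choice of five clusters are illustrative assumptions, not taken from the book:

from pyspark.mllib.clustering import KMeans
from numpy import array

# Parse each CSV line into a numeric feature vector
data = sc.textFile("hdfs:///data/transactions.csv")
features = data.map(lambda line: array([float(x) for x in line.split(",")]))

# Fit a K-means model; k and maxIterations are illustrative choices
model = KMeans.train(features, k=5, maxIterations=20)

# A transaction far from its nearest cluster centre is a fraud candidate
def distance_to_centre(point):
    centre = model.clusterCenters[model.predict(point)]
    return float(((point - centre) ** 2).sum() ** 0.5)

print(features.map(distance_to_centre).take(5))

The same idea carries over to the streaming case: each incoming transaction is scored against the trained cluster centres, and unusually distant points are flagged for review.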
Chapter 4, Marketing Campaign Planning, shows how to build a system that can improve the effectiveness of marketing campaigns. This is a batch analytics system that uses historical campaign-response data to predict who is going to respond to a marketing folder (a printed promotional mailer). We will see how to build a predictive model and use it to predict who is going to respond to which folder in our marketing campaign. We will learn about BigML in the process of building our solution.
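To make the scoring step concrete: a downloaded tree model is, in essence, a nested set of feature tests ending in a predicted label, and scoring a customer is a walk from the root to a leaf. The sketch below is hypothetical; the field names, thresholds, and dictionary layout are our own illustration, not BigML's actual export format:

# A hypothetical decision tree: each inner node tests one field against
# a threshold; each leaf carries the predicted label
tree = {
    "field": "past_purchases", "threshold": 3,
    "below": {"label": "no-response"},
    "above": {
        "field": "coupon_discount", "threshold": 0.15,
        "below": {"label": "no-response"},
        "above": {"label": "response"},
    },
}

def score(node, customer):
    # Walk from the root to a leaf, following the branch each test selects
    while "label" not in node:
        branch = "below" if customer[node["field"]] < node["threshold"] else "above"
        node = node[branch]
    return node["label"]

print(score(tree, {"past_purchases": 5, "coupon_discount": 0.20}))  # response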
Chapter 5, Churn Detection, explains how to use Hadoop to predict which customers are likely to move over to another company. We will cover the business case of a mobile telecom provider who would like to detect the customers who are likely to churn. These customers can then be given special incentives to encourage them to stay with the same provider. We will apply Bayes' Theorem to calculate the likelihood of churn. The model for churn detection will be built using Hadoop. We will learn about writing MapReduce programs in Java in the process of building our solution.
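To make the arithmetic concrete, Bayes' Theorem states that P(churn | evidence) = P(evidence | churn) x P(churn) / P(evidence). A tiny Python sketch with made-up numbers, not taken from the book:

# Bayes' Theorem with illustrative numbers: the probability that a
# customer churns given that they have filed a complaint
p_churn = 0.10                  # prior: 10% of all customers churn
p_complaint_given_churn = 0.60  # 60% of churners filed a complaint
p_complaint = 0.20              # 20% of all customers filed a complaint

p_churn_given_complaint = p_complaint_given_churn * p_churn / p_complaint
print(p_churn_given_complaint)  # 0.30 -- a complaint triples the churn risk

The chapter's frequency-counting MapReduce program estimates such probabilities from the historical churn dataset instead of hand-picked numbers.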
Chapter 6, Analyze Sensor Data Using Hadoop, is about how to build a system to analyze sensor data. Nowadays, sensors are considered an important source of big data. We will learn how Hadoop and big-data technologies can be helpful in the Internet of Things (IoT) domain. IoT is a network of connected devices that generate data through sensors. We will build a system to monitor the quality of the environment, such as humidity and temperature, in a factory. We will introduce Apache Kafka, Grafana, and OpenTSDB tools in the process of building the solution.
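To give a flavour of the ingestion side, the sketch below publishes simulated temperature and humidity readings to a Kafka topic; it assumes the third-party kafka-python client and a broker on localhost, and the topic name and message format are illustrative, not taken from the chapter:

import json
import random
import time

from kafka import KafkaProducer  # assumes the kafka-python package

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit one simulated sensor reading per second
while True:
    reading = {
        "sensor_id": "factory-floor-1",
        "timestamp": int(time.time()),
        "temperature": round(random.uniform(18.0, 30.0), 1),
        "humidity": round(random.uniform(35.0, 60.0), 1),
    }
    producer.send("sensor-readings", reading)
    time.sleep(1)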
Chapter 7, Building a Data Lake, takes you through building a data lake using Hadoop and several other tools to import data in a data lake and provide secure access to the data. Data lakes are a popular business case for Hadoop. In a data lake, we store data from multiple sources to build a single source of data for the enterprise and build a security layer around it. We will learn about Apache Ranger, Apache Flume, and Apache Zeppelin in the process of building our solution.
Chapter 8, Future Directions, covers four separate topics that are relevant to Hadoop-based projects: building a Hadoop solutions team, Hadoop on the cloud, NoSQL databases, and in-memory databases. Unlike the other chapters, this chapter does not include any coding examples. These four topics are covered in essay form so that you can explore them further.
What you need for this book
Code and data samples have been provided for every chapter. We have used Hadoop version 2.7.x in this book. All the coding samples have been developed and tested on the stock (Apache Software Foundation) versions of Hadoop and the other tools, which you can download from the Apache Software Foundation website. In Chapter 2, A 360-Degree View of the Customer, we have used Hortonworks Data Platform (HDP) 2.3. HDP 2.3 bundles Hadoop and several other ecosystem tools in a convenient virtual machine image that can run on VirtualBox or VMware. You can download this virtual machine image from the Hortonworks website at http://hortonworks.com/downloads/#data-platform. Due to the fast-evolving nature of Hadoop and its ecosystem of tools, you might find that newer versions are available than those used in this book. The specific versions of the tools needed for the examples are mentioned in the chapters where they are first introduced.
Who this book is for
This book is intended for software developers, architects, and engineering managers who are evaluating Hadoop as a technology for building business solutions with big data. This book explains how the tools in the Hadoop ecosystem can be combined to create useful solutions, and it is therefore particularly valuable for readers who want to understand how various technologies can be integrated without first mastering any single tool in depth.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "You can also run the transmodel.py program using pyspark, the Spark Python command-line shell."
A block of code is set as follows:
#!/bin/bash
# Emit two fixed fields and a random number once per second
while true
do
    echo 1 2 $RANDOM
    sleep 1
done
Any command-line input or output is written as follows:
>>> from pyspark.mllib.clustering import KMeans, KMeansModel
>>> from numpy import array
New terms and important words are shown in bold.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.