HBase in Action
By Amandeep Khurana and Nick Dimiduk
About the Technology
HBase is a NoSQL storage system designed for fast, random access to large volumes of data. It runs on commodity hardware and scales smoothly from modest datasets to billions of rows and millions of columns.
About this Book
HBase in Action is an experience-driven guide that shows you how to design, build, and run applications using HBase. First, it introduces you to the fundamentals of handling big data. Then, you'll explore HBase with the help of real applications and code samples and with just enough theory to back up the practical techniques. You'll take advantage of the MapReduce processing framework and benefit from seeing HBase best practices in action.
Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. Also available is all code from the book.
What's Inside
- When and how to use HBase
- Practical examples
- Design patterns for scalable data systems
- Deployment, integration, and design
Written for developers and architects familiar with data storage and processing. No prior knowledge of HBase, Hadoop, or MapReduce is required.
Table of Contents
PART 1 HBASE FUNDAMENTALS
- Introducing HBase
- Getting started
- Distributed HBase, HDFS, and MapReduce
PART 2 ADVANCED CONCEPTS
- HBase table design
- Extending HBase with coprocessors
- Alternative HBase clients
PART 3 EXAMPLE APPLICATIONS
- HBase by example: OpenTSDB
- Scaling GIS on HBase
PART 4 OPERATIONALIZING HBASE
- Deploying HBase
- Operations
Amandeep Khurana
Amandeep Khurana is a Solutions Architect at Cloudera, where he builds solutions based on the Hadoop ecosystem. He was previously part of the Amazon Elastic MapReduce team.
Copyright
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email:
©2013 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
ISBN 9781617290527
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 17 16 15 14 13 12
Brief Table of Contents
Copyright
Brief Table of Contents
Table of Contents
Foreword
Letter to the HBase Community
Preface
Acknowledgments
About this Book
About the Authors
About the Cover Illustration
1. HBase fundamentals
Chapter 1. Introducing HBase
Chapter 2. Getting started
Chapter 3. Distributed HBase, HDFS, and MapReduce
2. Advanced concepts
Chapter 4. HBase table design
Chapter 5. Extending HBase with coprocessors
Chapter 6. Alternative HBase clients
3. Example applications
Chapter 7. HBase by example: OpenTSDB
Chapter 8. Scaling GIS on HBase
4. Operationalizing HBase
Chapter 9. Deploying HBase
Chapter 10. Operations
Appendix A. Exploring the HBase system
Appendix B. More about the workings of HDFS
Index
List of Figures
List of Tables
List of Listings
Table of Contents
Copyright
Brief Table of Contents
Table of Contents
Foreword
Letter to the HBase Community
Preface
Acknowledgments
About this Book
About the Authors
About the Cover Illustration
1. HBase fundamentals
Chapter 1. Introducing HBase
1.1. Data-management systems: a crash course
1.1.1. Hello, Big Data
1.1.2. Data innovation
1.1.3. The rise of HBase
1.2. HBase use cases and success stories
1.2.1. The canonical web-search problem: the reason for Bigtable’s invention
1.2.2. Capturing incremental data
1.2.3. Content serving
1.2.4. Information exchange
1.3. Hello HBase
1.3.1. Quick install
1.3.2. Interacting with the HBase shell
1.3.3. Storing data
1.4. Summary
Chapter 2. Getting started
2.1. Starting from scratch
2.1.1. Create a table
2.1.2. Examine table schema
2.1.3. Establish a connection
2.1.4. Connection management
2.2. Data manipulation
2.2.1. Storing data
2.2.2. Modifying data
2.2.3. Under the hood: the HBase write path
2.2.4. Reading data
2.2.5. Under the hood: the HBase read path
2.2.6. Deleting data
2.2.7. Compactions: HBase housekeeping
2.2.8. Versioned data
2.2.9. Data model recap
2.3. Data coordinates
2.4. Putting it all together
2.5. Data models
2.5.1. Logical model: sorted map of maps
2.5.2. Physical model: column family oriented
2.6. Table scans
2.6.1. Designing tables for scans
2.6.2. Executing a scan
2.6.3. Scanner caching
2.6.4. Applying filters
2.7. Atomic operations
2.8. ACID semantics
2.9. Summary
Chapter 3. Distributed HBase, HDFS, and MapReduce
3.1. A case for MapReduce
3.1.1. Latency vs. throughput
3.1.2. Serial execution has limited throughput
3.1.3. Improved throughput with parallel execution
3.1.4. MapReduce: maximum throughput with distributed parallelism
3.2. An overview of Hadoop MapReduce
3.2.1. MapReduce data flow explained
3.2.2. MapReduce under the hood
3.3. HBase in distributed mode
3.3.1. Splitting and distributing big tables
3.3.2. How do I find my region?
3.3.3. How do I find the -ROOT- table?
3.4. HBase and MapReduce
3.4.1. HBase as a source
3.4.2. HBase as a sink
3.4.3. HBase as a shared resource
3.5. Putting it all together
3.5.1. Writing a MapReduce application
3.5.2. Running a MapReduce application
3.6. Availability and reliability at scale
Availability
Reliability and Durability
3.6.1. HDFS as the underlying storage
3.7. Summary
2. Advanced concepts
Chapter 4. HBase table design
4.1. How to approach schema design
4.1.1. Modeling for the questions
4.1.2. Defining requirements: more work up front always pays
4.1.3. Modeling for even distribution of data and load
4.1.4. Targeted data access
4.2. De-normalization is the word in HBase land
4.3. Heterogeneous data in the same table
4.4. Rowkey design strategies
4.5. I/O considerations
4.5.1. Optimized for writes
4.5.2. Optimized for reads
4.5.3. Cardinality and rowkey structure
4.6. From relational to non-relational
4.6.1. Some basic concepts
4.6.2. Nested entities
4.6.3. Some things don’t map
4.7. Advanced column family configurations
4.7.1. Configurable block size
4.7.2. Block cache
4.7.3. Aggressive caching
4.7.4. Bloom filters
4.7.5. TTL
4.7.6. Compression
4.7.7. Cell versioning
4.8. Filtering data
4.8.1. Implementing a filter
4.8.2. Prebundled filters
4.9. Summary
Chapter 5. Extending HBase with coprocessors
5.1. The two kinds of coprocessors
5.1.1. Observer coprocessors
5.1.2. Endpoint Coprocessors
5.2. Implementing an observer
5.2.1. Modifying the schema
5.2.2. Starting with the Base
5.2.3. Installing your observer
5.2.4. Other installation options
5.3. Implementing an endpoint
5.3.1. Defining an interface for the endpoint
5.3.2. Implementing the endpoint server
5.3.3. Implement the endpoint client
5.3.4. Deploying the endpoint server
5.3.5. Try it!
5.4. Summary
Chapter 6. Alternative HBase clients
6.1. Scripting the HBase shell from UNIX
6.1.1. Preparing the HBase shell
6.1.2. Script table schema from the UNIX shell
6.2. Programming the HBase shell using JRuby
6.2.1. Preparing the HBase shell
6.2.2. Interacting with the TwitBase users table
6.3. HBase over REST
6.3.1. Launching the HBase REST service
6.3.2. Interacting with the TwitBase users table
6.4. Using the HBase Thrift gateway from Python
6.4.1. Generating the HBase Thrift client library for Python
6.4.2. Launching the HBase Thrift service
6.4.3. Scanning the TwitBase users table
6.5. Asynchbase: an alternative Java HBase client
6.5.1. Creating an asynchbase project
6.5.2. Changing TwitBase passwords
6.5.3. Try it out
6.6. Summary
3. Example applications
Chapter 7. HBase by example: OpenTSDB
7.1. An overview of OpenTSDB
7.1.1. Challenge: infrastructure monitoring
7.1.2. Data: time series
7.1.3. Storage: HBase
7.2. Designing an HBase application
7.2.1. Schema design
7.2.2. Application architecture
7.3. Implementing an HBase application
7.3.1. Storing data
7.3.2. Querying data
7.4. Summary
Chapter 8. Scaling GIS on HBase
8.1. Working with geographic data
8.2. Designing a spatial index
8.2.1. Starting with a compound rowkey
8.2.2. Introducing the geohash
8.2.3. Understand the geohash
8.2.4. Using the geohash as a spatially aware rowkey
8.3. Implementing the nearest-neighbors query
8.4. Pushing work server-side
8.4.1. Creating a geohash scan from a query polygon
8.4.2. Within query take 1: client side
8.4.3. Within query take 2: WithinFilter
8.5. Summary
4. Operationalizing HBase
Chapter 9. Deploying HBase
9.1. Planning your cluster
9.1.1. Prototype cluster
9.1.2. Small production cluster (10–20 servers)
9.1.3. Medium production cluster (up to ~50 servers)
9.1.4. Large production cluster (>~50 servers)
9.1.5. Hadoop Master nodes
9.1.6. HBase Master
9.1.7. Hadoop DataNodes and HBase RegionServers
9.1.8. ZooKeeper(s)
9.1.9. What about the cloud?
9.2. Deploying software
9.2.1. Whirr: deploying in the cloud
9.3. Distributions
9.3.1. Using the stock Apache distribution
9.3.2. Using Cloudera’s CDH distribution
9.4. Configuration
9.4.1. HBase configurations
9.4.2. Hadoop configuration parameters relevant to HBase
9.4.3. Operating system configurations
9.5. Managing the daemons
9.6. Summary
Chapter 10. Operations
10.1. Monitoring your cluster
10.1.1. How HBase exposes metrics
10.1.2. Collecting and graphing the metrics
10.1.3. The metrics HBase exposes
10.1.4. Application-side monitoring
10.2. Performance of your HBase cluster
10.2.1. Performance testing
10.2.2. What impacts HBase’s performance?
10.2.3. Tuning dependency systems
10.2.4. Tuning HBase
10.3. Cluster management
10.3.1. Starting and stopping HBase
10.3.2. Graceful stop and decommissioning nodes
10.3.3. Adding nodes
10.3.4. Rolling restarts and upgrading
10.3.5. bin/hbase and the HBase shell
10.3.6. Maintaining consistency—hbck
10.3.7. Viewing HFiles and HLogs
10.3.8. Presplitting tables
10.4. Backup and replication
10.4.1. Inter-cluster replication
10.4.2. Backup using MapReduce jobs
10.4.3. Backing up the root directory
10.5. Summary
Appendix A. Exploring the HBase system
A.1. Exploring ZooKeeper
A.2. Exploring -ROOT-
A.3. Exploring .META.
Appendix B. More about the workings of HDFS
B.1. Distributed file systems
B.2. Separating metadata and data: NameNode and DataNode
B.3. HDFS write path
B.4. HDFS read path
B.5. Resilience to hardware failures via replication
B.6. Splitting files across multiple DataNodes
Index
List of Figures
List of Tables
List of Listings
Foreword
At a high level, HBase is like the atomic bomb. Its basic operation can be explained on the back of a napkin over a drink (or two). Its deployment is another matter.
HBase is composed of multiple moving parts. The distributed HBase application is made up of client and server processes. Then there is the Hadoop Distributed File System (HDFS) to which HBase persists. HBase uses yet another distributed system, Apache ZooKeeper, to manage its cluster state. Most deployments throw in MapReduce to assist with bulk loading or running distributed full-table scans. It can be tough to get all the pieces pulling together in any approximation of harmony.
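A minimal client sketch, assuming a hypothetical ZooKeeper quorum and table name, shows how little of that machinery the application itself has to touch: the client points at ZooKeeper, which tells it where the master and regionservers live, and those servers persist to HDFS underneath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class MovingParts {
    public static void main(String[] args) throws Exception {
        // The client needs only the ZooKeeper quorum; ZooKeeper points it at the
        // HBase master and regionservers, which in turn store their data in HDFS.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum",
                 "zk1.example.com,zk2.example.com,zk3.example.com"); // illustrative hosts

        HTable table = new HTable(conf, "mytable"); // "mytable" is a placeholder
        System.out.println("Connected to table: " + new String(table.getTableName()));
        table.close();
    }
}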
Setting up the proper environment and configuration for HBase is critical. HBase is a general data store that can be used in a wide variety of applications. It ships with defaults that are conservatively targeted at a common use case and a generic hardware profile. Its ergonomic ability—its facility for self-tuning—is still under development, so you have to match HBase to the hardware and loading, and this configuration can take a couple of attempts to get right.
But proper configuration isn’t enough. If your HBase data-schema model is out of alignment with how the data store is being queried, no amount of configuration can compensate. You can achieve huge improvements when the schema agrees with how the data is queried. If you come from the realm of relational databases, you aren’t used to modeling schema. Although there is some overlap, making a columnar data store like HBase hum involves a different bag of tricks from those you use to tweak, say, MySQL.
If you need help with any of these dimensions, or with others such as how to add custom functionality to the HBase core or what a well-designed HBase application should look like, this is the book for you. In this timely, very practical text, Amandeep and Nick explain in plain language how to use HBase. It’s the book for those looking to get a leg up in deploying HBase-based applications.
Nick and Amandeep are the lads to learn from. They’re both long-time HBase practitioners. I recall the time Amandeep came to one of our early over-the-weekend Hackathons in San Francisco—a good many years ago now—where a few of us huddled around his well-worn ThinkPad trying to tame his RDF student project on an early version of HBase.
He has been paying the HBase community back ever since by helping others on the project mailing lists. Nick showed up not long after and has been around the HBase project in one form or another since that time, mostly building stuff on top of it. These boys have done the HBase community a service by taking the time out to research and codify their experience in a book.
You could probably get by with this text and an HBase download, but then you’d miss out on what’s best about HBase. A functional, welcoming community of developers has grown up around the HBase project and is all about driving the project forward. This community is what we—members such as myself and the likes of Amandeep and Nick—are most proud of. Although some big players contribute to HBase’s forward progress—Facebook, Huawei, Cloudera, and Salesforce, to name a few—it’s not the corporations that make a community. It’s the participating individuals who make HBase what it is. You should consider joining us. We’d love to have you.
MICHAEL STACK
CHAIR OF THE APACHE HBASE
PROJECT MANAGEMENT COMMITTEE
Letter to the HBase Community
Before we examine the current situation, please allow me to flash back a few years and look at the beginnings of HBase.
In 2007, when I was faced with using a large, scalable data store at literally no cost—because the project’s budget would not allow it—only a few choices were available. You could either use one of the free databases, such as MySQL or PostgreSQL, or a pure key/value store like Berkeley DB. Or you could develop something on your own and open up the playing field—which of course only a few of us were bold enough to attempt, at least in those days.
These solutions might have worked, but one of the major concerns was scalability. This feature wasn’t well developed and was often an afterthought to the existing systems. I had to store billions of documents, maintain a search index on them, and allow random updates to the data, while keeping index updates short. This led me to the third choice available that year: Hadoop and HBase.
Both had a strong pedigree, and they came out of Google, a Valhalla of the best talent that could be gathered when it comes to scalable systems. My belief was that if these systems could serve an audience as big as the world, their underlying foundations must be solid. Thus, I proposed to build my project with HBase (and Lucene, as a side note).
Choices were easy back in 2007. But as we flash forward through the years, the playing field grew, and we saw the advent of many competing, or complementing, solutions. The term NoSQL was used to group the increasing number of distributed databases under a common umbrella. A long and sometimes less-than-useful discussion arose around that name alone; to me, what mattered was that the available choices increased rapidly.
The next attempt to frame the various nascent systems was based on how their features compared: strongly consistent versus eventual consistent models, which were built to fulfill specific needs. People again tried to put HBase and its peers into this perspective: for example, using Eric Brewer’s CAP theorem. And yet again a heated discussion ensued about what was most important: being strongly consistent or being able to still serve data despite catastrophic, partial system failures.
And as before, to me, it was all about choices—but I learned that you need to fully understand a system before you can use it. It’s not about slighting other solutions as inferior; today we have a plentiful selection, with overlapping qualities. You have to become a specialist to distinguish them and make the best choice for the problem at hand.
This leads us to HBase and the current day. Without a doubt, its adoption by well-known, large web companies has raised its profile, proving that it can handle the given use cases. These companies have an important advantage: they employ very skilled engineers. On the other hand, a lot of smaller or less fortunate companies struggle to come to terms with HBase and its applications. We need someone to explain in plain, no-nonsense terms how to build easily understood and recurring use cases on top of HBase.
How do you design the schema to store complex data patterns, to trade between read and write performance? How do you lay out the data’s access patterns to saturate your HBase cluster to its full potential? Questions like these are a dime a dozen when you follow the public mailing lists. And that is where Amandeep and Nick come in. Their wealth of real-world experience at making HBase work in a variety of use cases will help you understand the intricacies of using the right data schema and access pattern to successfully build your next project.
What does the future of HBase hold? I believe it holds great things! The same technology is still powering large numbers of products and systems at Google, naysayers of the architecture have been proven wrong, and the community at large has grown into one of the healthiest I’ve ever been involved in. Thank you to all who have treated me as a fellow member; to those who daily help with patches and commits to make HBase even better; to companies that willingly sponsor engineers to work on HBase full time; and to the PMC of HBase, which is absolutely the most sincere group of people I have ever had the opportunity to know—you rock.
And finally a big thank-you to Nick and Amandeep for writing this book. It contributes to the value of HBase, and it opens doors and minds. We met before you started writing the book, and you had some concerns. I stand by what I said then: this is the best thing you could have done for HBase and the community. I, for one, am humbled and proud to be part of it.
LARS GEORGE
HBASE COMMITTER
Preface
I got my start with HBase in the fall of 2008. It was a young project then, released only in the preceding year. As early releases go, it was quite capable, although not without its fair share of embarrassing warts. Not bad for an Apache subproject with fewer than 10 active committers to its name! That was the height of the NoSQL hype. The term NoSQL hadn’t even been coined yet but would come into common parlance over the next year. No one could articulate why the idea was important—only that it was important—and everyone in the open source data community was obsessed with this concept. The community was polarized, with people either bashing relational databases for their foolish rigidity or mocking these new technologies for their lack of sophistication.
The people exploring this new idea were mostly in internet companies, and I came to work for such a company—a startup interested in the analysis of social media content. Facebook still enforced its privacy policies then, and Twitter wasn’t big enough to know what a Fail Whale was yet. Our interest at the time was mostly in blogs. I left a company where I’d spent the better part of three years working on a hierarchical database engine. We made extensive use of Berkeley DB, so I was familiar with data technologies that didn’t have a SQL engine. I joined a small team tasked with building a new data-management platform. We had an MS SQL database stuffed to the gills with blog posts and comments. When our daily analysis jobs breached the 18-hour mark, we knew the current system’s days were numbered.
After cataloging a basic set of requirements, we set out to find a new data technology. We were a small team and spent months evaluating different options while maintaining current systems. We experimented with different approaches and learned firsthand the pains of manually partitioning data. We studied the CAP theorem and eventual consistency—and the tradeoffs. Despite its warts, we decided on HBase, and we convinced our manager that the potential benefits outweighed the risks he saw in open source technology.
I’d played a bit with Hadoop at home but had never written a real MapReduce job. I’d heard of HBase but wasn’t particularly interested in it until I was in this new position. With the clock ticking, there was nothing to do but jump in. We scrounged up a couple of spare machines and a bit of rack, and then we were off and running. It was a .NET shop, and we had no operational help, so we learned to combine bash with rsync and managed the cluster ourselves.
I joined the mailing lists and the IRC channel and started asking questions. Around this time, I met Amandeep. He was working on his master’s thesis, hacking up HBase to run on systems other than Hadoop. Soon he finished school, joined Amazon, and moved to Seattle. We were among the very few HBase-ers in this extremely Microsoft-centric city. Fast-forward another two years...
The idea of HBase in Action was first proposed to us in the fall of 2010. From my perspective, the project was laughable. Why should we, two community members, write a book about HBase? Internally, it’s a complex beast. The Definitive Guide was still a work in progress, but we both knew its author, a committer, and were well aware of the challenge before him. From the outside, I thought, "It’s just a simple key-value store."
The API has only five concepts, none of which is complex. We weren’t going to write another internals book, and I wasn’t convinced there was enough going on from the application developer’s perspective to justify an entire book.
We started brainstorming the project, and it quickly became clear that I was wrong. Not only was there enough material for a user’s guide, but our position as community members made us ideal candidates to write such a book. We set out to catalogue the useful bits of knowledge we’d each accumulated over the couple of years we’d used the technology. That effort—this book—is the distillation of our eight years of combined HBase experience. It’s targeted to those brand new to HBase, and it provides guidance over the stumbling blocks we encountered during our own journeys. We’ve collected and codified as much as we could of the tribal knowledge floating around the community. Wherever possible, we prefer concrete direction to vague advice. Far more than a simple FAQ, we hope you’ll find this book to be a complete manual to getting off the ground with HBase.
HBase is now stabilizing. Most of the warts we encountered when we began with the project have been cleaned up, patched, or completely re-architected. HBase is approaching its 1.0 release, and we’re proud to be part of this community as we approach this milestone. We’re proud to present this manuscript to the community in hopes that it will encourage and enable the next generation of HBase users. The single strongest component of HBase is its thriving community—we hope you’ll join us in that community and help it continue to innovate in this new era of data systems.
NICK DIMIDUK
If you’re reading this, you’re presumably interested in knowing how I got involved with HBase. Let me start by saying thank you for choosing this book as your means to learn about HBase and how to build applications that use HBase as their underlying storage system. I hope you’ll find the text useful and learn some neat tricks that will help you build better applications and enable you to succeed.
I was pursuing graduate studies in computer science at UC Santa Cruz, specializing in distributed systems, when I started working at Cisco as a part-time researcher. The team I was working with was trying to build a data-integration framework that could integrate, index, and allow exploration of data residing in hundreds of heterogeneous data stores, including but not limited to large RDBMS systems. We started looking for systems and solutions that would help us solve the problems at hand. We evaluated many different systems, from object databases to graph databases, and we considered building a custom distributed data-storage layer backed by Berkeley DB. It was clear that one of the key requirements was scalability, and we didn’t want to build a full-fledged distributed system. If you’re in a situation where you think you need to build out a custom distributed database or file system, think again—try to see if an existing solution can solve part of your problem.
Following that principle, we decided that building out a new system wasn’t the best approach and to use an existing technology instead. That was when I started playing with the Hadoop ecosystem, getting my hands dirty with the different components in the stack and going on to build a proof-of-concept for the data-integration system on top of HBase. It actually worked and scaled well! HBase was well-suited to the problem, but these were young projects at the time—and one of the things that ensured our success was the community. HBase has one of the most welcoming and vibrant open source communities; it was much smaller at the time, but the key principles were the same then as now.
The data-integration project later became my master’s thesis. The project used HBase at its core, and I became more involved with the community as I built it out. I asked questions, and, with time, answered questions others asked, on both the mailing lists and the IRC channel. This is when I met Nick and got to know what he was working on. With each day that I worked on this project, my interest and love for the technology and the open source community grew, and I wanted to stay involved.
After finishing grad school, I joined Amazon in Seattle to work on back-end distributed systems projects. Much of my time was spent with the Elastic MapReduce team, building the first versions of their hosted HBase offering. Nick also lived in Seattle, and we met often and talked about the projects we were working on. Toward the end of 2010, the idea of writing HBase in Action for Manning came up. We initially scoffed at the thought of writing a book on HBase, and I remember saying to Nick, "It’s gets, puts, and scans—there’s not a lot more to HBase from the client side. Do you want to write a book about three API calls?"
But the more we thought about this, the more we realized that building applications with HBase was challenging and there wasn’t enough material to help people get off the ground. That limited the adoption of the project. We decided that more material on how to effectively use HBase would help users of the system build the applications they need. It took a while for the idea to materialize; in fall 2011, we finally got started.
Around this time, I moved to San Francisco to join Cloudera and was exposed to many applications that were built on top of HBase and the Hadoop stack. I brought what I knew, combined it with what I had learned over the last couple of years working with HBase and pursuing my master’s, and distilled that into concepts that became part of the manuscript for the book you’re now reading. HBase has come a long way in the last couple of years and has seen many big players adopt it as a core part of their stack. It’s more stable, faster, and easier to operationalize than it has ever been, and the project is fast approaching its 1.0 release.
Our intention in writing this book was to make learning HBase more approachable, easier, and more fun. As you learn more about the system, we encourage you to get involved with the community and to learn beyond what the book has to offer—to write blog posts, contribute code, and share your experiences to help drive this great open source project forward in every way possible. Flip open the book, start reading, and welcome to HBaseland!
AMANDEEP KHURANA
Acknowledgments
Working on this book has been a humbling reminder that we, as users, stand on the shoulders of giants. HBase and Hadoop couldn’t exist if not for those papers published by Google nearly a decade ago. HBase wouldn’t exist if not for the many individuals who picked up those papers and used them as inspiration to solve their own challenges. To every HBase and Hadoop contributor, past and present: we thank you. We’re especially grateful to the HBase committers. They continue to devote their time and effort to one of the most state-of-the-art data technologies in existence. Even more amazing, they give away the fruit of that effort to the wider community. Thank you.
This book would not have been possible without the entire HBase community. HBase enjoys one of the largest, most active, and most welcoming user communities in NoSQL. Our thanks to everyone who asks questions on the mailing list and who answers them in kind. Your welcome and willingness to answer questions encouraged us to get involved in the first place. Your unabashed readiness to post questions and ask for help is the foundation for much of the material we distill and clarify in this book. We hope to return the favor by expanding awareness of and the audience for HBase.
We’d like to thank specifically the many HBase committers and community members who helped us through this process. Special thanks to Michael Stack, Lars George, Josh Patterson, and Andrew Purtell for the encouragement and the reminders of the value a user’s guide to HBase could bring to the community. Ian Varley, Jonathan Hsieh, and Omer Trajman contributed in the form of ideas and feedback. The chapter on OpenTSDB and the section on asynchbase were thoroughly reviewed by Benoît Sigoure; thank you for your code and your comments. And thanks to Michael for contributing the foreword to our book and to Lars for penning the letter to the HBase community.
We’d also like to thank our respective employers (Cloudera, Inc., and The Climate Corporation) not just for being supportive but also for providing encouragement, without which finishing the manuscript would not have been possible.
At Manning, we thank our editors Renae Gregoire and Susanna Kline. You saw us through from a rocky start to the successful completion of this book. We hope your other projects aren’t as exciting as ours! Thanks also to our technical editor Mark Henry Ryan and our technical proofreaders Jerry Kuch and Kristine Kuch.
The following peer reviewers read the manuscript at various stages of its development and we would like to thank them for their insightful feedback: Aaron Colcord, Adam Kawa, Andy Kirsch, Bobby Abraham, Bruno Dumon, Charles Pyle, Cristofer Weber, Daniel Bretoi, Gianluca Righetto, Ian Varley, John Griffin, Jonathan Miller, Keith Kim, Kenneth DeLong, Lars Francke, Lars Hofhansl, Paul Stusiak, Philipp K. Janert, Robert J. Berger, Ryan Cox, Steve Loughran, Suraj Varma, Trey Spiva, and Vinod Panicker.
Last but not least—no project is complete without recognition of family and friends, because such a project can’t be completed without the support of loved ones. Thank you all for your support and patience throughout this adventure.
About this Book
HBase sits at the top of a stack of complex distributed systems including Apache Hadoop and Apache ZooKeeper. You need not be an expert in all these technologies to make effective use of HBase, but it helps to have an understanding of these foundational layers in order to take full advantage of HBase. These technologies were inspired by papers published by Google. They’re open source clones of the technologies described in these publications. Reading these academic papers isn’t a prerequisite for using HBase or these other technologies; but when you’re learning a technology, it can be helpful to understand the problems that inspired its invention. This book doesn’t assume you’re familiar with these technologies, nor does it assume you’ve read the associated papers.
HBase in Action is a user’s guide to HBase, nothing more and nothing less. It doesn’t venture into the bowels of the internal HBase implementation. It doesn’t cover the broad range of topics necessary for understanding the Hadoop ecosystem. HBase in Action maintains a singular focus on using HBase. It aims to educate you enough that you can build an application on top of HBase and launch that application into production. Along the way, you’ll learn some of those HBase implementation details. You’ll also become familiar with other parts of Hadoop. You’ll learn enough to understand why HBase behaves the way it does, and you’ll be able to ask intelligent questions. This book won’t turn you into an HBase committer. It will give you a practical introduction to HBase.
Roadmap
HBase in Action is organized into four parts. The first two are about using HBase. In these six chapters, you’ll go from HBase novice to fluent in writing applications on HBase. Along the way, you’ll learn about the basics, schema design, and how to use the most advanced features of HBase. Most important, you’ll learn how to think in HBase. The two chapters in part 3 move beyond sample applications and give you a taste of HBase in real applications. Part 4 is aimed at taking your HBase application from a development prototype to a full-fledged production system.
Chapter 1 introduces the origins of Hadoop, HBase, and NoSQL in general. We explain what HBase is and isn’t, contrast HBase with other NoSQL databases, and describe some common use cases. We’ll help you decide if HBase is the right technology choice for your project and organization. Chapter 1 concludes with a simple HBase install and gets you started with storing data.
Chapter 2 kicks off a running sample application. Through this example, we explore the foundations of using HBase. Creating tables, storing and retrieving data, and the HBase data model are all covered. We also explore enough HBase internals to understand how data is organized in HBase and how you can take advantage of that knowledge in your own applications.
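For a flavor of that client code, here is a minimal sketch in the spirit of the sample application, assuming a users table with an info column family (the rowkey, qualifier, and value shown are illustrative) and using the Java client API of that era:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UsersSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable users = new HTable(conf, "users"); // assumes a 'users' table with family 'info'

        // Store a row: rowkey, column family, qualifier, and value are all byte[]
        Put p = new Put(Bytes.toBytes("TheRealMT"));
        p.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Mark Twain"));
        users.put(p);

        // Read the value back by its coordinates
        Get g = new Get(Bytes.toBytes("TheRealMT"));
        Result r = users.get(g);
        byte[] name = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(name));

        users.close();
    }
}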
Chapter 3 re-introduces HBase as a distributed system. This chapter explores the relationship between HBase, Hadoop, and ZooKeeper. You’ll learn about the distributed architecture of HBase and how that translates into a powerful distributed data system. The use cases for using HBase with Hadoop MapReduce are explored with hands-on examples.
Chapter 4 is dedicated to HBase schema design. This complex topic is explained using the example application. You’ll see how table design decisions affect the application and how to avoid common mistakes. We’ll map any existing relational database knowledge you have into the HBase world. You’ll also see how to work around an imperfect schema design using server-side filters. This chapter also covers the advanced physical configuration options exposed by HBase.
Chapter 5 introduces coprocessors, a mechanism for pushing computation out to your HBase cluster. You’ll extend the sample application in two different ways, building new application features into the cluster itself.
Chapter 6 is a whirlwind tour of alternative HBase clients. HBase is written in Java, but that doesn’t mean your application must be. You’ll interact with the sample application from a variety of languages and over a number of different network protocols.
Part 3 starts with Chapter 7, which opens a real-world, production-ready application. You’ll learn a bit about the problem domain and the specific challenges the application solves. Then we dive deep into the implementation and don’t skimp on the technical details. If ever there was a front-to-back exploration of an application built on HBase, this is it.
Chapter 8 shows you how to map HBase onto a new problem domain. We get you up to speed on that domain, GIS, and then show you how to tackle domain-specific challenges in a scalable way with HBase. The focus is on a domain-specific schema design and making maximum use of scans and filters. No previous GIS experience is expected, but be prepared to use most of what you’ve learned in the previous chapters.
In part 4, Chapter 9 bootstraps your HBase cluster. Starting from a blank slate, we show you how to tackle your HBase deployment. What kind of hardware, how much hardware, and how to allocate that hardware are all fair game in this chapter. Considering the cloud? We cover that too. With hardware determined, we show you how to configure your cluster for a basic deployment and how to get everything up and running.
Chapter 10 rolls your deployment into production. We show you how to keep an eye on your cluster through metrics and monitoring tools. You’ll see how to further tune your cluster for performance, based on your application workloads. We show you how to administer the needs of your cluster, keep it healthy, diagnose and fix it when it’s sick, and upgrade it when the time comes. You’ll learn to use the bundled tools for managing data backups and restoration, and how to configure multi-cluster replication.