Apache Hive Essentials
By Dayong Du
About this ebook
- Discover how Hive can coexist and work with other tools in the Hadoop ecosystem to create big data solutions
- Grasp the skills needed, learn the best practices, and avoid the pitfalls in writing efficient Hive queries to analyze big data
- Create an environment to analyze big data using practical, example-oriented scenarios
If you are a data analyst, developer, or simply someone who wants to use Hive to explore and analyze data in Hadoop, this is the book for you. Whether you are new to big data or an expert, with this book, you will be able to master both the basic and the advanced features of Hive. Since Hive is an SQL-like language, some previous experience with SQL and databases will help you get more out of this book.
Table of Contents
Apache Hive Essentials
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Overview of Big Data and Hive
A short history
Introducing big data
Relational and NoSQL database versus Hadoop
Batch, real-time, and stream processing
Overview of the Hadoop ecosystem
Hive overview
Summary
2. Setting Up the Hive Environment
Installing Hive from Apache
Installing Hive from vendor packages
Starting Hive in the cloud
Using the Hive command line and Beeline
The Hive-integrated development environment
Summary
3. Data Definition and Description
Understanding Hive data types
Data type conversions
Hive Data Definition Language
Hive database
Hive internal and external tables
Hive partitions
Hive buckets
Hive views
Summary
4. Data Selection and Scope
The SELECT statement
The INNER JOIN statement
The OUTER JOIN and CROSS JOIN statements
Special JOIN – MAPJOIN
Set operation – UNION ALL
Summary
5. Data Manipulation
Data exchange – LOAD
Data exchange – INSERT
Data exchange – EXPORT and IMPORT
ORDER and SORT
Operators and functions
Transactions
Summary
6. Data Aggregation and Sampling
Basic aggregation – GROUP BY
Advanced aggregation – GROUPING SETS
Advanced aggregation – ROLLUP and CUBE
Aggregation condition – HAVING
Analytic functions
Sampling
Summary
7. Performance Considerations
Performance utilities
The EXPLAIN statement
The ANALYZE statement
Design optimization
Partition tables
Bucket tables
Index
Data file optimization
File format
Compression
Storage optimization
Job and query optimization
Local mode
JVM reuse
Parallel execution
Join optimization
Common join
Map join
Bucket map join
Sort merge bucket (SMB) join
Sort merge bucket map (SMBM) join
Skew join
Summary
8. Extensibility Considerations
User-defined functions
The UDF code template
The UDAF code template
The UDTF code template
Development and deployment
Streaming
SerDe
Summary
9. Security Considerations
Authentication
Metastore server authentication
HiveServer2 authentication
Authorization
Legacy mode
Storage-based mode
SQL standard-based mode
Encryption
Summary
10. Working with Other Tools
JDBC / ODBC connector
HBase
Hue
HCatalog
ZooKeeper
Oozie
Hive roadmap
Summary
Index
Apache Hive Essentials
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2015
Production reference: 1210215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78355-857-5
www.packtpub.com
Credits
Author
Dayong Du
Reviewers
Puneetha B M
Hamzeh Khazaei
Nitin Pradeep Kumar
Balaswamy Vaddeman
Commissioning Editor
Ashwin Nair
Acquisition Editor
Shaon Basu
Content Development Editor
Merwyn D'souza
Technical Editor
Taabish Khan
Copy Editors
Sameen Siddiqui
Laxmi Subramanian
Project Coordinator
Neha Bhatnagar
Proofreaders
Paul Hindle
Jonathan Todd
Indexer
Monica Ajmera Mehta
Production Coordinator
Aparna Bhagat
Cover Work
Aparna Bhagat
About the Author
Dayong Du is a big data practitioner, leader, and developer with expertise in technology consulting, designing, and implementing enterprise big data solutions. With more than 10 years of experience in enterprise data warehousing, business intelligence, and big data and analytics, he has provided his data intelligence expertise in various industries, such as media, travel, and telecommunications. He is currently working with QuickPlay Media in Toronto, Canada, to build enterprise big data intelligence reporting for online media services and content providers. He has a master's degree in computer science from Dalhousie University, and he holds the Cloudera Certified Developer for Apache Hadoop certification.
I would like to sincerely thank my wife, Joice, and daughter, Elaine, for their sacrifices and encouragement during this journey. Also, I would like to thank my parents for their support during the time of writing this book.
I would also like to thank everyone at Packt Publishing and the technical reviewers for their valuable help, guidance, and feedback on my book.
About the Reviewers
Puneetha B M is a software engineer, data enthusiast, and technical blogger. Her research interests include big data, cloud computing, machine learning, and NoSQL databases. She is also a professional software engineer with more than 2 years of working experience. She holds a master's degree in computer applications from P.E.S. Institute of Technology. Other than programming, she enjoys painting and listening to music. You can learn more from her blog (http://blog.puneethabm.in/) and LinkedIn profile (https://www.linkedin.com/in/puneethabm).
I owe a great deal to Prof. Dr. Ram Rustagi for being a role model in my life and for his zealous inspiration. I would like to thank my brother, Nischith B.M., for supporting me in everything I do. I would also like to thank Packt Publishing and its staff for providing the opportunity to contribute to this book.
Hamzeh Khazaei is a postdoctoral research scientist at IBM Canada Research and Development Centre. He received his PhD degree in computer science from the University of Manitoba, Winnipeg, Manitoba, Canada (2009–2012). Earlier, he received both his BSc and MSc degrees in computer science from Amirkabir University of Technology, Tehran, Iran (2000–2008). He is also a sessional instructor in the Computer Science department at Ryerson University (http://scs.ryerson.ca/~hkhazaei), where he teaches software engineering to fourth-year undergraduate students. His research areas include big data analytics, cloud computing infrastructure, analytics as a service, and modeling of computing systems.
I would like to thank my dear wife for her perpetual support in all my endeavors.
Nitin Pradeep Kumar is a passionate developer with extensive experience and oodles of interest in emerging technologies such as the cloud and mobile. He is currently a cloud quality engineer at Appcelerator, a leading Silicon Valley-based start-up that provides an MBaaS platform purpose-built for mobile and cloud development. Before this stint, he studied at the National University of Singapore toward a master's degree in knowledge engineering, which involves building intelligent systems using cutting-edge artificial intelligence and data-mining techniques. He enjoys the start-up environment and has worked with technologies such as Hadoop, Hive, and data warehousing. He lives in Singapore and spends his spare cycles playing retro PC games on his mobile and learning Muay Thai.
I would like to thank my family, friends, and my wonderful brother, Nivin, for supporting me in all my endeavors.
Balaswamy Vaddeman is a Hadoop hackathon winner for Hyderabad in 2013. He is one of the top contributors on the Hive tag at http://www.stackoverflow.com. He is a big data professional with 3 years of experience. He is well known for training people on big data/Hadoop. So far, he has delivered six big data projects. He is a Java/J2EE expert with 8 years of IT experience and 5 years of RDBMS experience. He is an automation expert on Unix-based systems using Shell scripting. He has experience in setting up teams and bringing them up to speed on big data projects. He is an active participant in Hadoop/big data forums.
I would like to thank my wife, Radha, my son, Pandu, and my daughter, Bubly, for their cooperation in completing this book.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
I dedicate this book to my daughter
Preface
With the growing interest in big data analysis, Hive over Hadoop has become a cutting-edge data solution for storing, computing, and analyzing big data. Its SQL-like syntax makes Hive easy to learn and has led to its wide acceptance as a standard for interactive SQL queries over big data. The variety of features available in Hive lets us perform complex big data analysis without advanced coding skills. As Hive has matured, it has gradually begun to share its architecture and functionality with computing frameworks beyond Hadoop.
Apache Hive Essentials prepares you for your journey into big data. The first two chapters introduce the background and concepts of the big data domain and walk through setting up and getting familiar with your Hive working environment. The next four chapters guide you through discovering and transforming the value behind big data, using examples and techniques of the Hive query language. The last four chapters highlight selected advanced topics, such as performance, security, and extensions, as exciting adventures for this worthwhile big data journey.
What this book covers
Chapter 1, Overview of Big Data and Hive, introduces the evolution of big data, the Hadoop ecosystem, and Hive. You will also learn the Hive architecture and the advantages of using Hive in big data analysis.
Chapter 2, Setting Up the Hive Environment, describes the Hive environment setup and configuration. It also covers using Hive through the command line and development tools.
Chapter 3, Data Definition and Description, introduces the basic data types and data definition language for tables, partitions, buckets, and views in Hive.
Chapter 4, Data Selection and Scope, shows you ways to discover the data by querying, linking, and scoping the data in Hive.
Chapter 5, Data Manipulation, describes the process of exchanging, moving, sorting, and transforming the data in Hive.
Chapter 6, Data Aggregation and Sampling, explains how to aggregate and sample data using aggregate functions, analytic functions, windowing, and sampling clauses.
Chapter 7, Performance Considerations, introduces performance best practices for design, file format, compression, storage, queries, and jobs.
Chapter 8, Extensibility Considerations, describes how to extend Hive by creating user-defined functions, streaming, serializers, and deserializers.
Chapter 9, Security Considerations, introduces the area of Hive security in terms of authentication, authorization, and encryption.
Chapter 10, Working with Other Tools, discusses how Hive works with other big data tools. It also reviews the key milestones of Hive releases.
What you need for this book
You will need to install both Hadoop and Hive to run the examples in this book. The scripts in this book were written and tested with Cloudera Distributed Hadoop (CDH) v5.3 (contains Hive v0.13.x and Hadoop v2.5.0), Hortonworks Data Platform (HDP) v2.2 (contains Hive v0.14.0 and Hadoop v2.6.0), and Apache Hive 1.0.0 (with Hadoop 1.2.1) in pseudo-distributed mode. However, the majority of the scripts will also run on the previous versions of Hadoop and Hive. The following are the other software applications you may need for a better understanding of the Hive-related tools mentioned in the book. These tools are also available in the CDH or HDP packages.
Hue 2.2.0 and above
HBase 0.98.4
Oozie 4.0.0 and above
ZooKeeper 3.4.5
Tez 0.6.0
Who this book is for
If you are a data analyst, developer, or simply a user who wants to use Hive to explore and analyze data in Hadoop, this is the book for you. Whether you are new to big data or an expert, you will be able to master both the basic and the advanced features of Hive. Since Hive is an SQL-like language, some previous experience with SQL and databases will help you get more out of this book.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: Aggregate function can be used with other aggregate functions in the same select statement.
A block of code is set as follows:
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
customAuthenticator.java
package com.packtpub.hive.essentials.hiveudf;
import java.util.Hashtable;
import javax.security.sasl.AuthenticationException;
import org.apache.hive.service.auth.PasswdAuthenticationProvider;
Any command-line input or output is written as follows:
bash-4.1$ hdfs dfs -mkdir /tmp
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Click on the OK button and restart Oracle SQL Developer.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book