Pro Hadoop Data Analytics
Designing and Building Big Data Systems using the Hadoop Ecosystem
Kerry Koitzsch
Pro Hadoop Data Analytics: Designing and Building Big Data Systems using the Hadoop Ecosystem
Kerry Koitzsch
Sunnyvale, California, USA
ISBN-13 (pbk): 978-1-4842-1909-6 ISBN-13 (electronic): 978-1-4842-1910-2
DOI 10.1007/978-1-4842-1910-2
Library of Congress Control Number: 2016963203
Copyright © 2017 by Kerry Koitzsch
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage
and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with
every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material
contained herein.
Managing Director: Welmoed Spahr
Lead Editor: Celestin Suresh John
Technical Reviewer: Simin Boschma
Editorial Board: Steve Anglin, Pramila Balan, Laura Berendson, Aaron Black, Louise Corrigan,
Jonathan Gennick, Robert Hutchinson, Celestin Suresh John, Nikhil Karkal, James Markham,
Susan McDermott, Matthew Moodie, Natalie Pao, Gwenan Spearing
Coordinating Editor: Prachi Mehta
Copy Editor: Larissa Shmailo
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global
Distributed to the book trade worldwide by Springer Science+Business Media New York,
233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail
[email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC
and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM
Finance Inc is a Delaware corporation.
For information on translations, please e-mail [email protected], or visit www.apress.com.
Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook
versions and licenses are also available for most titles. For more information, reference our Special Bulk
Sales–eBook Licensing web page at www.apress.com/bulk-sales.
Any source code or other supplementary materials referenced by the author in this text are available to
readers at www.apress.com. For detailed information about how to locate your book’s source code, go to
www.apress.com/source-code/. Readers can also access source code at SpringerLink in the Supplementary
Material section for each chapter.
Printed on acid-free paper
To Sarvnaz, whom I love.
Contents at a Glance
■■Part I: Concepts ... 1
■■Chapter 1: Overview: Building Data Analytic Systems with Hadoop ... 3
■■Chapter 2: A Scala and Python Refresher ... 29
■■Chapter 3: Standard Toolkits for Hadoop and Analytics ... 43
■■Chapter 4: Relational, NoSQL, and Graph Databases ... 63
■■Chapter 5: Data Pipelines and How to Construct Them ... 77
■■Chapter 6: Advanced Search Techniques with Hadoop, Lucene, and Solr ... 91
■■Part II: Architectures and Algorithms ... 137
■■Chapter 7: An Overview of Analytical Techniques and Algorithms ... 139
■■Chapter 8: Rule Engines, System Control, and System Orchestration ... 151
■■Chapter 9: Putting It All Together: Designing a Complete Analytical System ... 165
■■Part III: Components and Systems ... 177
■■Chapter 10: Data Visualizers: Seeing and Interacting with the Analysis ... 179
■■Part IV: Case Studies and Applications ... 201
■■Chapter 11: A Case Study in Bioinformatics: Analyzing Microscope Slide Data ... 203
■■Chapter 12: A Bayesian Analysis Component: Identifying Credit Card Fraud ... 215
■■Chapter 13: Searching for Oil: Geographical Data Analysis with Apache Mahout ... 223
■■Chapter 14: “Image As Big Data” Systems: Some Case Studies ... 235
■■Chapter 15: Building a General Purpose Data Pipeline ... 257
■■Chapter 16: Conclusions and the Future of Big Data Analysis ... 263
■■Appendix A: Setting Up the Distributed Analytics Environment ... 275
■■Appendix B: Getting, Installing, and Running the Example Analytics System ... 289
Index ... 291
Contents
■■Part I: Concepts ... 1
■■Chapter 1: Overview: Building Data Analytic Systems with Hadoop ... 3
1.1 A Need for Distributed Analytical Systems ... 4
1.2 The Hadoop Core and a Small Amount of History ... 5
1.3 A Survey of the Hadoop Ecosystem ... 5
1.4 AI Technologies, Cognitive Computing, Deep Learning, and Big Data Analysis ... 7
1.5 Natural Language Processing and BDAs ... 7
1.6 SQL and NoSQL Querying ... 7
1.7 The Necessary Math ... 8
1.8 A Cyclic Process for Designing and Building BDA Systems ... 8
1.9 How The Hadoop Ecosystem Implements Big Data Analysis ... 11
1.10 The Idea of “Images as Big Data” (IABD) ... 11
1.10.1 Programming Languages Used ... 13
1.10.2 Polyglot Components of the Hadoop Ecosystem ... 13
1.10.3 Hadoop Ecosystem Structure ... 14
1.16 Summary ... 26
■■Chapter 2: A Scala and Python Refresher ... 29
2.1 Motivation: Selecting the Right Language(s) Defines the Application ... 29
2.1.1 Language Features—a Comparison ... 30
2.2 Review of Scala ... 31
2.2.1 Scala and its Interactive Shell ... 31
2.3 Review of Python ... 36
2.4 Troubleshoot, Debug, Profile, and Document ... 39
2.4.1 Debugging Resources in Python ... 40
2.4.2 Documentation of Python ... 41
2.4.3 Debugging Resources in Scala ... 41
■■Part II: Architectures and Algorithms ... 137
■■Chapter 7: An Overview of Analytical Techniques and Algorithms ... 139
7.1 Survey of Algorithm Types ... 139
7.2 Statistical / Numerical Techniques ... 141
7.3 Bayesian Techniques ... 142
7.4 Ontology Driven Algorithms ... 143
7.5 Hybrid Algorithms: Combining Algorithm Types ... 145
7.6 Code Examples ... 146
7.7 Summary ... 150
7.8 References ... 150
■■Chapter 8: Rule Engines, System Control, and System Orchestration ... 151
8.1 Introduction to Rule Systems: JBoss Drools ... 151
8.2 Rule-based Software Systems Control ... 156
8.3 System Orchestration with JBoss Drools ... 157
8.4 Analytical Engine Example with Rule Control ... 160
8.5 Summary ... 163
8.6 References ... 164
■■Chapter 9: Putting It All Together: Designing a Complete Analytical System ... 165
9.1 Summary ... 175
9.2 References ... 175
■■Part III: Components and Systems ... 177
■■Chapter 10: Data Visualizers: Seeing and Interacting with the Analysis ... 179
10.1 Simple Visualizations ... 179
10.2 Introducing Angular JS and Friends ... 186
10.3 Using JHipster to Integrate Spring XD and Angular JS ... 186
10.4 Using d3.js, sigma.js and Others ... 197
10.5 Summary ... 199
10.6 References ... 200
■■Part IV: Case Studies and Applications ... 201
■■Chapter 11: A Case Study in Bioinformatics: Analyzing Microscope Slide Data ... 203
11.1 Introduction to Bioinformatics ... 203
11.2 Introduction to Automated Microscopy ... 206
11.3 A Code Example: Populating HDFS with Images ... 210
11.4 Summary ... 213
11.5 References ... 214
■■Chapter 12: A Bayesian Analysis Component: Identifying Credit Card Fraud ... 215
12.1 Introduction to Bayesian Analysis ... 215
12.2 A Bayesian Component for Credit Card Fraud Detection ... 218
12.2.1 The Basics of Credit Card Validation ... 218
12.3 Summary ... 221
12.4 References ... 221
■■Chapter 13: Searching for Oil: Geographical Data Analysis with Apache Mahout ... 223
13.1 Introduction to Domain-Based Apache Mahout Reasoning ... 223
13.2 Smart Cartography Systems and Hadoop Analytics ... 231
13.3 Summary ... 233
13.4 References ... 233
■■Chapter 14: “Image As Big Data” Systems: Some Case Studies ... 235
14.1 An Introduction to Images as Big Data ... 235
14.2 First Code Example Using the HIPI System ... 238
14.3 BDA Image Toolkits Leverage Advanced Language Features ... 242
14.4 What Exactly are Image Data Analytics? ... 243
14.5 Interaction Modules and Dashboards ... 245
14.6 Adding New Data Pipelines and Distributed Feature Finding ... 246
14.7 Example: A Distributed Feature-finding Algorithm ... 246
14.8 Low-Level Image Processors in the IABD Toolkit ... 252
14.9 Terminology ... 253
14.10 Summary ... 254
14.11 References ... 254
■■Chapter 15: Building a General Purpose Data Pipeline ... 257
15.1 Architecture and Description of an Example System ... 257
15.2 How to Obtain and Run the Example System ... 258
15.3 Five Strategies for Pipeline Building ... 258
15.3.1 Working from Data Sources and Sinks ... 258
15.3.2 Middle-Out Development ... 259
15.3.3 Enterprise Integration Pattern (EIP)-based Development ... 259
15.3.4 Rule-based Messaging Pipeline Development ... 260
15.3.5 Control + Data (Control Flow) Pipelining ... 261
15.4 Summary ... 261
15.5 References ... 262
■■Chapter 16: Conclusions and the Future of Big Data Analysis ... 263
16.1 Conclusions and a Chronology ... 263
16.2 The Current State of Big Data Analysis ... 264
16.3 “Incubating Projects” and “Young Projects” ... 267
16.4 Speculations on Future Hadoop and Its Successors ... 268
16.5 A Different Perspective: Current Alternatives to Hadoop ... 270
16.6 Use of Machine Learning and Deep Learning Techniques in “Future Hadoop” ... 271
16.7 New Frontiers of Data Visualization and BDAs ... 272
16.8 Final Words ... 272
■■Appendix A: Setting Up the Distributed Analytics Environment ... 275
Overall Installation Plan ... 275
Set Up the Infrastructure Components ... 278
Basic Example System Setup ... 278
Apache Hadoop Setup ... 280
Install Apache Zookeeper ... 281
Installing Basic Spring Framework Components ... 283
Basic Apache HBase Setup ... 283
Apache Hive Setup ... 283
Additional Hive Troubleshooting Tips ... 284
Installing Apache Falcon ... 284
Installing Visualizer Software Components ... 284
Installing Gnuplot Support Software ... 284
Installing Apache Kafka Messaging System ... 285
Installing TensorFlow for Distributed Systems ... 286
Installing JBoss Drools ... 286
Verifying the Environment Variables ... 287
References ... 288
■■Appendix B: Getting, Installing, and Running the Example Analytics System ... 289
Troubleshooting FAQ and Questions Information ... 289
References to Assist in Setting Up Standard Components ... 289
Index ... 291
About the Author
Kerry Koitzsch has had more than twenty years of experience in the computer science, image processing,
and software engineering fields, and has worked extensively with Apache Hadoop and Apache Spark
technologies in particular. Kerry specializes in software consulting involving customized big data
applications including distributed search, image analysis, stereo vision, and intelligent image retrieval
systems. Kerry currently works for Kildane Software Technologies, Inc., a robotic systems and image analysis
software provider in Sunnyvale, California.
About the Technical Reviewer
Simin Boschma has over twenty years of experience in computer design engineering. Simin’s experience
also includes program and partner management, as well as developing commercial hardware and software
products at high-tech companies throughout Silicon Valley, including Hewlett-Packard and SanDisk. In
addition, Simin has more than ten years of experience in technical writing, reviewing, and publication
technologies. Simin currently works for Kildane Software Technologies, Inc. in Sunnyvale, CA.
Acknowledgments
I would like to acknowledge the invaluable help of my editors Celestin Suresh John and Prachi Mehta,
without whom this book would never have been written, as well as the expert assistance of the technical
reviewer, Simin Boschma.
Introduction
The Apache Hadoop software library has come into its own. It is the basis for advanced distributed
development for a host of companies, government institutions, and scientific research facilities. The
Hadoop ecosystem now contains dozens of components for everything from search, databases, and data
warehousing to image processing, deep learning, and natural language processing. With the advent of
Hadoop 2, different resource managers may be used to provide an even greater level of sophistication and
control than previously possible. Competitors, replacements, as well as successors and mutations of the
Hadoop technologies and architectures abound. These include Apache Flink, Apache Spark, and many
others. The “death of Hadoop” has been announced many times by software experts and commentators.
We have to face the question squarely: is Hadoop dead? It depends on the perceived boundaries of
Hadoop itself. Do we consider Apache Spark, the in-memory successor to Hadoop’s batch file approach, a
part of the Hadoop family simply because it also uses HDFS, the Hadoop file system? Many other examples
of “gray areas” exist in which newer technologies replace or enhance the original “Hadoop classic” features.
Distributed computing is a moving target and the boundaries of Hadoop and its ecosystem have changed
remarkably over a few short years. In this book, we attempt to show some of the diverse and dynamic aspects
of Hadoop and its associated ecosystem, and to try to convince you that, although changing, Hadoop is still
very much alive, relevant to current software development, and particularly interesting to data analytics
programmers.
PART I
Concepts
The first part of our book describes the basic concepts, structure, and use of the distributed analytics
software system, why it is useful, and some of the necessary tools required to use this type of
distributed system. We will also introduce some of the distributed infrastructure we need to build
systems, including Apache Hadoop and its ecosystem.
CHAPTER 1
Overview: Building Data Analytic Systems with Hadoop
This book is about designing and implementing software systems that ingest, analyze, and visualize big data
sets. Throughout the book, we’ll use the acronym BDA or BDAs (big data analytics system) to describe this
kind of software. Big data itself deserves a word of explanation. As computer programmers and architects,
we know that what we now call “big data” has been with us for a very long time—decades, in fact, because
“big data” has always been a relative, multi-dimensional term, a space which is not defined by the mere size
of the data alone. Complexity, speed, veracity—and of course, size and volume of data—are all dimensions
of any modern “big data set”.
In this chapter, we discuss what big data analytic systems (BDAs) using Hadoop are, why they are
important, what data sources, sinks, and repositories may be used, and candidate applications which
are—and are not—suitable for a distributed system approach using Hadoop. We also briefly discuss some
alternatives to the Hadoop/Spark paradigm for building this type of system.
There has always been a sense of urgency in software development, and the development of big data
analytics is no exception. Even in the earliest days of what was to become a burgeoning new industry, big
data analytics have demanded the ability to process and analyze more and more data at a faster rate, and at
a deeper level of understanding. When we examine the practical nuts-and-bolts details of software system
architecting and development, the fundamental requirement to process more and more data in a more
comprehensive way has always been a key objective in abstract computer science and applied computer
technology alike. Again, big data applications and systems are no exception to this rule. This can be no
surprise when we consider how available global data resources have grown explosively over the last few
years, as shown in Figure 1-1.
Figure 1-1. Annual data volume statistics [Cisco VNI Global IP Traffic Forecast 2014–2019]
As a result of the rapid evolution of software components and inexpensive off-the-shelf processing
power, combined with the rapidly increasing pace of software development itself, architects and
programmers desiring to build a BDA for their own application can often feel overwhelmed by the
technological and strategic choices confronting them in the BDA arena. In this introductory chapter, we
will take a high-level overview of the BDA landscape and attempt to pin down some of the technological
questions we need to ask ourselves when building BDAs.
■■Note Throughout this book we will keep the emphasis on free, third-party components such as the Apache
components and libraries mentioned earlier. This doesn’t mean you can’t integrate your favorite graph database
(or relational database, for that matter) as a data source into your BDAs. We will also emphasize the flexibility
and modularity of the open source components, which allow you to hook data pipeline components together
with a minimum of additional software “glue.” In our discussion we will use the Spring Data component of the
Spring Framework, as well as Apache Camel, to provide the integrating “glue” support to link our components.
1 One of the best introductions to the “semantic web” approach is Dean Allemang and Jim Hendler’s Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL, 2008, Morgan-Kaufmann/Elsevier Publishing, Burlington, MA. ISBN 978-0-12-373556-0.
1. Identify requirements for the BDA system. The initial phase of development
requires generating an outline of the technologies, resources, techniques and
strategies, and other components necessary to achieve the objectives. The initial
set of objectives (subject to change of course) need to be pinned down, ordered,
and well-defined. It’s understood that the objectives and other requirements
are subject to change as more is learned about the project’s needs. BDA systems
have special requirements (which might include what’s in your Hadoop cluster,
special data sources, user interface, reporting, and dashboard requirements).
Make a list of data source types, data sink types, necessary parsing,
transformation, validation, and data security concerns. Being able to adapt
your requirements to the plastic and changeable nature of BDA technologies
will ensure you can modify your system in a modular, organized way. Identify
computations and processes in the components, determine whether batch or
stream processing (or both) is required, and draw a flowchart of the computation
engine. This will help define and understand the “business logic” of the system.
2. Define the initial technology stack. The initial technology stack will include a
Hadoop Core as well as appropriate ecosystem components appropriate for the
requirements you defined in the last step. You may include Apache Spark if you
require streaming support, or you’re using one of the machine learning libraries
based on Spark we discuss later in the book. Keep in mind the programming
languages you will use. If you are using Hadoop, the Java language will be part
of the stack. If you are using Apache Spark, the Scala language will also be used.
Python has a number of very interesting special applications, as we will discuss
in a later chapter. Other language bindings may be used if they are part of the
requirements.
3. Define data sources, input and output data formats, and data cleansing
processes. In the requirement-gathering phase (step 1), you made an initial
list of the data source/sink types and made a top-level flowchart to help define
your data pipeline. A lot of exotic data sources may be used in a BDA system,
including images, geospatial locations, timestamps, log files, and many others,
so keep a current list of data source (and data sink!) types handy as you do your
initial design work.
4. Define, gather, and organize initial data sets. You may have initial data for your
project, test and training data (more about training data later in the book), legacy
data from previous systems, or no data at all. Think about the minimum amount
of data sets (number, kind, and volume) and make a plan to procure or generate
the data you need. Please note that as you add new code, new data sets may
be required in order to perform adequate testing. The initial data sets should
exercise each module of the data pipeline, assuring that end-to-end processing is
performed properly.
5. Define the computations to be performed. Business logic in its conceptual
form comes from the requirements phase, but what this logic is and how it is
implemented will change over time. In this phase, define inputs, outputs, rules,
and transformations to be performed on your data elements. These definitions
get translated into implementation of the computation engine in step 7.
6. Preprocess data sets for use by the computation engine. Sometimes the data
sets need preprocessing: validation, security checks, cleansing, conversion to a
more appropriate format for processing, and several other steps. Have a checklist
of preprocessing objectives to be met, and continue to pay attention to these
issues throughout the development cycle, and make necessary modifications as
the development progresses.
7. Define the computation engine steps; define the result formats. The business
logic, flow, accuracy of results, algorithm and implementation correctness, and
efficiency of the computation engine will always need to be questioned and improved.
8. Place filtered results in results repositories or data sinks. Data sinks are the
data repositories that hold the final output of our data pipeline. There may be
several steps of filtration or transformation before your output data is ready to
be reported or displayed. The final results of your analysis can be stored in files,
databases, temporary repositories, reports, or whatever the requirements dictate.
Keep in mind user actions from the UI or dashboard may influence the format,
volume, and presentation of the outputs. Some of these interactive results
may need to be persisted back to a data store. Organize a list of requirements
specifically for data output, reporting, presentation, and persistence.
9. Define and build output reports, dashboards, and other output displays and
controls. The output displays and reports, which are generated, provide clarity
on the results of all your analytic computations. This component of a BDA system
is typically written, at least in part, in JavaScript and may use sophisticated data
visualization libraries to assist different kinds of dashboards, reports, and other
output displays.
10. Document, test, refine, and repeat. If necessary, we can go through the steps
again after refining the requirements, stack, algorithms, data sets, and the rest.
Documentation initially consists of the notes you made throughout the previous
steps, but needs to be refined and rewritten as the project progresses. Tests need
to be created, refined, and improved throughout each cycle. Incidentally, each
development cycle can be considered a version, iteration, or however you like to
organize your program cycles.
There you have it. A systematic use of this iterative process will enable you to design and build BDA
systems comparable to the ones described in this book.
The good news about including imagery as a big data source is that it’s not at all as difficult as it once
was. Sophisticated libraries are available to interface with Hadoop and other necessary components, such
as graph databases, or a messaging component such as Apache Kafka. Low-level libraries such as OpenCV
or BoofCV can provide image processing primitives, if necessary. Writing code is compact and easy. For
example, we can write a simple, scrollable image viewer with the following Java class (shown in Listing 1-1).
Listing 1-1. Hello image world: Java code for an image visualizer stub as shown in Figure 1-5
package com.kildane.iabt;

import java.awt.image.RenderedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import javax.media.jai.JAI;
import javax.media.jai.PlanarImage;
import javax.media.jai.widget.ScrollingImagePanel;
import javax.swing.JFrame;

/**
 * Hello IABT world!
 * The world's most powerful image processing toolkit (for its size)?
 */
public class App
{
    public static void main(String[] args)
    {
        JAI jai = new JAI();
        RenderedImage image = null;
        try {
            image = ImageIO.read(new File("/Users/kerryk/Documents/SA1_057_62_hr4.png"));
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (image == null) { System.out.println("Sorry, the image was null"); return; }
        JFrame f = new JFrame("Image Processing Demo for Pro Hadoop Data Analytics");
        ScrollingImagePanel panel = new ScrollingImagePanel(image, 512, 512);
        f.add(panel);
        f.setSize(512, 512);
        f.setVisible(true);
        System.out.println("Hello IABT World, version of JAI is: " + JAI.getBuildVersion());
    }
}
Figure 1-5. Sophisticated third-party libraries make it easy to build image visualization components in just a
few lines of code
A simple image viewer is just the beginning of an image BDA system, however. There is low-level image
processing, feature extraction, transformation into an appropriate data representation for analysis, and
finally loading out the results to reports, dashboards, or customized result displays.
We will explore the images as big data (IABD) concept more thoroughly in Chapter 14.
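To make these stages concrete, the following toy Java sketch strings the three stages together for a single image. The grayscale "preprocessing" step and the mean-intensity "feature" are purely illustrative placeholders, not part of any particular toolkit discussed in this book.

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import javax.imageio.ImageIO;

/** Illustrative stages of an image-as-big-data pipeline (toolkit-neutral sketch). */
public class ImagePipelineSketch
{
    // Stage 1: low-level image processing (here, a simple grayscale conversion).
    static BufferedImage preprocess(BufferedImage in) {
        BufferedImage gray = new BufferedImage(in.getWidth(), in.getHeight(),
                BufferedImage.TYPE_BYTE_GRAY);
        gray.getGraphics().drawImage(in, 0, 0, null);
        return gray;
    }

    // Stage 2: feature extraction (a trivial mean-intensity "feature").
    static Map<String, Double> extractFeatures(BufferedImage img) {
        long sum = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                sum += img.getRaster().getSample(x, y, 0);
            }
        }
        Map<String, Double> features = new HashMap<>();
        features.put("meanIntensity", (double) sum / (img.getWidth() * img.getHeight()));
        return features;
    }

    // Stage 3: load the results out to a result sink (here, simply standard output).
    public static void main(String[] args) throws Exception {
        BufferedImage raw = ImageIO.read(new File(args[0]));
        extractFeatures(preprocess(raw))
                .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}

In a real BDA, the final stage would write the feature records to HDFS, HBase, or a messaging topic rather than to the console.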
Modern programmers are used to polyglot systems. Some of the need for a multilingual approach is
out of necessity: writing a dashboard for the Internet is appropriate for a language such as JavaScript, for
example, although one could write a dashboard using Java Swing in stand-alone or even web mode, under
duress. It’s all a matter of what is most effective and efficient for the application at hand. In this book, we
will embrace the polyglot philosophy, essentially using Java for Hadoop-based components, Scala for
Spark-based components, Python and scripting as needed, and JavaScript-based toolkits for the front end,
dashboards, and miscellaneous graphics and plotting examples.
Apache ZooKeeper (zookeeper.apache.org) is a distributed coordination service for use with a variety
of Hadoop- and Spark-based systems. It features a naming service, group membership, and locks and barriers
for distributed synchronization, as well as a highly reliable and centralized registry. ZooKeeper has a
hierarchical namespace data model consisting of “znodes.” Apache ZooKeeper is open source and is
supported by an interesting ancillary component called Apache Curator, a client wrapper for ZooKeeper
which is also a rich framework to support ZooKeeper-centric components. We will meet ZooKeeper and
Curator again when setting up a configuration to run the Kafka messaging system.
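As a small, concrete illustration, the sketch below uses Curator to store and read back a configuration value in ZooKeeper. It assumes a ZooKeeper server is listening on localhost:2181; the znode path and the property value are invented for the example.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

/** Minimal Curator sketch: write a configuration znode and read it back. */
public class ZkConfigExample
{
    public static void main(String[] args) throws Exception
    {
        // Retry with exponential backoff if the ZooKeeper ensemble is briefly unavailable.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            if (client.checkExists().forPath("/bda/config") == null) {
                client.create().creatingParentsIfNeeded()
                      .forPath("/bda/config", "kafka.broker=localhost:9092".getBytes());
            }
            byte[] data = client.getData().forPath("/bda/config");
            System.out.println("Stored configuration: " + new String(data));
        } finally {
            client.close();
        }
    }
}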
To use Apache Camel effectively, it's helpful to know about enterprise integration patterns (EIPs). There
are several good books about EIPs and they are especially important for using Apache Camel.2
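To give a flavor of what this Camel "glue" looks like in practice, the sketch below wires up a content-based router, one of the classic EIPs: files arriving in an input directory are routed to different sinks according to file type. The directory names are placeholders; a real pipeline would just as easily route to Kafka topics, HDFS, or other endpoints.

import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

/** A content-based router: one Camel route, three data sinks. */
public class GlueRouteExample
{
    public static void main(String[] args) throws Exception
    {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                from("file:data/incoming")                              // data source
                    .choice()
                        .when(header("CamelFileName").endsWith(".csv"))
                            .to("file:data/csv")                        // one sink per content type
                        .when(header("CamelFileName").endsWith(".png"))
                            .to("file:data/images")
                        .otherwise()
                            .to("file:data/other");
            }
        });
        context.start();
        Thread.sleep(10000);    // let the route poll for a while, then shut down
        context.stop();
    }
}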
2 The go-to book on Enterprise Integration Patterns (EIPs) is Gregor Hohpe and Bobby Woolf’s Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions, 2004, Pearson Education Inc., Boston, MA. ISBN 0-321-20068-3.
Figure 1-7. A relationship diagram between Hadoop and other Apache search-related components
The “software architecture” metaphor breaks down because of certain realities about software
development. If you are building a luxury hotel and you suddenly decide you want to add personal spa
rooms or a fireplace to each suite, it’s a problem. It’s difficult to redesign floor plans, or to change which brand of carpet
to use. There’s a heavy penalty for changing your mind. Occasionally we must break out of the building
metaphor and take a look at what makes software architecture fundamentally different from its metaphor.
Most of this difference has to do with the dynamic and changeable nature of software itself.
Requirements change, data changes, software technologies evolve rapidly. Clients change their minds
about what they need and how they need it. Experienced software engineers take this plastic, pliable nature
of software for granted, and these realities—the fluid nature of software and of data—impact everything
from toolkits to methodologies, particularly the Agile-style methodologies, which assume rapidly changing
requirements almost as a matter of course.
These abstract ideas influence our practical software architecture choices. In a nutshell, when designing
big data analytical systems, standard architectural principles which have stood the test of time still apply. We
can use organizational principles common to any standard Java programming project, for example. We can
use enterprise integration patterns (EIPs) to help organize and integrate disparate components throughout
our project. And we can continue to use traditional n-tier, client-server, or peer-to-peer principles to
organize our systems, if we wish to do so.
As architects, we must also be aware of how distributed systems in general—and Hadoop in particular—
change the equation of practical system building. The architect must take into consideration the patterns
that apply specifically to Hadoop technologies: for example, MapReduce patterns and anti-patterns.
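For readers who have not written a MapReduce job in a while, the sketch below shows the canonical word-count mapper and reducer, the simplest instance of the MapReduce patterns referred to above. It is a generic reminder of the programming model rather than one of this book's example systems.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** The classic word-count map and reduce functions. */
public class WordCountSketch
{
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable>
    {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);            // emit (word, 1) for each token
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // emit (word, total count)
        }
    }
}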
Knowledge is key. So in the next section, we’ll tell you what you need to know in order to build effective
Hadoop BDAs.
This includes the “classic ecosystem” components such as Hive, Pig, and HBase, as well as glue components
such as Apache Camel, Spring Framework, the Spring Data sub-framework, and Apache Kafka messaging
system. If you are interested in using relational data sources, a knowledge of JDBC and Spring Framework
JDBC as practiced in standard Java programming will be helpful. JDBC has made a comeback in components
such as Apache Phoenix (phoenix.apache.org), an interesting combination of relational and Hadoop-based
technologies. Phoenix provides low-latency queries over HBase data, using standard SQL syntax in the
queries. Phoenix is available as a client-embedded JDBC driver, so an HBase cluster may be accessed with
a single line of Java code. Apache Phoenix also provides support for schema definitions, transactions, and
metadata.
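A minimal sketch of such a query through plain JDBC is shown below; the ZooKeeper quorum in the connection URL ("localhost") and the EVENTS table are placeholders for whatever your cluster and schema actually contain.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Querying HBase data through Apache Phoenix using ordinary JDBC. */
public class PhoenixQueryExample
{
    public static void main(String[] args) throws Exception
    {
        // The Phoenix JDBC URL names the ZooKeeper quorum of the HBase cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT event_type, COUNT(*) FROM EVENTS GROUP BY event_type")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " : " + rs.getLong(2));
            }
        }
    }
}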
■■Note One of the best references for setting up and effectively using Hadoop is the book Pro Apache
Hadoop, second edition, by Jason Venner and Sameer Wadkhar, available from Apress Publishing.
Table 1-3. A sampling of BDA components in and used with the Hadoop Ecosystem
It’s pretty straightforward to create a dashboard or front-end user interface using these libraries or
similar ones. Most of the advanced JavaScript libraries contain efficient APIs to connect with databases,
RESTful web services, or Java/Scala/Python applications.
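A typical arrangement is for the JavaScript front end to poll a small REST endpoint on the Java side for its data. The hypothetical Spring controller below sketches what such an endpoint might look like; the route and the summary fields are invented purely for illustration.

import java.util.LinkedHashMap;
import java.util.Map;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

/** A hypothetical REST endpoint a JavaScript dashboard could poll for analysis results. */
@RestController
public class ResultsController
{
    @RequestMapping("/api/results/summary")
    public Map<String, Object> summary()
    {
        Map<String, Object> payload = new LinkedHashMap<>();
        payload.put("recordsProcessed", 128000);   // placeholder values; in a real system
        payload.put("anomaliesFound", 17);         // these would come from the data tier
        return payload;
    }
}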
Figure 1-8. Simple data visualization displayed on a world map, using the DevExpress toolkit
Big data analysis with Hadoop is something special. For the Hadoop system architect, Hadoop BDA
provides and allows the leverage of standard, mainstream architectural patterns, anti-patterns, and
strategies. For example, BDAs can be developed using the standard ETL (extract-transform-load) concepts,
as well as the architectural principles for developing analytical systems “within the cloud.” Standard system
modeling techniques still apply, including the “application tier” approach to design.
One example of an application tier design might contain a “service tier” (which provides the “computational engine” or “business logic” of the application), a data tier (which stores and regulates input and output data, as well as data sources and sinks), and an output tier accessed by the system user, which provides content to output devices. The output tier is usually referred to as a “web tier” when content is supplied to a web browser.
In this book, we express a lot of our examples in a Mac OS X environment. This is by design. The main
reason we use the Mac environment is that it seemed the best compromise between a Linux/Unix
syntax (which, after all, is where Hadoop lives and breathes) and a development environment on a more
modest scale, where a developer could try out some of the ideas shown here without the need for a
large Hadoop cluster or even more than a single laptop. This doesn’t mean you cannot run Hadoop on a
Windows platform in Cygwin or a similar environment if you wish to do so.
A simple data pipeline is shown in Figure 1-9. In a way, this simple pipeline is the “Hello world” program
when thinking about BDAs. It corresponds to the kind of straightforward mainstream ETL (extract-
transform-load) process familiar to all data analysts. Successive stages of the pipeline transform the
previous output contents until the data is emitted to the final data sink or result repository.
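As a toy illustration of that idea, the self-contained Java sketch below reads records from a source file, applies a trivial transformation, and writes the results to a sink file. The file names and the upper-casing "business logic" are placeholders only.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** A "hello world" ETL pipeline: extract, transform, and load a small CSV file. */
public class HelloEtlPipeline
{
    public static void main(String[] args) throws IOException
    {
        // Extract: read raw records from the data source.
        List<String> raw = Files.readAllLines(Paths.get("data/input.csv"), StandardCharsets.UTF_8);

        // Transform: drop blank lines and normalize case (a stand-in for real business logic).
        List<String> transformed = raw.stream()
                .filter(line -> !line.trim().isEmpty())
                .map(line -> line.toUpperCase(Locale.ROOT))
                .collect(Collectors.toList());

        // Load: emit the records to the result repository (here, another file).
        Files.write(Paths.get("data/output.csv"), transformed, StandardCharsets.UTF_8);
    }
}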
Figure 1-10. A useful IDE for development: Eclipse IDE with Maven and Scala built in
In mainstream application development—most of the time—we only encounter a few basic types of
data sources: relational, various file formats (including raw unstructured text), comma-separated values,
or even images (perhaps streamed data or even something more exotic like the export from a graph
database such as Neo4j). In the world of big data analysis, signals, images, and non-structured data
of many kinds may be used. These may include spatial or GPS information, timestamps from sensors,
and a variety of other data types, metadata, and data formats. In this book, particularly in the examples,
we will expose you to a wide variety of common as well as exotic data formats, and provide hints on
how to do standard ETL operations on the data. When appropriate, we will discuss data validation,
compression, and conversion from one data format into another, as needed.
Throughout the book, we will describe useful Hadoop ecosystem components, particularly those which
are relevant to the example systems we will be building throughout the rest of this book. These components
are building blocks for our BDAs or Big Data Analysis components, so the book will not be discussing the
component functionality in depth. In the case of standard Hadoop-compatible components like Apache
Lucene, Solr, or Apache Camel or Spring Framework, books and Internet tutorials abound.
We will also not be discussing methodologies (such as iterative or Agile methodologies) in depth,
although these are very important aspects of building big data analytical systems. We hope that the systems
we are discussing here will be useful to you regardless of what methodology style you choose.
In this section we give a thumbnail sketch of how to build the BDA evaluation system. When completed
successfully, this will give you everything you need to evaluate code and examples discussed in the
rest of the book. The individual components have complete installation directions at their respective
web sites.
1. Set up your basic development environment if you have not already done so. This
includes Java 8.0, Maven, and the Eclipse IDE. For the latest installation instructions
for Java, visit oracle.com. Don’t forget to set the appropriate environment variables
accordingly, such as JAVA_HOME. Download and install Maven (maven.apache.org),
and set the M2_HOME environment variable. To make sure Maven has been
installed correctly, type mvn --version on the command line. Also type ‘which mvn’
on the command line to ensure the Maven executable is where you think it is.
2. Ensure that MySQL is installed. Download the appropriate installation package from
www.mysql.com/downloads. Use the sample schema and data included with this
book to test the functionality. You should be able to run ‘mysql’ and ‘mysqld’.
3. Install the Hadoop Core system. In the examples in this book we use Hadoop
version 2.7.1. If you are on the Mac you can use HomeBrew to install Hadoop, or
download from the web site and install according to directions. Set the HADOOP_HOME
environment variable in your .bash_profile file.
4. Ensure that Apache Spark is installed. Experiment with a single-machine cluster by
following the instructions at http://spark.apache.org/docs/latest/spark-standalone.html#installing-spark-standalone-to-a-cluster.
Spark is a key component for the evaluation system. Make sure the SPARK_HOME
environment variable is set in your .bash_profile file.
Figure 1-11. Successful installation and run of Apache Spark results in a status page at localhost:8080
To make sure the Spark system is executing correctly, run the program from the
SPARK_HOME directory.
./bin/run-example SparkPi 10
Figure 1-12. To test your Spark installation, run the Spark Pi estimator program. A console view of some
expected results.
8. Install Apache Solr (lucene.apache.org/solr). Download the Solr server zip file,
unzip, and follow additional directions from the README file. This configurable
Java-based search component can be seamlessly coupled with Hadoop to provide
sophisticated, scalable, and customizable search capabilities, leveraging Hadoop
and Spark infrastructure.
9. Install the Scala programming language and Akka. Make sure that you have a
supported Scala plug-in in your Eclipse IDE. Make sure Scala and the Scala compiler
are installed correctly by typing ‘scalac -version’ and ‘which scala’ on the
command line.
10. Install Python and IPython. On MacOS systems, Python is already available. You may
wish to install the Anaconda system, which provides Python, interactive Python, and
many useful libraries all as one package.
11. Install H2O (h2o.ai) and Sparkling Water. Once Apache Spark and Akka are installed,
we can install the H2O and Sparkling Water components.
12. Install appropriate “glue” components. Spring Framework, Spring Data, Apache Camel,
and Apache Tika should be installed. There are already appropriate dependencies for these
components in the Maven pom.xml shown in Appendix A. You may wish to install some
ancillary components such as SpatialHadoop, distributed Weka for Hadoop, and others.
When you have installed all these components, congratulations. You now have a basic software
environment in which you can thoroughly investigate big data analytics systems (BDAs). Using this basic
system as a starting point, we are ready to explore the individual modules as well as to write some extensions
to the basic BDA functionality provided.
1.16 Summary
In this introductory chapter we looked at the changing landscape of big data, methods to ingest, analyze,
store, visualize, and understand the ever-increasing ocean of big data in which we find ourselves. We learned
that big data sources are varied and numerous, and that these big data sources pose new and challenging
questions for the aspiring big data analyst. One of the major challenges facing the big data analyst today is
making a selection between all the libraries and toolkits, technology stacks, and methodologies available for
big data analytics.
We also took a brief overview of the Hadoop framework, both core components and associated
ecosystem components. In spite of this necessarily brief tour of what Hadoop and its ecosystem can do for
us as data analysts, we then explored the architectures and strategies that are available to us, with a mind
towards designing and implementing effective Hadoop-based analytical systems, or BDAs. These systems
will have the scalability and flexibility to solve a wide spectrum of analytical challenges.
The data analyst has a lot of choices when it comes to selecting big data toolkits, and being able to
navigate through the bewildering list of features in order to come up with an effective overall technology
stack is key to successful development and deployment. We keep it simple (as simple as possible, that
is) by focusing on components which integrate relatively seamlessly with the Hadoop Core and its
ecosystem.
Throughout this book we will attempt to prove to you that the design and implementation steps
outlined above can result in workable data pipeline architectures and systems suitable for a wide range of
domains and problem areas. Because of the flexibility of the systems discussed, we will be able to “swap out”
modular components as technology changes. We might find that one machine learning or image processing
library is more suitable to use, for example, and we might wish to replace the currently existing application
library with one of these. Having a modular design in the first place allows us the freedom of swapping out
components easily. We’ll see this principle in action when we develop the “image as big data” application
example in a later chapter.
In the next chapter, we will take a quick overview and refresher of two of the most popular languages
for big data analytics—Scala and Python—and explore application examples where these two languages are
particularly useful.
CHAPTER 2
A Scala and Python Refresher
This chapter contains a quick review of the Scala and Python programming languages used throughout the
book. The material discussed here is primarily aimed at Java/C++ programmers who need a quick review of
Scala and Python.
■■Note A painless way to install Python is to install the Anaconda Python distribution, available at
www.continuum.io/downloads. Anaconda provides many additional Python libraries for math and analytics,
including support for Hadoop, Weka, R, and many others.
2.2 Review of Scala
This short review of the Scala language consists of five simple code snippets which highlight a variety of
language features that we described in our introductory sections. Scala is particularly interesting because of
built-in language features such as type inference, closures, currying, and more. Scala also has a sophisticated
object system: each value is an object, every operation a method invocation. Scala is also compatible with
Java programs. Modern languages always include support for standard data structures, sets, arrays, and
vectors. Scala is no exception, and because Scala has a very close affinity to Java, all the data structures
familiar to you from Java programming still apply.
■■Note In this book we will be discussing Scala version 2.11.7. Type ‘scala -version’ on the command line
to check your installed version of Scala. You may also check your Scala compiler version by typing
‘scalac -version’ on the command line.
Listing 2-1. Simple example of a Scala program which can be tried out in the interactive shell
/** An example of a quicksort implementation, this one uses a functional style. */
object Sorter {
  def sortRoutine(lst: List[Int]): List[Int] = {
    if (lst.length < 2)
      lst
    else {
      val pivel = lst(lst.length / 2)
      sortRoutine(lst.filter(_ < pivel)) :::
        lst.filter(_ == pivel) :::
        sortRoutine(lst.filter(_ > pivel))
    }
  }
}
You can easily use Spark in any of the interactive Scala shells, as shown in Listing 2-3.
import java.util.HashMap
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
/**
* Consumes messages from one or more topics in Kafka and does wordcount.
* Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>
* <zkQuorum> is a list of one or more zookeeper servers that make quorum
* <group> is the name of kafka consumer group
* <topics> is a list of one or more kafka topics to consume from
* <numThreads> is the number of threads the kafka consumer should use
*
* Example:
* `$ bin/run-example \
* org.apache.spark.examples.streaming.KafkaWordCount zoo01,zoo02,zoo03 \
* my-consumer-group topic1,topic2 1`
*/
object KafkaWordCount {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Standard body of the Spark streaming KafkaWordCount example.
    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint("checkpoint")

    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
Lazy evaluation is a “call-by-need” strategy implementable in any of our favorite languages. A simple
example of a lazy evaluation exercise is shown in Listing 2-5.
package probdalazy
object lazyLib {
/**
* Data type of suspended computations. (The name comes from ML.)
*/
abstract class Susp[+A] extends Function0[A]
/**
* Implementation of suspended computations, separated from the
* abstract class so that the type parameter can be invariant.
*/
class SuspImpl[A](lazyValue: => A) extends Susp[A] {
private var maybeValue: Option[A] = None
object lazyEvaluation {
import lazyLib._
2.3 Review of Python
In this section we provide a very succinct overview of the Python programming language. Python is a
particularly useful resource for building BDAs because of its advanced language features and seamless
compatibility with Apache Spark. Like Scala and Java, Python has thorough support for all the usual
data structure types you would expect. There are many advantages to using the Python programming
language for building at least some of the components in a BDA system. Python has become a mainstream
development language in a relatively short amount of time, partly because it is an easy language to learn. The interactive shell allows for quick experimentation and makes it easy to try out new ideas. Many numerical and scientific libraries exist to support Python, and there are many good books and online tutorials for learning the language and its support libraries.
■■Note Throughout the book we will be using Python version 2.7.6 and interactive Python (IPython) version 4.0.0. To check the versions of Python you have installed, type python --version or ipython --version respectively on the command line.
■■Note To run database connectivity examples, please keep in mind we are primarily using the MySQL database from Oracle. This means you must download and install the MySQL connector for Python from the Oracle web site, located at https://dev.mysql.com/downloads/connector/python/2.1.html. The connector is easy to install. On the Mac, simply double-click on the dmg file and follow the directions. You can then test connectivity using an interactive Python shell.
A simple example of database connectivity in Python is shown in Listing 2-6. Readers familiar with Java
JDBC constructs will see the similarity. This simple example makes a database connection, then closes it.
Between the two statements the programmer can access the designated database, define tables, and perform
relational queries.
import mysql.connector

# The connection parameters below are placeholders; substitute your own host, user, and database.
cnx = mysql.connector.connect(user='scott', password='tiger',
                              host='127.0.0.1', database='test')
# ... define tables and perform relational queries here ...
cnx.close()
Algorithms of all kinds are easily implemented in Python, and there is a wide range of libraries to assist you. Recursion and all of the standard programming structures are available. A simple example of a recursive program is shown in Listing 2-7.
Just as with Java and Scala, it’s easy to include support packages with the Python “import” statement. A
simple example of this is shown in Listing 2-8.
Planning your import lists explicitly is key to keeping a Python program organized and coherent for the development team and others using the Python code.
import time
import numpy as np

size_of_vec = 1000

def pure_python_version():
    t1 = time.time()
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = []
    for i in range(len(X)):
        Z.append(X[i] + Y[i])
    return time.time() - t1

def numpy_version():
    t1 = time.time()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return time.time() - t1

t1 = pure_python_version()
t2 = numpy_version()
print(t1, t2)
print("Pro Data Analytics Hadoop: NumPy in this example is " + str(t1/t2) + " times faster!")

Pro Data Analytics Hadoop: NumPy in this example is 7.75 times faster!
import os
filename = os.environ.get('PYTHONSTARTUP')
if filename and os.path.isfile(filename):
with open(filename) as fobj:
startup_file = fobj.read()
exec(startup_file)
import site
site.getusersitepackages()
import pdb
import yourmodule
pdb.run('yourmodule.test()')
For profiling Python, Robert Kern's very useful line profiler (https://pypi.python.org/pypi/line_profiler/1.0b3) may be installed by typing the following on the command line:
pip install line_profiler
Why not test your profilers by writing a simple Python program to generate primes, a Fibonacci series,
or some other small routine of your choice?
Figure 2-4. Profiling Python code using memory and line profilers
2.4.2 Documentation of Python
When documenting Python code, it's very helpful to take a look at the documentation style guide from python.org. This can be found at https://docs.python.org/devguide/documenting.html.
2.6 Summary
In this chapter, we reviewed the Scala and Python programming languages, and compared them with Java.
Hadoop is a Java-centric framework while Apache Spark is written in Scala. Most commonly used BDA
components typically have language bindings for Java, Scala, and Python, and we discussed some of these
components at a high level.
Each of the languages has particular strengths and we were able to touch on some of the appropriate
use cases for Java, Scala, and Python.
We reviewed ways to troubleshoot, debug, profile, and document BDA systems, regardless of what
language we’re coding the BDAs in, and we discussed a variety of plug-ins available for the Eclipse IDE to
work with Python and Scala.
In the next chapter, we will be looking at the necessary ingredients for BDA development: the
frameworks and libraries necessary to build BDAs using Hadoop and Spark.
2.7 References
Bowles, Michael. Machine Learning in Python: Essential Techniques for Predictive Analysis. Indianapolis, IN: John Wiley and Sons, Inc., 2015.
Hurwitz, Judith S., Kaufman, Marcia, Bowles, Adrian. Cognitive Computing and Big Data Analytics.
Indianapolis, IN: John Wiley and Sons, Inc., 2015.
Odersky, Martin, Spoon, Lex, and Venners, Bill. Programming in Scala, Second Edition. Walnut Creek,
CA: Artima Press, 2014.
Younker, Jeff. Foundations of Agile Python Development. New York, NY: Apress/Springer-Verlag
New York, 2008.
Ziade, Tarek. Expert Python Programming. Birmingham, UK: PACKT Publishing, 2008.
CHAPTER 3
Standard Toolkits for Hadoop and Analytics
In this chapter, we take a look at the necessary ingredients for a BDA system: the standard libraries and
toolkits most useful for building BDAs. We describe an example system (which we develop throughout the
remainder of the book) using standard toolkits from the Hadoop and Spark ecosystems. We also use other
analytical toolkits, such as R and Weka, with mainstream development components such as Ant, Maven,
npm, pip, Bower, and other system building tools. “Glueware components” such as Apache Camel, Spring
Framework, Spring Data, Apache Kafka, Apache Tika, and others can be used to create a Hadoop-based
system appropriate for a variety of applications.
■■Note A successful installation of Hadoop and its associated components is key to evaluating the examples in this book. A relatively painless way to do the Hadoop installation on the Mac is described in a post titled "Installing Hadoop on the Mac Part I" at http://amodernstory.com/2014/09/23/installing-hadoop-on-mac-osx-yosemite/.
Figure 3-1. A whole spectrum of distributed techniques is available for building BDAs
One of the easiest ways to build a modular BDA system is to use Apache Maven to manage the
dependencies and do most of the simple component management for you. Setting up a simple Maven
pom.xml file and creating a simple project in Eclipse IDE is a good way to get the evaluation system going.
We can start with a simple Maven pom.xml similar to the one shown in Listing 3-1. Please note the only
dependencies shown are for the Hadoop Core and Apache Mahout, the machine learning toolkit for Hadoop
we discussed in Chapter 1, which we use frequently in the examples. We will extend the Maven pom file to
include all the ancillary toolkits we use later in the book. You can add or subtract components as you wish,
simply by removing dependencies from the pom.xml file.
Keep in mind that for every technique shown in the diagram, there are several alternatives. For each
choice in the technology stack, there are usually convenient Maven dependencies you can add to your
evaluation system to check out the functionality, so it’s easy to mix and match components. Including the
right “glueware” components can make integration of different libraries less painful.
■■Note The following important environment variables need to be set to use the book examples effectively:
export BDA_HOME="/Users/kerryk/workspace/bdt"
Using a simple pom.xml to get your BDA project started is a good way to experiment with modules, lock in your technology stack, and define system functionality, gradually modifying your dependencies and plug-ins as necessary.
Let's add a rule system to the evaluation system by way of an example: simply add the appropriate dependencies for the Drools rule system (Google "drools maven dependencies" for the most up-to-date Drools versions). The complete pom.xml file (building upon our original) is shown in Listing 3-2. We will be
leveraging the functionality of JBoss Drools in a complete analytical engine example in Chapter 8. Please
note that we supply dependencies to connect the Drools system with Apache Camel as well as Spring
Framework for Drools.
Listing 3-2. Add JBoss Drools dependencies to add rule-based support to your analytical engine. A complete
example of a Drools use case is in Chapter 8!
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.kildane</groupId>
<artifactId>bdt</artifactId>
<packaging>war</packaging>
<version>0.0.1-SNAPSHOT</version>
<name>Big Data Toolkit (BDT) Application, with JBoss Drools Component</name>
<url>http://maven.apache.org</url>
<properties>
<hadoop.version>0.20.2</hadoop.version>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<!-- add these five dependencies to your BDA project to achieve rule-based support -->
<dependency>
<groupId>org.drools</groupId>
<artifactId>drools-core</artifactId>
<version>6.3.0.Final</version>
</dependency>
<dependency>
<groupId>org.drools</groupId>
<artifactId>drools-persistence-jpa</artifactId>
<version>6.3.0.Final</version>
</dependency>
<dependency>
<groupId>org.drools</groupId>
<artifactId>drools-spring</artifactId>
<version>6.0.0.Beta2</version>
</dependency>
<dependency>
<groupId>org.drools</groupId>
<artifactId>drools-camel</artifactId>
<version>6.0.0.Beta2</version>
</dependency>
<dependency>
<groupId>org.drools</groupId>
<artifactId>drools-jsr94</artifactId>
<version>6.3.0.Final</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.mahout</groupId>
<artifactId>mahout-core</artifactId>
<version>0.9</version>
</dependency>
</dependencies>
<build>
<finalName>BDT</finalName>
</build>
</project>
cd to the $DL4J_HOME directory and start the component. You will see textual output indicating that the component is running successfully.
To use the deeplearning4j component in our evaluation system, we will now require the most
extensive changes to our BDA pom file to date. The complete file is shown in Listing 3-4.
<hadoop.version>0.20.2</hadoop.version>
<mahout.version>0.9</mahout.version>
</properties>
<!-- distribution management for dl4j -->
<distributionManagement>
<snapshotRepository>
<id>sonatype-nexus-snapshots</id>
<name>Sonatype Nexus snapshot repository</name>
<url>https://oss.sonatype.org/content/repositories/snapshots</url>
</snapshotRepository>
<repository>
<id>nexus-releases</id>
<name>Nexus Release Repository</name>
<url>http://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
</repository>
</distributionManagement>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.nd4j</groupId>
<artifactId>nd4j-jcublas-7.5</artifactId>
<version>${nd4j.version}</version>
</dependency>
</dependencies>
</dependencyManagement>
<repositories>
<repository>
<id>pentaho-releases</id>
<url>http://repository.pentaho.org/artifactory/repo/</url>
</repository>
</repositories>
<dependencies>
<!-- dependencies for dl4j components -->
<dependency>
<groupId>org.deeplearning4j</groupId>
<artifactId>deeplearning4j-nlp</artifactId>
<version>${dl4j.version}</version>
</dependency>
<dependency>
<groupId>org.deeplearning4j</groupId>
<artifactId>deeplearning4j-core</artifactId>
<version>${dl4j.version}</version>
</dependency>
<dependency>
<groupId>org.nd4j</groupId>
<artifactId>nd4j-x86</artifactId>
<version>${nd4j.version}</version>
</dependency>
<dependency>
<groupId>org.jblas</groupId>
<artifactId>jblas</artifactId>
<version>1.2.4</version>
</dependency>
<dependency>
<artifactId>canova-nd4j-image</artifactId>
<groupId>org.nd4j</groupId>
<version>${canova.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.dataformat</groupId>
<artifactId>jackson-dataformat-yaml</artifactId>
<version>${jackson.version}</version>
</dependency>
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solandra</artifactId>
<version>UNKNOWN</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>pentaho</groupId>
<artifactId>mondrian</artifactId>
<version>3.6.0</version>
</dependency>
<!-- add these five dependencies to your BDA project to achieve rule-based
support -->
<dependency>
<groupId>org.drools</groupId>
<artifactId>drools-core</artifactId>
<version>6.3.0.Final</version>
</dependency>
<dependency>
<groupId>org.drools</groupId>
<artifactId>drools-persistence-jpa</artifactId>
<version>6.3.0.Final</version>
</dependency>
<dependency>
<groupId>org.drools</groupId>
<artifactId>drools-spring</artifactId>
<version>6.0.0.Beta2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.5.1</version>
</dependency>
<dependency>
<groupId>org.drools</groupId>
<artifactId>drools-camel</artifactId>
<version>6.0.0.Beta2</version>
</dependency>
<dependency>
<groupId>org.drools</groupId>
<artifactId>drools-jsr94</artifactId>
<version>6.3.0.Final</version>
</dependency>
<dependency>
<groupId>com.github.johnlangford</groupId>
<artifactId>vw-jni</artifactId>
<version>8.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.mahout</groupId>
<artifactId>mahout-core</artifactId>
<version>${mahout.version}</version>
</dependency>
<dependency>
<groupId>org.apache.mahout</groupId>
<artifactId>mahout-math</artifactId>
<version>0.11.0</version>
</dependency>
<dependency>
<groupId>org.apache.mahout</groupId>
<artifactId>mahout-hdfs</artifactId>
<version>0.11.0</version>
</dependency>
</dependencies>
<build>
<finalName>BDT</finalName>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>1.4.0</version>
<executions>
<execution>
<goals>
<goal>exec</goal>
</goals>
</execution>
</executions>
<configuration>
<executable>java</executable>
</configuration>
</plugin>
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>1.6</version>
    <configuration>
        <createDependencyReducedPom>true</createDependencyReducedPom>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>org/datanucleus/**</exclude>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>reference.conf</resource>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                    </transformer>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
        <source>1.7</source>
        <target>1.7</target>
    </configuration>
</plugin>
</plugins>
</build>
</project>
After augmenting your BDA evaluation project to use this pom.xml, perform the Maven clean, install, and package tasks (mvn clean install package) to ensure your project compiles correctly.
The Pentaho repository and the Mondrian OLAP engine dependency used in the pom.xml above are:
<repository>
<id>pentaho-releases</id>
<url>http://repository.pentaho.org/artifactory/repo/</url>
</repository>
<dependency>
<groupId>pentaho</groupId>
<artifactId>mondrian</artifactId>
<version>3.6.0</version>
</dependency>
Apache Kylin provides an ANSI SQL interface and multi-dimensional analysis, leveraging Hadoop
functionalities. Business intelligence tools such as Tableau (get.tableau.com) are supported by Apache
Kylin as well.
We will be developing a complete analytical engine example using Apache Kylin to provide OLAP
functionality in Chapter 9.
For a good discussion of Vowpal Wabbit, and how to set up and run VW correctly, see http://zinkov.com/posts/2013-08-13-vowpal-tutorial/.
To install the VW system, you may need to install the Boost libraries first.
On Mac OS, type the following three commands (re-chmod your /usr/local/lib afterwards if you wish):
You may also want to investigate the very interesting web interface to VW, available at https://github.com/eHarmony/vw-webservice. To install:
Once you have the sparkling-water package installed successfully, you can use the Sparkling shell, shown in Figure 3-4, as your Scala shell; it already has some convenient hooks into Apache Spark.
■■Note Spark Streaming is actively under development, and information about it is constantly subject to change. Refer to http://spark.apache.org/docs/latest/streaming-programming-guide.html to get the latest information on Spark Streaming. In this book, we primarily refer to the Spark 1.5.1 version.
To add Spark Streaming to your Java project, add this dependency to your pom.xml file (get the most
recent version parameter to use from the Spark web site):
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.5.1</version>
</dependency>
A simplified diagram of the Spark Streaming system is shown in Figure 3-3. Input data streams are
processed through the Spark engine and emitted as batches of processed data.
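To give a feel for the Java side of this API, here is a minimal Spark Streaming word count sketch of our own (not one of the book's listings); it assumes the Spark 1.5.x Java API, Java 8 lambdas, and a plain text source on a local socket:

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SimpleStreamingWordCount {
    public static void main(String[] args) throws Exception {
        // "local[2]" is only for desktop testing; on a cluster the master is set by spark-submit
        SparkConf conf = new SparkConf().setAppName("SimpleStreamingWordCount").setMaster("local[2]");

        // Input is collected into two-second micro-batches and processed as small RDDs
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));

        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
        JavaPairDStream<String, Integer> counts =
                words.mapToPair(word -> new Tuple2<>(word, 1))
                     .reduceByKey((a, b) -> a + b);
        counts.print();   // emit each processed batch to the console

        jssc.start();
        jssc.awaitTermination();
    }
}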
Figure 3-4. Running the Sparkling Water shell to test your installation
2. cd to the Solandra directory, and create the JAR file with Maven:
cd Solandra
mvn -DskipTests clean install package
3. Add the JAR file to your local Maven repository, because there isn't a standard Maven dependency for Solandra yet:
4. Modify your BDA system pom.xml file and add the Solandra dependency:
<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solandra</artifactId>
    <version>UNKNOWN</version>
</dependency>
5. Test your new BDA pom.xml:
cd $BDA_HOME
mvn clean install package
In this section, we will discuss in detail how to set up and use the Apache Kafka messaging system, an important component of our example BDA framework.
At this stage, the result will be ProHadoopBDA0, the name of the topic you defined in step 5.
8. Send some messages from the console to test the message-sending functionality. Type:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic ProHadoopBDA0
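Messages can also be sent to the same topic programmatically. The following is a minimal sketch of our own, assuming the Kafka Java client library (kafka-clients, version 0.9 or later) is on the classpath and a broker is listening on localhost:9092, as in the console command above:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleTopicProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Send a few test messages to the ProHadoopBDA0 topic defined in the setup steps
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                producer.send(new ProducerRecord<>("ProHadoopBDA0", Integer.toString(i), "test message " + i));
            }
        }
    }
}

The messages can then be read back with the kafka-console-consumer.sh script, or by a consumer written against the same client library.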
For testing, throughout the book we will use test data sets from http://archive.ics.uci.edu/ml/machine-learning-databases/ as well as the database from Universita de Bologna at http://www.dm.unibo.it/~simoncin/DATA.html. For Python testing, we will be using PyUnit (a Python-based version of the Java unit testing JUnit framework) and pytest (pytest.org), an alternative Python test framework. A simple example of the Python testing component is shown in Listing 3-5.
import unittest

class TestStringMethods(unittest.TestCase):
    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

    def test_isupper(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        # check that s.split fails when the separator is not a string
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()
Figure 3-5. An architecture diagram for the "Sparkling Water" Spark + H2O system
3.11 Summary
In this chapter, we used the first cut of an extensible example system to help motivate our discussion about
standard libraries for Hadoop- and Spark-based big data analytics. We also learned that while there are
innumerable libraries, frameworks, and toolkits for a wide range of distributed analytic domains, all these
components may be tamed by careful use of a good development environment. We chose the Eclipse IDE,
Scala and Python plug-in support, and use of the Maven, npm, easy_install, and pip build tools to make our
lives easier and to help organize our development process. Using the Maven system alone, we were able to
integrate a large number of tools into a simple but powerful image processing module possessing many of
the fundamental characteristics of a good BDA data pipelining application.
Throughout this chapter, we have repeatedly returned to our theme of a modular design, showing how
a variety of data pipeline systems may be defined and built using the standard ten-step process we discussed
in Chapter 1. We also learned about the categories of libraries that are available to help us, including math,
statistical, machine learning, image processing, and many others. We discussed in detail how to install
and use the Apache Kafka messaging system, an important component we use in our example systems
throughout the rest of the book.
There are many language bindings available for these big data Hadoop packages, but we confined
our discussion to the Java, Scala, and Python programming languages. You are free to use other language
bindings when and if your application demands it.
We did not neglect testing and documentation of our example system. While these components are
often seen as “necessary evils,” “add-ons,” “frills,” or “unnecessary,” unit and integration testing remain
key components of any successful distributed system. We discussed MRUnit and Apache Bigtop as viable
testing tools to evaluate BDA systems. Effective testing and documentation lead to effective profiling and
optimization, as well as overall system improvement in many other ways.
We not only learned about Hadoop-centric BDA construction using Apache Mahout, but also about
using Apache Spark as a fundamental building block, using PySpark, MLlib, H2O, and Sparkling Water
libraries. Spark technologies for machine learning and BDA construction are now mature and useful ways to
leverage powerful machine learning, cognitive computing, and natural language processing libraries to build
and extend your own BDA systems.
3.12 References
Giacomelli, Piero. Apache Mahout Cookbook. Birmingham, UK: PACKT Publishing, 2013.
Gupta, Ashish. Learning Apache Mahout Classification. Birmingham, UK: PACKT Publishing, 2015.
Grainger, Trey, and Potter, Timothy. Solr in Action. Shelter Island, NY: Manning Publications, 2014.
Guller, Mohammed. Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large Scale Data Analysis. New York, NY: Apress/Springer Science+Business Media New York, 2015.
McCandless, Michael, Hatcher, Erik, and Gospodnetic, Otis. Lucene in Action, Second Edition.
Shelter Island, NY: Manning Publications, 2010.
Owen, Sean, Anil, Robert, Dunning, Ted, and Friedman, Ellen. Mahout in Action. Shelter Island,
NY: Manning Publications, 2011.
Turnbull, Doug, and Berryman, John. Relevant Search: With Applications for Solr and Elasticsearch.
Shelter Island, NY: Manning Publications, 2016.
CHAPTER 4
Relational, NoSQL, and Graph Databases
In this chapter, we describe the role of databases in distributed big data analysis. Database types include
relational databases, document databases, graph databases, and others, which may be used as data sources
or sinks in our analytical pipelines. Most of these database types integrate well with Hadoop ecosystem
components, as well as with Apache Spark. Connectivity between different kinds of databases and Hadoop/Apache Spark distributed processing may be provided by "glueware" such as Spring Data or Apache
Camel. We describe relational databases, such as MySQL, NoSQL databases such as Cassandra, and graph
databases such as Neo4j, and how to integrate them with the Hadoop ecosystem.
There is a spectrum of database types available for you to use, as shown in Figure 4-1. These include flat
files (even a CSV file is a kind of database), relational databases such as MySQL and Oracle, key-value data stores such as Redis, columnar databases such as HBase (part of the Hadoop ecosystem), as well as more exotic database types such as graph databases (including Neo4j, GraphX, and Giraph).
We can “abstract out” the concept of different database types as generic data sources, and come up
with a common API to connect with, process, and output the content of these data sources. This lets us
use different kinds of databases, as needed, in a flexible way. Sometimes it’s necessary to adopt a “plug
and play” approach for evaluation purposes or to construct proof-of-concept systems. In these instances,
it can be convenient to use a NoSQL database such as MongoDB, and compare performance with a
Cassandra database or even a graph database component. After evaluation, select the right database for your
requirements. Using the appropriate glueware for this purpose, whether it be Apache Camel, Spring Data, or
Spring Integration, is key to building a modular system that can be changed rapidly. Much of the glueware
code can remain the same, or similar to, the existing code base. Minimum re-work is required if the glueware
is selected appropriately.
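To make the "plug and play" idea concrete, one possible shape for such a common data-source API is sketched below. The interface and class names are invented purely for illustration; they are not part of Spring Data, Apache Camel, or the example system.

import java.util.List;
import java.util.Map;

/** A deliberately small abstraction over relational, NoSQL, and graph data sources. */
interface GenericDataSource extends AutoCloseable {
    void connect(Map<String, String> connectionProperties);
    List<Map<String, Object>> query(String queryString);   // rows as field-name/value maps
    void write(List<Map<String, Object>> records);
    @Override void close();
}

/** A pipeline stage only sees the abstraction, so swapping MongoDB for Cassandra
 *  (or a graph database) means supplying a different implementation, not new glue code. */
class PipelineStage {
    private final GenericDataSource source;
    private final GenericDataSource sink;

    PipelineStage(GenericDataSource source, GenericDataSource sink) {
        this.source = source;
        this.sink = sink;
    }

    void run(String extractionQuery) {
        sink.write(source.query(extractionQuery));
    }
}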
All database types shown above can be used as distributed system data sources, including relational
databases such as MySQL or Oracle. A typical ETL-based processing flow implemented using a relational
data source might look like the dataflow shown in Figure 4-2.
1. Cycle Start. The start of the processing cycle is an entry point for the whole system's operation. It's a point of reference for where to start scheduling the processing task, and a place to return to if the system has to undergo a reboot.
2. Reference Data Building. "Reference data" refers to the valid types of data which may be used in individual table fields or the "value" part of key-value pairs.
3. Source Extraction. Retrieve data from the original data sources and do any necessary preprocessing of the data. This might be a preliminary data cleansing or formatting step.
4. Validation Phase. The data is evaluated for consistency.
5. Data Transformation. "Business logic" operations are performed on the data sets to produce an intermediate result.
6. Load into Staging Tables/Data Caches or Repositories, if used. Staging tables are an intermediate data storage area, which may also be a cache or document database.
7. Report Auditing (for business rule compliance, or a diagnosis/repair stage). Compute and format report results and export them to a displayable format (which may be anything from CSV files to web pages to elaborate interactive dashboard displays). Other forms of report may indicate the efficiency of the data process, timings and performance data, system health data, and the like. These ancillary reports support the main reporting task, which is to coherently communicate the results of the data analytics operations on the original data source contents.
8. Publishing to Target Tables/Repositories. The results so far are exported to the designated output tables or data repositories, which may take a variety of forms, including key/value caches, document databases, or even graph databases.
9. Archive Backup Data. Having a backup strategy is just as important for graph data as for traditional data. Replication, validation, and efficient recovery are a must.
10. Log Cycle Status and Errors. We can make use of standard logging constructs, even at the level of Log4j in the Java code, or we may wish to build more sophisticated error logging and reporting if necessary.
Repeat as needed. You can elaborate the individual steps, or specialize them to your individual domain problems, as required.
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>gremlin-core</artifactId>
<version>3.2.0-incubating</version>
</dependency>
Once the dependency is in place in your Java project, you may program to the Java API as shown in Listings 4-1 and 4-2. See the online documentation at https://neo4j.com/developer/cypher-query-language/ and http://tinkerpop.incubator.apache.org for more information.
4.2 Examples in Cypher
To create a node in Cypher:
CREATE (kerry:Person { firstName: 'Kerry', lastName: 'Koitzsch' })
RETURN kerry
4.3 Examples in Gremlin
The Gremlin graph query language is an alternative to Cypher.
Add a new vertex in the graph
g.addVertex([firstName:'Kerry',lastName:'Koitzsch',age:'50']); g.commit();
This will require multiple statements. Note how the variables (jdoe and mj) are defined just by assigning
them a value from a Gremlin query.
g.addEdge(g.v(1),g.v(2),'coworker'); g.commit();
g.V.each{g.removeVertex(it)}
g.commit();
g.E.each{g.removeEdge(it)}
g.commit();
g.V('firstName','Kerry').each{g.removeVertex(it)}
g.commit();
g.removeVertex(g.v(1));
g.commit();
g.removeEdge(g.e(1));
g.commit();
To index the graph with a specific field you may want to search frequently (in this example, "frequentSearch"):
g.createKeyIndex("frequentSearch",Vertex.class);
Graphs may also be constructed using the Java API for TinkerPop. In these examples, we will be using the cutting-edge version at the time this book was written (3.2.0-incubating).
For a thorough discussion of the TinkerPop system, please see http://tinkerpop.apache.org.
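As a small illustration of that Java API, the sketch below (our own code, not one of the book's listings) uses the in-memory TinkerGraph reference implementation, which requires the tinkergraph-gremlin artifact in addition to gremlin-core:

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class TinkerPopExample {
    public static void main(String[] args) {
        Graph graph = TinkerGraph.open();   // in-memory reference graph, no server required

        // Add two vertices and a "coworker" edge, echoing the Gremlin shell examples above
        Vertex kerry = graph.addVertex(T.label, "person", "firstName", "Kerry", "lastName", "Koitzsch");
        Vertex jdoe = graph.addVertex(T.label, "person", "firstName", "John", "lastName", "Doe");
        kerry.addEdge("coworker", jdoe);

        // Traverse the graph: print the first names of Kerry's coworkers
        GraphTraversalSource g = graph.traversal();
        g.V().has("firstName", "Kerry").out("coworker").values("firstName")
             .forEachRemaining(System.out::println);
    }
}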
For the purposes of managing data, reference data consists of value sets, status codes, or classification schemas: these are the data objects appropriate for transactions. If we imagine making an ATM withdrawal
transaction, for example, we can imagine the associated status codes for such a transaction, such as
“Succeeded (S),” “Canceled (CN),” “Funds Not Available (FNA),” “Card Cancelled (CC),” etc.
Reference data is generally uniform, company-wide, and can be either created within a country or
by external standardization bodies. Some types of reference data, such as currencies and currency codes,
are always standardized. Others, such as the positions of employees within an organization, are less
standardized.
Master data and associated transactional data are grouped together as part of transactional records.
Reference data is usually highly standardized, either within the company itself, or by a standardization
code supplied by external authorities set up for the purposes of standardization.
Data objects which are relevant to transaction processes are referred to as reference data. These objects
may be classification schemas, value sets, or status objects.
Logging cycle status and errors can be as simple as setting the "log levels" in the Java components and letting program-based logging do the rest, or as elaborate as constructing whole systems to do sophisticated logging, monitoring, alerts, and custom reporting. In most cases it is not enough to trust the Java logs alone, of course.
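At the simple end of that spectrum, a processing step can report its cycle status with nothing more than a class-level logger. The sketch below uses the Log4j 1.x API; the class name and messages are placeholders of our own:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class CycleStatusLogger {
    private static final Logger LOGGER = Logger.getLogger(CycleStatusLogger.class);

    public static void main(String[] args) {
        LOGGER.setLevel(Level.INFO);   // the "log level"; normally configured in log4j.properties
        LOGGER.info("ETL cycle started");
        try {
            // ... run a processing step here ...
            LOGGER.info("ETL cycle completed successfully");
        } catch (Exception e) {
            LOGGER.error("ETL cycle failed", e);   // keep the stack trace for later diagnosis
        }
    }
}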
A simple graph database application based on the model-view-controller (MVC) pattern is shown in Figure 4-3. The graph query language can be either Cypher or Gremlin, the two graph query languages we discussed earlier in the chapter.
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-neo4j</artifactId>
<version>4.1.1.RELEASE</version>
</dependency>
Be sure to remember to supply the correct version number, or make it one of the properties in your
pom.xml <properties> tag.
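With that dependency in place, graph entities are typically modeled as annotated POJOs. A minimal sketch, assuming the Neo4j OGM annotations that Spring Data Neo4j 4.x uses (the class and property names below are invented for illustration):

import org.neo4j.ogm.annotation.GraphId;
import org.neo4j.ogm.annotation.NodeEntity;

@NodeEntity
public class CrimeReport {
    @GraphId
    private Long id;                  // internal Neo4j node id, managed by the OGM

    private String district;
    private String crimeDescription;

    public CrimeReport() { }          // the OGM requires a no-argument constructor

    public String getDistrict() { return district; }
    public void setDistrict(String district) { this.district = district; }
    public String getCrimeDescription() { return crimeDescription; }
    public void setCrimeDescription(String crimeDescription) { this.crimeDescription = crimeDescription; }
}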
Graph databases can be useful for a number of purposes in a Hadoop-centric system. They can be
intermediate result repositories, hold the final results from a computation, or even provide some relatively
simple visualization capabilities “out of the box” for dashboarding components, as shown in Figure 4-4.
Let’s try a simple load-and-display Neo4j program to get started. The program uses the standard pom.
xml included with the “Big Data Analytics Toolkit” software included with this book: This pom.xml includes
the necessary dependencies to run our program, which is shown in Listing 4-1.
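A minimal load-and-display sketch along these lines, written against the embedded Neo4j 3.x Java API (our own illustration rather than the book's listing; the database path and node properties are placeholders), creates a few nodes, reads them back with a Cypher query, and prints the results:

import java.io.File;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class LoadAndDisplayNeo4j {
    public static void main(String[] args) {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase(new File("target/probda-neo4j"));

        // Load: create a couple of Person nodes inside a transaction
        try (Transaction tx = db.beginTx()) {
            Node kerry = db.createNode(Label.label("Person"));
            kerry.setProperty("name", "Kerry");
            Node john = db.createNode(Label.label("Person"));
            john.setProperty("name", "John");
            tx.success();
        }

        // Display: read the nodes back with a Cypher query and print each name
        try (Transaction tx = db.beginTx()) {
            Result result = db.execute("MATCH (p:Person) RETURN p.name AS name");
            while (result.hasNext()) {
                System.out.println(result.next().get("name"));
            }
            tx.success();
        }
        db.shutdown();
    }
}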
As with most of the components we discuss in this book, Apache Lens is easy to install. Download the most recent version from the web site (for our version this was http://www.apache.org/dyn/closer.lua/lens/2.5-beta), expand the zipped TAR file, and run the build.
The Lens system, including the Apache Lens UI component, will build, as shown in Figure 4-6.
Log in to Apache Lens by going to the localhost:8784 default Lens web page in any browser. Your login screen will appear as in Figure 4-8.
Run the Lens REPL by typing:
./lens-cli.sh
You will see a result similar to Figure 4-7. Type ‘help’ in the interactive shell to see a list of OLAP
commands you can try.
Figure 4-8. Apache Lens login page. Use 'admin' for the default username and 'admin' for the default password.
And then:
mvn verify
Use
bin/zeppelin-daemon.sh start
to start the Zeppelin server, and
bin/zeppelin-daemon.sh stop
to stop it. Run the introductory tutorial to test the use of Zeppelin at https://zeppelin.apache.org/docs/0.6.0/quickstart/tutorial.html. Zeppelin is particularly useful for interfacing with Apache Spark applications, as well as NoSQL components such as Apache Cassandra.
OLAP is still alive and well in the Hadoop ecosystem. For example, Apache Kylin (http://kylin.
apache.org) is an open source OLAP engine for use with Hadoop. Apache Kylin supports distributed
analytics, built-in security, and interactive query capabilities, including ANSI SQL support.
Apache Kylin depends on Apache Calcite (http://incubator.apache.org/projects/calcite.html) to
provide an “SQL core.”
To use Apache Calcite, make sure the following dependency is in your pom.xml file.
<dependency>
<groupId>org.apache.calcite</groupId>
<artifactId>calcite-core</artifactId>
<version>1.7.0</version>
</dependency>
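Calcite exposes its SQL core to applications through a standard JDBC driver. The following smoke-test sketch is our own and assumes only the calcite-core dependency above is on the classpath; a real application would attach schemas (JDBC, Cassandra, Spark, and so on) to the connection:

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.util.Properties;

public class CalciteSmokeTest {
    public static void main(String[] args) throws Exception {
        Properties info = new Properties();
        info.setProperty("lex", "JAVA");   // Java-style identifier casing and quoting

        // "jdbc:calcite:" opens an empty Calcite connection; schemas are normally
        // attached through a model file or programmatically via the root schema.
        try (Connection connection = DriverManager.getConnection("jdbc:calcite:", info)) {
            DatabaseMetaData metaData = connection.getMetaData();
            System.out.println(metaData.getDatabaseProductName() + " "
                    + metaData.getDatabaseProductVersion());
        }
    }
}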
curl -L -O http://search.maven.org/remotecontent?filepath=org/hsqldb/sqltool/2.3.2/sqltool-2.3.2.jar
and
curl -L -O http://search.maven.org/remotecontent?filepath=org/hsqldb/hsqldb/2.3.2/hsqldb-2.3.2.jar
on the command line. You should see an installation result similar to Figure 4-13. As you can see, Calcite is
compatible with many of the databases we have been talking about. Components for use with Cassandra,
Spark, and Splunk are available.
4.7 Summary
In this chapter, we discussed a variety of database types, available software libraries, and how to use the
databases in a distributed manner. It should be emphasized that there is a wide spectrum of database
technologies and libraries which can be used with Hadoop and Apache Spark. As we discussed, “glueware”
such as the Spring Data project, Spring Integration, and Apache Camel, are particularly important when
integrating BDA systems with database technologies, as they allow integration of distributed processing
technologies with more mainstream database components. The resulting synergy allows the constructed
system to leverage relational, NoSQL, and graph technologies to assist with implementation of business
logic, data cleansing and validation, reporting, and many other parts of the analytic life cycle.
We talked about two of the most popular graph query languages, Cypher and Gremlin, and looked at
some simple examples of these. We took a look at the Gremlin REPL to perform some simple operations
there.
When talking about graph databases, we focused on the Neo4j graph database because it is an easy-to-use, full-featured package. Please keep in mind, however, that there are several similar packages which are equally useful, including Apache Giraph (giraph.apache.org), TitanDB (http://thinkaurelius.github.io/titan/), OrientDB (http://orientdb.com/orientdb/), and Franz's AllegroGraph (http://franz.com/agraph/allegrograph/).
In the next chapter, we will discuss distributed data pipelines in more detail—their structure, necessary
toolkits, and how to design and implement them.
4.8 References
Hohpe, Gregor, and Woolf, Bobby. Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Boston, MA: Addison-Wesley Publishing, 2004.
Ibsen, Claus, and Strachan, James. Apache Camel in Action. Shelter Island, NY: Manning Publications, 2010.
Martella, Claudio, Logothetis, Dionysios, Shaposhnik, Roman. Practical Graph Analytics with Apache
Giraph. New York: Apress Media, 2015.
Pollack, Mark, Gerke, Oliver, Risberg, Thomas, Brisbin, John, and Hunger, Michael. Spring Data:
Modern Data Access for Enterprise Java. Sebastopol, CA: O’Reilly Media, 2012.
Raj, Sonal. Neo4J High Performance. Birmingham, UK: PACKT Publishing, 2015.
Redmond, Eric, and Wilson, Jim R. Seven Databases in Seven Weeks: A Guide to Modern Databases and
the NoSQL Movement. Raleigh, NC: Pragmatic Programmers, 2012.
Vukotic, Aleksa, and Watt, Nicki. Neo4j in Action. Shelter Island, NY: Manning Publications, 2015.
CHAPTER 5
Data Pipelines and How to Construct Them
In this chapter, we will discuss how to construct basic data pipelines using standard data sources and the
Hadoop ecosystem. We provide an end-to-end example of how data sources may be linked and processed
using Hadoop and other analytical components, and how this is similar to a standard ETL process. We will
develop the ideas presented in this chapter in more detail in Chapter 15.
Since we are going to begin developing the example system in earnest, a note about the package
structure of the example system is not out of place here. The basic package structure of the example
system developed throughout the book is shown in Figure 5-1, and it’s also reproduced in Appendix A.
Let’s examine what the packages contain and what they do briefly before moving on to data pipeline
construction. A brief description of some of the main sub-packages of the Probda system is shown in
Figure 5-2.
Figure 5-2. Brief description of the packages in the Probda example system
We can use standard off-the-shelf software components to implement this type of architecture.
We will use Apache Kafka, Beam, Storm, Hadoop, Druid, and Gobblin (the successor to Camus) to build our basic pipeline.
These basic elements may be used to construct pipelines with many different topologies, as in the example code in Listing 5-1.
@Test
@Category(RunnableOnService.class)
public void testCountWords() throws Exception {
Pipeline p = TestPipeline.create();
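// build the 'output' PCollection here by applying a test input source and the
// word-count transforms to the pipeline p; PAssert then verifies the result below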
PAssert.that(output).containsInAnyOrder(WORD_COUNT_ARRAY);
p.run().waitUntilFinish();
}
Figure 5-4. Successful Maven build of Apache Beam, showing the reactor summary
34.107222
LON-->Numeric
-117.273611
X_COORD-->Numeric
474764.37263
Y_COORD-->Numeric
3774078.43207
DBF files are typically used to represent standard database row-oriented data, such as that shown above. A typical method to read DBF files is shown in Listing 5-3.
maps.add(map);
i++;
}
reader.close();
} catch (IOException e){ e.printStackTrace(); }
catch (ParseException pe){ pe.printStackTrace(); }
System.out.println("Read DBF file: " + filename + " , with : " + maps.size() + " results...");
return maps;
}
Figure 5-6. Basic Python ecosystem, with a place for notebook-based visualizers
Figure 5-7. Initial installer diagram for the Anaconda Python system
We’ll be looking at several sophisticated visualization toolkits in chapters to come, but for now let us
start out with a quick overview of one of the more popular JavaScript-based toolkits, D3, which can be used
to visualize a wide variety of data sources and presentation types. These include geolocations and maps;
standard pie, line, and bar charting; tabular reports; and many others (custom presentation types, graph
database outputs, and more).
Once Anaconda is working correctly, we can proceed to installing another extremely useful toolkit,
TensorFlow. TensorFlow (https://www.tensorflow.org) is a machine learning library which also contains
support for a variety of “deep learning” techniques.
Figure 5-11. Sophisticated visualizations may be created using the Jupyter visualization feature.
5.7 Summary
In this chapter, we discussed how to build some basic distributed data pipelines as well as an overview of
some of the more effective toolkits, stacks, and strategies to organize and build your data pipeline. Among
these were Apache Tika, Gobblin, Spring Integration, and Apache Flink. We also installed Anaconda (which
makes the Python development environment much easier to set up and use), as well as an important
machine learning library, TensorFlow.
In addition, we took a look at a variety of input and output formats including the ancient but useful
DBF format.
In the next chapter, we will discuss advanced search techniques using Lucene and Solr, and introduce some interesting newer extensions of Lucene, such as Elasticsearch.
5.8 References
Lewis, N.D. Deep Learning Step by Step with Python. 2016. www.auscov.com
Mattmann, Chris, and Zitting, Jukka. Tika in Action. Shelter Island, NY: Manning Publications, 2012.
Zaccone, Giancarlo. Getting Started with TensorFlow. Birmingham, UK: PACKT Open Source Publishing, 2016.
CHAPTER 6
Advanced Search Techniques with Hadoop, Lucene, and Solr
In this chapter, we describe the structure and use of the Apache Lucene and Solr third-party search engine
components, how to use them with Hadoop, and how to develop advanced search capability customized for
an analytical application. We will also investigate some newer Lucene-based search frameworks, primarily
Elasticsearch, a premier search tool particularly well-suited towards building distributed analytic data
pipelines. We will also discuss the extended Lucene/Solr ecosystem and some real-world programming
examples of how to use Lucene and Solr in distributed big data analytics applications.
SolrCloud, a new addition to the Lucene/Solr technology stack, allows multicore processing with a
RESTful interface. To read more about SolrCloud, visit the information page at https://cwiki.apache.org/
confluence/display/solr/SolrCloud.
In this section we are going to take a brief overview of how to install Hadoop, Lucene/Solr, and NGData’s
Lily project and suggest some “quick start” techniques to get a Lily installation up and running for
development and test purposes.
First, install Hadoop. This is a download, unzip, configure, and run process similar to the many others
you have encountered in this book.
When you have successfully installed and configured Hadoop, and set up the HDFS file system, you should be able to execute some simple Hadoop commands such as
hadoop fs -ls /
After executing this, you should see a screen similar to the one in Figure 6-2.
Figure 6-2. Successful test of installation of Hadoop and the Hadoop Distributed File System (HDFS)
Second, install Solr. This is simply a matter of downloading the zip file from the Solr web site, uncompressing it, and cd'ing into the distribution, where you may then start the server immediately using the solr start script in the bin directory.
A successful installation of Solr can be tested as in Figure 6-3.
Third, download NGDATA's Lily project from the GitHub project at https://github.com/NGDATA/lilyproject.
Getting Hadoop, Lucene, Solr, and Lily to cooperate in the same software environment can be tricky, so
we include some tips on setting up the environment that you may have forgotten.
1. Make sure you can log in with ‘ssh’ without password. This is essential for Hadoop
to work correctly. It doesn’t hurt to exercise your Hadoop installation from time to
time, to ensure all the moving parts are working correctly. A quick test of Hadoop
functionality can be accomplished on the command line with just a few commands.
For example:
2. Make sure your environment variables are set correctly, and configure your init
files appropriately. This includes such things as your .bash_profile file, if you are on
MacOS, for example.
3. Test component interaction frequently. There are a lot of moving parts in distributed
systems. Perform individual tests to ensure each part is working smoothly.
4. Test interaction in standalone, pseudo-distributed, and full-distributed modes when
appropriate. This includes investigating suspicious performance problems, hang-
ups, unexpected stalls and errors, and version incompatibilities.
5. Watch out for version incompatibilities in your pom.xml, and perform good pom.
xml hygiene at all times. Make sure your infrastructure components such as Java,
Maven, Python, npm, Node, and the rest are up-to-date and compatible. Please
note: most of the examples in this book use Java 8 (and some examples rely on
the advanced features present in Java 8), as well as using Maven 3+. Use java -version and mvn -version when in doubt!
6. Perform “overall optimization” throughout your technology stack. This includes
at the Hadoop, Solr, and data source/sink levels. Identify bottlenecks and
resource problems. Identify “problem hardware,” particularly individual “problem
processors,” if you are running on a small Hadoop cluster.
7. Exercise the multicore functionality in your application frequently. It is rare you will
use a single core in a sophisticated application, so make sure using more than one
core works smoothly.
8. Perform integration testing religiously.
9. Performance monitoring is a must. Use a standard performance monitoring
“script” and evaluate performance based on previous results as well as current
expectations. Upgrade hardware and software as required to improve performance
results, and re-monitor to ensure accurate profiling.
10. Do not neglect unit tests. A good introduction to writing unit tests for current versions of
Hadoop can be found at https://wiki.apache.org/hadoop/HowToDevelopUnitTests.
2. Add the Katta environment variables to your .bash_profile file if you are running under MacOS, or the appropriate start-up file if you are running another Unix-like system. These variables include (please note these are examples only; substitute your own appropriate path values here):
export KATTA_HOME=/Users/kerryk/Downloads/katta-core-0.6.4
and add the binary of Katta to the PATH so you can call it directly:
export PATH=$KATTA_HOME/bin:$PATH
4. Successfully running the Katta component will produce results similar to those in Figure 6-4.
yourDownLoadDirectory/SacramentocrimeJanuary2006.csv
You will see a screen similar to the one in Figure 6-2 if your core creation is successful.
Modify the schema file schema.xml by adding the right fields to the end of the specification.
............
<!-- you will now add the field specifications for the cdatetime, address, district, beat, grid,
     crimedescr, ucr_ncic_code, latitude, and longitude fields found in the data file
     SacramentocrimeJanuary2006.csv -->
<field name="cdatetime" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="address" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- the previous fields were added to the schema.xml file. The field type definition for currency is shown below -->
</schema>
It’s easy to modify data by appending keys and additional data to the individual data lines of the CSV
file. Listing 6-1 is a simple example of such a CSV conversion program.
Modify the Solr data by adding a unique key and creation date to the CSV file.
The program to do this is shown in Listing 6-1. The file name will be com/apress/converter/csv/CSVConverter.java.
The program to add fields to the CSV data set needs little explanation. It reads an input CSV file line by
line, adding a unique ID and date field to each line of data. There are two helper methods within the class,
createInternalSolrDate() and getCSVField().
Within the CSV data file, the header and the first few rows appear as in Figure 6-7, as shown in Excel.
Figure 6-7. Crime data CSV file. This data will be used throughout this chapter.
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.LineNumberReader;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;
import java.util.logging.Logger;
/** Make a date Solr can understand from a regular Oracle-style date string.
*
* @param regularJavaDate
* @return
*/
public String createInternalSolrDate(String regularJavaDate){
if (regularJavaDate.equals("")||(regularJavaDate.equals("\"\""))){ return ""; }
/** Get a CSV field in a CSV string by numerical index. Doesn't care if there are
blank fields, but they count in the indices.
*
* @param s
* @param fieldnum
* @return
*/
public CSVConverter(){
LOGGER.info("Performing CSV conversion for SOLR input");
e.printStackTrace();
}
for (String line : contents){
String newLine = "";
}
try {
reader.close();
writer.close();
} catch (IOException e){ e.printStackTrace(); }
LOGGER.info("...CSV conversion complete...");
}
After setting up the CSV conversion program properly as described above, you can compile and run it by typing:
javac com/apress/converter/csv/CSVConverter.java
java com.apress.converter.csv.CSVConverter
Now that we've posted the data to the Solr core, we can examine the data set in the Solr dashboard. Go to localhost:8983 to do this. You should see a screen similar to the one in Figure 6-4.
Figure 6-9. Result of Solr query, showing the JSON output format
We can also evaluate data from the Solandra core we created earlier in the chapter in the same way.
Now select the crimedata0 core from the Core Selector drop-down. Click on query and change the
output format (‘wt’ parameter dropdown) to csv , so that you can see several lines of data at once. You will
see a data display similar to the one in Figure 6-9.
Figure 6-10. Result of Solr query using the dashboard (Sacramento crime data core)
Because of Solr’s RESTful interface, we can make queries either through the dashboard (conforming to
Lucene’s query syntax discussed earlier) or on the command line using the CURL utility.
Table 6-2. Feature comparison table of Elasticsearch features vs. Apache Solr features (features compared: JSON, XML, and CSV support; HTTP REST interface; JMX; client libraries; Lucene query parsing; self-contained cluster; sharding; visualization; web admin interface; and distributed operation)
Logstash (logstash.net) is a useful application for importing a variety of different kinds of data into Elasticsearch, including CSV-formatted files and ordinary "log format" files. Kibana (https://www.elastic.co/guide/en/kibana/current/index.html) is an open source visualization component which allows customizable dashboards to be built over Elasticsearch indexes. Together, Elasticsearch, Logstash, and Kibana form the so-called "ELK stack," which can be used to collect, search, and visualize data end to end. In this section, we'll look at a small example of the ELK stack in action.
Figure 6-11. The so-called “ELK stack”: Elasticsearch, Logstash, and Kibana visualization
Figure 6-12. ELK stack in use: Elasticsearch search engine/pipeline architecture diagram
Installing and trying out the ELK stack couldn't be easier. It is a familiar process if you have followed through the introductory chapters of the book so far. Follow the three steps below to install and test the ELK stack:
1. Download Elasticsearch from https://www.elastic.co/downloads/elasticsearch.
cd $ELASTICSEARCH_HOME/bin/
./elasticsearch
Figure 6-13. Successful start-up of the Elasticsearch server from the binary directory
Use the following Java program to import the crime data CSV file (or, with a little
modification, any CSV formatted data file you wish):
public static void main(String[] args)
{
System.out.println( "Import crime data" );
String originalClassPath = System.getProperty("java.class.path");
String[] classPathEntries = originalClassPath.split(";");
StringBuilder esClasspath = new StringBuilder();
for (String entry : classPathEntries) {
if (entry.contains("elasticsearch") || entry.contains("lucene")) {
esClasspath.append(entry);
esClasspath.append(";");
}
}
System.setProperty("java.class.path", esClasspath.toString());
System.setProperty("java.class.path", originalClassPath);
System.setProperty("es.path.home", "/Users/kerryk/Downloads/elasticsearch-2.3.1");
String file = "SacramentocrimeJanuary2006.csv";
Client client = null;
try {
client = TransportClient.builder().build()
.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localho
st"), 9300));
int i = 0;
// Read the CSV file line by line, skip the header row, and index each remaining
// row as a document (the index and type names used here are arbitrary choices).
// Requires java.io.BufferedReader and java.io.FileReader imports.
BufferedReader csvReader = new BufferedReader(new FileReader(file));
String csvLine;
while ((csvLine = csvReader.readLine()) != null) {
if (i > 0) {
String[] fields = csvLine.split(",");
String cdatetime = fields[0];
String address = fields[1];
String district = fields[2];
String beat = fields[3];
String grid = fields[4];
String crimedescr = fields[5];
String ucrnciccode = fields[6];
String latitude = fields[7];
String longitude = fields[8];
i++;
client.prepareIndex("crimedata", "crime")
.setSource(
jsonBuilder()
.startObject()
.field("cdatetime", cdatetime)
.field("address", address)
.field("district", district)
.field("beat", beat)
.field("grid", grid)
.field("crimedescr", crimedescr)
.field("ucr_ncic_code", ucrnciccode)
.field("latitude", latitude)
.field("longitude", longitude)
.field("entrydate", new Date())
.endObject())
.execute().actionGet();
} else {
System.out.println("Ignoring first line...");
i++;
}
}
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Run the program in Eclipse or from the command line. You will see a result similar to
the one in Figure 6-14. Please note that each row of the CSV is entered as a set
of fields into the Elasticsearch repository. You can also select the index name and
index type by changing the appropriate strings in the code example.
Figure 6-14. Successful test of an Elasticsearch crime database import from the Eclipse IDE
You can test the query capabilities of your new Elasticsearch setup by using ‘curl’
on the command line to execute some sample queries; a successful query against the crime index is shown in Figure 6-16.
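The same kind of query can also be issued programmatically. The sketch below reuses the TransportClient set up in the import program and assumes the "crime" index and type name used there; the field value queried for is just an example:

// (Requires imports for SearchResponse, QueryBuilders, SearchHit, and Client.)
static void queryCrimes(Client client) {
    // Run a match query against the "district" field of the imported crime documents.
    SearchResponse response = client.prepareSearch("crime")
        .setTypes("crime")
        .setQuery(QueryBuilders.matchQuery("district", "3"))
        .setSize(10)
        .execute().actionGet();
    for (SearchHit hit : response.getHits().getHits()) {
        System.out.println(hit.getSourceAsString());
    }
}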
Figure 6-15. You can see the schema update logged in the Elasticsearch console
Figure 6-16. Successful test of an Elasticsearch crime database query from the command line
After entering some text, you will see an echoed result similar to the one in Figure 6-17.
You will also need to set up a configuration file for use with Logstash. Follow the
directions in the Logstash documentation to make a configuration file such as the one shown in Listing 6-2.
Figure 6-17. Testing your Logstash installation. You can enter text from the command line.
# (Input section reconstructed: read test events from standard input, as in Figure 6-17.)
input {
  stdin { }
}
filter {
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
date {
match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
}
}
output {
elasticsearch { hosts => ["localhost:9200"] }
stdout { codec => rubydebug }
}
Figure 6-18. Successful start-up of the Kibana visualization component from its binary directory
You can easily run keyword or more complex queries interactively using the
Kibana dashboard, as shown in Figure 6-19.
Next, add the schema (index mapping) for the crime data to Elasticsearch using a cURL command that PUTs the mapping JSON to the crime index.
Notice the “location” field in particular, which has a geo_point-type definition. This
allows Kibana to place each document at a physical location on a map for visualization purposes,
as shown in Figure 6-21.
Figure 6-21 is a good example of understanding a complex data set at a glance. We can immediately
pick out the “high crime” areas in red.
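The cURL command itself is not reproduced in this excerpt. As a rough equivalent, the mapping could also be created programmatically with the ES 2.x Java admin client used earlier; the index and type name "crime" and the field layout here are assumptions:

// (Reuses the Client from the import program; requires a static import of XContentFactory.jsonBuilder.)
static void addCrimeMapping(Client client) throws IOException {
    // Declare the "location" field as a geo_point so Kibana can plot documents on a map.
    client.admin().indices().preparePutMapping("crime")
        .setType("crime")
        .setSource(jsonBuilder()
            .startObject()
              .startObject("crime")
                .startObject("properties")
                  .startObject("location").field("type", "geo_point").endObject()
                .endObject()
              .endObject()
            .endObject())
        .execute().actionGet();
}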
import java.io.IOException;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import org.elasticsearch.action.delete.DeleteResponse;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import static org.elasticsearch.index.query.QueryBuilders.fieldQuery;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;
import org.elasticsearch.search.SearchHit;
/**
*
* @author kerryk
*/
node.close();
}
public static Map<String, Object> put(String title, String content, Date postDate,
        String[] tags, String author, String communityName, String parentCommunityName){
    // (Reconstructed: build the field map for the document to be indexed.)
    Map<String, Object> jsonDocument = new HashMap<String, Object>();
    jsonDocument.put("title", title);
jsonDocument.put("content", content);
jsonDocument.put("postDate", postDate);
jsonDocument.put("tags", tags);
jsonDocument.put("author", author);
jsonDocument.put("communityName", communityName);
jsonDocument.put("parentCommunityName", parentCommunityName);
return jsonDocument;
}
public static void get(Client client, String index, String type, String id){
    // (Reconstructed: fetch the document by id before printing its metadata.)
    GetResponse getResponse = client.prepareGet(index, type, id).execute().actionGet();
    System.out.println("------------------------------");
    System.out.println("Index: " + getResponse.getIndex());
System.out.println("------------------------------");
Map<String,Object> result = hit.getSource();
System.out.println(result);
}
}
public static void delete(Client client, String index, String type, String id){
    // (Body reconstructed: remove the document and report its id.)
    DeleteResponse deleteResponse = client.prepareDelete(index, type, id).execute().actionGet();
    System.out.println("Deleted document with id: " + deleteResponse.getId());
}
Defining the CRUD operations for a search component is key to the overall architecture and logistics of
how the customized component will “fit in” with the rest of the system.
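A minimal sketch of such a CRUD contract is shown below; the interface name and method signatures are illustrative assumptions, not the book's actual component code:

// Hypothetical CRUD contract for a pluggable search component.
public interface SearchComponentCrud<T> {
    String create(T document);              // index a new document, returning its id
    T read(String id);                      // fetch a document by id
    boolean update(String id, T document);  // re-index an existing document
    boolean delete(String id);              // remove a document from the index
}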
To use Spring Data with Elasticsearch and Solr, add the following version properties and dependencies to your pom.xml:
<spring.data.elasticsearch.version>2.0.1.RELEASE</spring.data.elasticsearch.version>
<spring.data.solr.version>2.0.1.RELEASE</spring.data.solr.version>
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-elasticsearch</artifactId>
<version>2.0.1.RELEASE</version>
</dependency>
and
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-solr</artifactId>
<version>2.0.1.RELEASE</version>
</dependency>
We can now develop Spring Data-based code examples as shown in Listing 6-5 and Listing 6-6.
Listing 6-5. Spring Data code example using Solr (configuration classes)
File: SearchContext.java
@Configuration
@EnableSolrRepositories(basePackages = { "org.springframework.data.solr.showcase.product" },
multicoreSupport = true)
public class SearchContext {
@Bean
public SolrServer solrServer(@Value("${solr.host}") String solrHost) {
return new HttpSolrServer(solrHost);
    }
}
File: WebContext.java
import java.util.List;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.web.PageableHandlerMethodArgumentResolver;
import org.springframework.web.method.support.HandlerMethodArgumentResolver;
import org.springframework.web.servlet.config.annotation.ViewControllerRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurerAdapter;
/**
* @author kkoitzsch
*/
@Configuration
public class WebContext {
@Bean
public WebMvcConfigurerAdapter webMvcConfigurerAdapter() {
    // (Reconstructed from context: register view controllers and a pageable argument resolver.)
    return new WebMvcConfigurerAdapter() {
        @Override
        public void addViewControllers(ViewControllerRegistry registry) {
            registry.addViewController("/").setViewName("search");
registry.addViewController("/monitor").setViewName("monitor");
}
@Override
public void addArgumentResolvers(List<HandlerMethodArgumentResolver>
argumentResolvers) {
argumentResolvers.add(new
PageableHandlerMethodArgumentResolver());
}
};
}
}
Listing 6-6. Spring Data code example using Elasticsearch (unit test)
package com.apress.probda.search.elasticsearch;
import com.apress.probda.search.elasticsearch.Application;
import com.apress.probda.search.elasticsearch.Post;
import com.apress.probda.search.elasticsearch.Tag;
import com.apress.probda.search.elasticsearch.PostService;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.SpringApplicationConfiguration;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import java.util.Arrays;
import static org.hamcrest.CoreMatchers.notNullValue;
import static org.hamcrest.core.Is.is;
import static org.junit.Assert.assertThat;
@RunWith(SpringJUnit4ClassRunner.class)
@SpringApplicationConfiguration(classes = Application.class)
public class PostServiceImplTest{
@Autowired
private PostService postService;
@Autowired
private ElasticsearchTemplate elasticsearchTemplate;
@Before
public void before() {
elasticsearchTemplate.deleteIndex(Post.class);
elasticsearchTemplate.createIndex(Post.class);
elasticsearchTemplate.putMapping(Post.class);
elasticsearchTemplate.refresh(Post.class, true);
}
//@Test
public void testSave() throws Exception {
Tag tag = new Tag();
tag.setId("1");
tag.setName("tech");
Tag tag2 = new Tag();
tag2.setId("2");
tag2.setName("elasticsearch");
Post post = new Post();
post.setId("1");
post.setTitle("Bigining with spring boot application and elasticsearch");
post.setTags(Arrays.asList(tag, tag2));
postService.save(post);
assertThat(post.getId(), notNullValue());
Post post2 = new Post();
post2.setId("1");
post2.setTitle("Beginning with spring boot application");
post2.setTags(Arrays.asList(tag));
postService.save(post2);
assertThat(post2.getId(), notNullValue());
}
public void testFindOne() throws Exception {
}
@Test
public void testFindByTagsName() throws Exception {
Tag tag = new Tag();
tag.setId("1");
tag.setName("tech");
Tag tag2 = new Tag();
tag2.setId("2");
tag2.setName("elasticsearch");
assertThat(posts.getTotalElements(), is(1L));
assertThat(posts2.getTotalElements(), is(1L));
assertThat(posts3.getTotalElements(), is(0L));
}
}
Figure 6-22. NLP system architecture, using LingPipe, GATE, and NGDATA Lily
Natural language processing systems can be designed and built in a similar fashion to any other
distributed pipelining system. The only difference is the necessary adjustments for the particular nature of
the data and metadata itself. LingPipe, GATE, Vowpal Wabbit, and StanfordNLP allow for the processing,
parsing, and “understanding” of text, and packages such as Emir/Caliph, ImageTerrier, and HIPI provide
features to analyze and index image- and signal-based data. You may also wish to add packages to help
with geolocation, such as SpatialHadoop (http://spatialhadoop.cs.umn.edu), which is discussed in more
detail in Chapter 14.
Various input formats including raw text, XML, HTML, and PDF documents can be processed by GATE,
as well as relational data/JDBC-mediated data. This includes data imported from Oracle, PostgreSQL, and
others.
The Apache Tika import component might be implemented as in Listing 6-7.
Listing 6-7. Apache Tika import routines for use throughout the PROBDA System
package com.apress.probda.io;
import java.io.*;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import com.apress.probda.pc.AbstractProbdaKafkaProducer;
import org.apache.commons.lang3.StringUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.serialization.JsonMetadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.isatab.ISArchiveParser;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.dia.kafka.solr.consumer.SolrKafkaConsumer;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.JSONValue;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
// (Class declaration reconstructed; the original header line was lost in the page layout.)
public class ISAToolsKafkaProducer extends AbstractProbdaKafkaProducer {
/**
* Tag for specifying things coming out of LABKEY
*/
public final static String ISATOOLS_SOURCE_VAL = "ISATOOLS";
/**
* ISA files default prefix
*/
private static final String DEFAULT_ISA_FILE_PREFIX = "s_";
/**
* Json jsonParser to decode TIKA responses
*/
private static JSONParser jsonParser = new JSONParser();
/**
* Constructor
*/
public ISAToolsKafkaProducer(String kafkaTopic, String kafkaUrl) {
initializeKafkaProducer(kafkaTopic, kafkaUrl);
}
/**
* @param args
*/
public static void main(String[] args) throws IOException {
String isaToolsDir = null;
long waitTime = DEFAULT_WAIT;
String kafkaTopic = KAFKA_TOPIC;
String kafkaUrl = KAFKA_URL;
// get KafkaProducer
final ISAToolsKafkaProducer isatProd = new ISAToolsKafkaProducer(kafkaTopic, kafkaUrl);
DirWatcher dw = new DirWatcher(Paths.get(isaToolsDir));
/**
* Checks for files inside a folder
*
* @param innerFolder
* @return
*/
public static List<String> getFolderFiles(File innerFolder) {
List<String> folderFiles = new ArrayList<String>();
String[] innerFiles = innerFolder.list(new FilenameFilter() {
public boolean accept(File dir, String name) {
if (name.startsWith(DEFAULT_ISA_FILE_PREFIX)) {
return true;
}
return false;
}
    });
    // (Completion reconstructed: collect the matching file names and return them.)
    if (innerFiles != null) {
        for (String innerFile : innerFiles) {
            folderFiles.add(innerFolder.getAbsolutePath() + File.separator + innerFile);
        }
    }
    return folderFiles;
}
/**
* Performs the parsing request to Tika
*
* @param files
* @return a list of JSON objects.
*/
public static List<JSONObject> doTikaRequest(List<String> files) {
List<JSONObject> jsonObjs = new ArrayList<JSONObject>();
try {
Parser parser = new ISArchiveParser();
StringWriter strWriter = new StringWriter();
// (Parsing loop elided in the original listing: each file is parsed with the ISArchiveParser,
// its Metadata is serialized to JSON into strWriter, and the resulting JSONObject is collected.)
jsonObjs.add(jsonObject);
strWriter.getBuffer().setLength(0);
}
strWriter.flush();
strWriter.close();
} catch (IOException e) {
e.printStackTrace();
} catch (ParseException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (TikaException e) {
e.printStackTrace();
}
return jsonObjs;
}
invNames.addAll(((JSONArray) JSONValue.parse(entry.getValue().
toString())));
} else if (jsonKey.equals("Study_Person_Mid_Initials")) {
invMid.addAll(((JSONArray) JSONValue.parse(entry.getValue().
toString())));
} else if (jsonKey.equals("Study_Person_Last_Name")) {
invLastNames.addAll(((JSONArray) JSONValue.parse(entry.getValue().
toString())));
}
jsonKey = solrKey;
} else {
jsonKey = jsonKey.replace(" ", "_");
}
jsonObject.put(jsonKey, entry.getValue());
}
/**
* Send message from IsaTools to kafka
*
* @param newISAUpdates
*/
void sendISAToolsUpdates(List<JSONObject> newISAUpdates) {
for (JSONObject row : newISAUpdates) {
row.put(SOURCE_TAG, ISATOOLS_SOURCE_VAL);
this.sendKafka(row.toJSONString());
System.out.format("[%s] New message posted to kafka.\n", this.getClass().
getSimpleName());
}
}
/**
* Gets the application updates from a directory
*
* @param isaToolsTopDir
* @return
*/
private List<JSONObject> initialFileLoad(String isaToolsTopDir) {
System.out.format("[%s] Checking in %s\n", this.getClass().getSimpleName(),
isaToolsTopDir);
List<JSONObject> jsonParsedResults = new ArrayList<JSONObject>();
List<File> innerFolders = getInnerFolders(isaToolsTopDir);
return jsonParsedResults;
}
/**
* Gets the inner folders inside a folder
*
* @param isaToolsTopDir
* @return
*/
private List<File> getInnerFolders(String isaToolsTopDir) {
List<File> innerFolders = new ArrayList<File>();
File topDir = new File(isaToolsTopDir);
String[] innerFiles = topDir.list();
for (String innerFile : innerFiles) {
File tmpDir = new File(isaToolsTopDir + File.separator + innerFile);
if (tmpDir.isDirectory()) {
innerFolders.add(tmpDir);
}
}
return innerFolders;
}
}
1. First install LingPipe by downloading the LingPipe release JAR file from http://
alias-i.com/lingpipe/web/download.html. You may also download LingPipe
models that interest you from http://alias-i.com/lingpipe/web/models.html.
Follow the directions to place the models in the correct directory so that
LingPipe can pick them up for the demos that require them.
2. Download GATE from University of Sheffield web site (https://gate.ac.uk), and
use the installer to install GATE components. The installation dialog is quite easy
to use and allows you to selectively install a variety of components, as shown in
Figure 6-24.
3. We will also introduce the StanfordNLP (http://stanfordnlp.github.io/
CoreNLP/#human-languages-supported) library component for our example.
To get started with Stanford NLP, download the CoreNLP zip file from the GitHub link
above. Expand the zip file.
Make sure the following dependencies are in your pom.xml file:
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.5.2</version>
<classifier>models</classifier>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.5.2</version>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-parser</artifactId>
<version>3.5.2</version>
</dependency>
Go to the Stanford NLP “home directory” (where the pom.xml file is located) and
then test the interactive NLP shell to ensure correct behavior. Type
./corenlp.sh
to start the interactive NLP shell. Type some sample text into the shell to see the parser in
action. The results shown will be similar to those shown in Figure 6-17.
Two different method signatures for search() are present. One is specifically for the field and query
combination. Query is the Lucene query as a string, and maximumResultCount limits the number of result
elements to a manageable amount.
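The interface itself is not reproduced in this excerpt; a minimal sketch consistent with that description (the names and the result type are assumptions) might look like:

import java.util.List;

public interface ProbdaSearchEngine<T> {
    // Search a single named field with a Lucene query string.
    List<T> search(String field, String query, int maximumResultCount);
    // Search the default field(s) with a Lucene query string.
    List<T> search(String query, int maximumResultCount);
}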
We can define the implementation of the ProbdaSearchEngine interface as in Listing 6-8.
Figure 6-24. GATE Installation dialog. GATE is easy to install and use.
Simply click through the installation wizard. Refer to the web site and install all software components
offered.
To use LingPipe and GATE in a program, let’s work through a simple example, as shown in Listing 6-9.
Please refer to some of the references at the end of the chapter to get a more thorough overview of the
features that LingPipe and GATE can provide.
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;
// (Setup reconstructed: the original listing's pipeline construction was lost to the page layout.)
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation("The quick brown fox jumped over the lazy dog.");
PrintWriter out = new PrintWriter(System.out, true);
PrintWriter xmlOut = null; // assign a file-backed writer to also emit XML output
pipeline.annotate(annotation);
pipeline.prettyPrint(annotation, out);
if (xmlOut != null) {
pipeline.xmlPrint(annotation, xmlOut);
}
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
if (sentences != null && sentences.size() > 0) {
CoreMap sentence = sentences.get(0);
Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
out.println();
out.println("The first sentence parsed is:");
tree.pennPrint(out);
}
}
6.8 Summary
In this chapter, we took a quick overview of the Apache Lucene and Solr ecosystem. Interestingly, although
Hadoop and Solr started out together as part of the Lucene ecosystem, they have since gone their separate
ways and evolved into useful independent frameworks. This doesn’t mean that the Solr and Hadoop
ecosystems cannot work together very effectively, however. Many components, such as Apache Mahout,
LingPipe, GATE, and Stanford NLP, work seamlessly with Lucene and Solr. New technology additions
to Solr, such as SolrCloud and others, make it easier to use RESTful APIs to interface to the Lucene/Solr
technologies.
We worked through a complete example of using Solr and its ecosystem: from downloading, massaging,
and inputting the data set to transforming the data and outputting results in a variety of data formats. It
becomes even more clear that Apache Tika and Spring Data are extremely useful for data pipeline “glue.”
We did not neglect competitors to the Lucene/Solr technology stack. We were able to discuss
Elasticsearch, a strong alternative to Lucene/Solr, and describe some of the pros and cons of using
Elasticsearch over a more “vanilla Lucene/Solr” approach. One of the most interesting parts of Elasticsearch
is the seamless ability to visualize data, as we showed while exploring the crime statistics of Sacramento.
In the next chapter, we will discuss a number of analytic techniques and algorithms which are
particularly useful for building distributed analytical systems, building upon what we’ve learned so far.
6.9 References
Awad, Mariette and Khanna, Rahul. Efficient Learning Machines. New York: Apress Open Publications, 2015.
Babenko, Dmitry and Marmanis, Haralambos. Algorithms of the Intelligent Web. Shelter Island, NY: Manning Publications, 2009.
Guller, Mohammed. Big Data Analytics with Apache Spark. New York: Apress Press, 2015.
Karambelkar, Hrishikesh. Scaling Big Data with Hadoop and Solr. Birmingham, UK: PACKT Publishing,
2013.
Konchady, Manu. Building Search Applications: Lucene, LingPipe and GATE. Oakton, VA: Mustru Publishing, 2008.
Mattmann, Chris A. and Zitting, Jukka I. Tika in Action. Shelter Island: Manning Publications, 2012.
Pollack, Mark, Gierke, Oliver, Risberg, Thomas, Brisbin, Jon, Hunger, Michael. Spring Data: Modern
Data Access for Enterprise Java. Sebastopol, CA: O’Reilly Media, 2012.
Venner, Jason. Pro Hadoop. New York NY: Apress Press, 2009.
PART II
The second part of our book discusses standard architectures, algorithms, and techniques to build
analytic systems using Hadoop. We also investigate rule-based systems for control, scheduling,
and system orchestration and showcase how a rule-based controller can be a natural adjunct to a
Hadoop-based analytical system.
CHAPTER 7
An Overview of Analytical
Techniques and Algorithms
In this chapter, we provide an overview of four categories of algorithm (statistical, Bayesian, ontology-driven,
and hybrid) that leverage the more basic algorithms found in standard libraries to perform
more in-depth and accurate analyses using Hadoop.
Statistical and numerical algorithms are the most straightforward type of distributed algorithm we can use.
Statistical techniques include the use of standard statistic computations such as those shown in
Figure 7-1.
Figure 7-1. Mean, standard deviation, and normal distribution are often used in statistical methods
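As a concrete (if tiny) illustration, the statistics in Figure 7-1 can be computed with the Apache Commons Math library; using Commons Math here is an assumption for illustration, and any statistics library would do:

import org.apache.commons.math3.distribution.NormalDistribution;
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public class BasicStats {
    public static void main(String[] args) {
        double[] samples = { 4.0, 8.0, 15.0, 16.0, 23.0, 42.0 };
        DescriptiveStatistics stats = new DescriptiveStatistics();
        for (double v : samples) {
            stats.addValue(v);
        }
        double mean = stats.getMean();
        double sd = stats.getStandardDeviation();
        System.out.println("mean = " + mean + ", standard deviation = " + sd);
        // Density of the fitted normal distribution evaluated at the mean.
        NormalDistribution normal = new NormalDistribution(mean, sd);
        System.out.println("density at mean = " + normal.density(mean));
    }
}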
Bayesian techniques are among the most effective approaches for building classifiers, modeling data,
and many other purposes.
Ontology-driven algorithms, on the other hand, are a whole family of algorithms that rely on logical,
structured, hierarchical modeling, grammars, and other techniques to provide infrastructure for modeling,
data mining, and drawing inferences about data sets.
Hybrid algorithms combine one or more modules consisting of different types of algorithm, linked
together with glueware, to provide a more flexible and powerful data pipeline than would be possible
with only a single algorithm type. For example, a neural net technology may be combined with a Bayesian
technology and an ML technology to create “learning Bayesian networks,” a very interesting example of the
synergy that can be obtained by using a hybrid approach.
We can see a Tachyon-centric technology stack in Figure 7-4. Tachyon (since renamed Alluxio) is a fault-tolerant, distributed
in-memory file system.
Figure 7-4. A Tachyon-centric technology stack, showing some of the associated ecosystem
7.3 Bayesian Techniques
The Bayesian techniques we implement in the example system are found in the package com.apress.probda.algorithms.bayes.
Some of the Bayesian techniques (besides the naïve Bayes algorithm) supported by our most popular
libraries include the ones shown in Figure 7-1.
The naïve Bayesian classifier is based upon the Fundamental Bayes equation as shown in Figure 7-5.
The equation contains four main probability types: posterior probability, likelihood, class prior
probability, and predictor prior probability. These terms are explained in the references at the end of the
chapter.
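Written out, the naive Bayes relationship between those four quantities, for a class c and a feature vector x = (x_1, ..., x_n), is:

P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}, \qquad P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c)

Here P(c | x) is the posterior probability, P(x | c) the likelihood, P(c) the class prior probability, and P(x) the predictor prior probability (the evidence); the product form expresses the "naive" assumption that the features are conditionally independent given the class.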
We can try out the Mahout text classifier in a straightforward way. First, download one of the basic data
sets to test with.
For ontology-driven processing, the Protégé libraries can be added to the project with the following Maven dependency:
<dependency>
<groupId>edu.stanford.protege</groupId>
<artifactId>protege-common</artifactId>
<version>5.0.0-beta-24</version>
</dependency>
The Protégé desktop editor itself can be downloaded from http://protege.stanford.edu/products.php#desktop-protégé.
Ontologies may be defined interactively by using an ontology editor such as Stanford’s Protégé system,
as shown in Figure 7-5.
Figure 7-6. Setting up SPARQL functionality with the Stanford toolkit interactive setup
You can safely select all the components, or just the ones you need. Refer to the individual online
documentation pages to see if the components are right for your application.
Figure 7-7. Using an ontology editor to define ontologies, taxonomies, and grammars
7.6 Code Examples
In this section we discuss some extended examples of the algorithm types we talked about in earlier
sections.
To get a sense of some algorithm comparisons, let’s use the movie dataset to evaluate some of the
algorithms and toolkits we’ve talked about.
package com.apress.probda.datareader.csv;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStreamWriter;
// (Class declaration reconstructed; the original class name was lost in the page layout and is illustrative.)
public class MovieDataConverter {
/**
* This routine splits a line which is delimited into fields by the vertical
* bar symbol '|'
*
* @param l
* @return
*/
public static String makeComponentsList(String l) {
String[] genres = l.split("\\|");
StringBuffer sb = new StringBuffer();
for (String g : genres) {
sb.append("\"" + g + "\",");
}
String answer = sb.toString();
return answer.substring(0, answer.length() - 1);
}
/**
* The main routine processes the standard movie data files so that mahout
* can use them.
*
* @param args
*/
public static void main(String[] args) {
if (args.length < 4){
System.out.println("Usage: <movie data input><movie output file><ratings input file>
<ratings output file>");
System.exit(-1);
}
File file = new File(args[0]);
if (!file.exists()) {
System.out.println("File: " + file + " did not exist, exiting...");
System.exit(-1);
}
System.out.println("Processing file: " + file);
BufferedWriter bw = null;
FileOutputStream fos = null;
String line;
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
int i = 1;
File fout = new File(args[1]);
fos = new FileOutputStream(fout);
bw = new BufferedWriter(new OutputStreamWriter(fos));
while ((line = br.readLine()) != null) {
String[] components = line.split("::");
String number = components[0].trim();
String[] titleDate = components[1].split("\\(");
String title = titleDate[0].trim();
String date = titleDate[1].replace(")", "").trim();
// (Remainder reconstructed: write one delimited line per movie so that Mahout can read it.
// The ratings-file handling implied by args[2] and args[3] is not shown in this excerpt.)
String genres = makeComponentsList(components[2]);
bw.write(number + "," + title + "," + date + "," + genres);
bw.newLine();
i++;
}
bw.close();
System.out.println("Wrote movie records to: " + args[1]);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Figure 7-8. Importing a standard movie data set example using a CURL command
Data sets can be imported into Elasticsearch via the command line using a CURL command. Figure 7-8
is the result of executing such a command. The Elasticsearch server returns a JSON data structure which is
displayed on the console as well as being indexed into the Elasticsearch system.
We can see a simple example of using Kibana as a reporting tool in Figure 7-7. Incidentally, we will
encounter Kibana and the ELK Stack (Elasticsearch – Logstash – Kibana) throughout much of the remaining
content in this book. While there are alternatives to using the ELK stack, it is one of the more painless ways to
construct a data analytics system from third-party building blocks.
7.7 Summary
In this chapter, we discussed analytical techniques and algorithms and some criteria for evaluating
algorithm effectiveness. We touched on some of the older algorithm types: the statistical and numerical
analytical functions. The combination or hybrid algorithm has become particularly important in recent
days as techniques from machine learning, statistics, and other areas may be used very effectively in
a cooperative way, as we have seen throughout this chapter. For a general introduction to distributed
algorithms, see Barbosa (1996).
Many of these algorithm types are extremely complex. Some of them, for example the Bayesian
techniques, have a whole literature devoted to them. For a thorough explanation of Bayesian techniques in
particular, and probabilistic techniques in general, see Zadeh (1992).
In the next chapter, we will discuss rule-based systems, available rule engine systems such as JBoss
Drools, and some of the applications of rule-based systems for smart data collection, rule-based analytics,
and data pipeline control scheduling and orchestration.
7.8 References
Barbosa, Valmir C. An Introduction to Distributed Algorithms. Cambridge, MA: MIT Press, 1996.
Bolstad, William M. Introduction to Bayesian Statistics. New York: Wiley Inter-Science,
Wiley and Sons, 2004.
Giacomelli, Pico. Apache Mahout Cookbook. Birmingham, UK: PACKT Publishing, 2013.
Gupta, Ashish. Learning Apache Mahout Classification. Birmingham, UK: PACKT Publishing, 2015.
Marmanis, Haralambos and Babenko, Dmitry. Algorithms of the Intelligent Web. Greenwich, CT:
Manning Publications, 2009.
Nakhaeizadeh, G. and Taylor, C.C. (eds). Machine Learning and Statistics: The Interface.
New York: John Wiley and Sons, Inc., 1997.
Parsons, Simon. Qualitative Methods for Reasoning Under Uncertainty. Cambridge, MA: MIT Press,
2001.
Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo,
CA: Morgan-Kaufmann Publishers, Inc., 1988.
Zadeh, Lotfi A. and Kacprzyk, Janusz (eds.). Fuzzy Logic for the Management of Uncertainty. New York: John
Wiley & Sons, Inc., 1992.
CHAPTER 8
In this chapter, we describe the JBoss Drools rule engine and how it may be used to control and orchestrate
Hadoop analysis pipelines. We describe an example rule-based controller which can be used for a variety of
data types and applications in combination with the Hadoop ecosystem.
■■Note Most of the configuration for using the JBoss Drools system is done using Maven dependencies. The
appropriate dependencies were shown in Chapter 3 when we discussed the initial setup of JBoss Drools. All the
dependencies you need to effectively use JBoss Drools are included in the example PROBDA system available
at the code download site.
Figure 8-1. Download the Drools software from the Drools web page
■■Note This book uses the latest released version of JBoss Drools, which was version 6.4.0 at the time of
writing this book. Update the drools.system.version property in your PROBDA project pom.xml if a new version
of JBoss Drools is available and you want to use it.
Let’s get started by installing JBoss Drools and testing some basic functionality. The installation process
is straightforward. From the JBoss Drools homepage, download the current version of Drools by clicking the
download button, as shown in Figure 8-1.
cd to the installation directory and run examples/run-examples.sh. You will see a selection menu
similar to that in Figure 8-2. Run some of the examples to test the Drools system and observe the output in
the console, similar to that in Figure 8-3, or run a GUI-oriented example, as in Figure 8-4.
Figure 8-2. Select some Drools examples and observe the results to test the Drools system
The built-in Drools examples application has a menu from which you can select different test examples, as shown
in Figure 8-2. This is a good way to test your overall system set-up and get a sense of what the JBoss Drools
system capabilities are.
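Once the examples run, wiring Drools into your own code takes only a few lines. The sketch below assumes a default KieSession is declared in the project's kmodule.xml; the class name and the inserted fact are illustrative, not the book's controller code:

import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class DroolsSmokeTest {
    public static void main(String[] args) {
        // Boot the rule engine from the classpath-defined knowledge base.
        KieServices kieServices = KieServices.Factory.get();
        KieContainer kieContainer = kieServices.getKieClasspathContainer();
        KieSession kieSession = kieContainer.newKieSession();
        // Insert a trivial fact and fire whatever rules match it.
        kieSession.insert("hello drools");
        int rulesFired = kieSession.fireAllRules();
        System.out.println("Rules fired: " + rulesFired);
        kieSession.dispose();
    }
}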
Some of the example components for JBoss Drools have an associated UI, as shown in Figure 8-3.
■■Note All of the example code found in this system is found in the accompanying example system
code base in the Java package com.apress.probda.rulesystem. Please see the associated README file and
documentation for additional notes on installation, versioning, and use.
The interface for timestamped Probda events in our system couldn’t be easier:
package com.probda.rulesystem.cep.model;
import java.util.Date;
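The body of the listing was lost to the page layout; continuing from the package and import statements above, a minimal sketch of such a timestamped event interface (the method names are assumptions) might be:

public interface IEvent {
    // The moment the event occurred.
    Date getTimestamp();
    // A unique identifier for the event.
    String getId();
    // The event payload, for example a JSON document.
    String getPayload();
}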
Figure 8-5. Rule-based software systems control architecture, using JBoss Drools as a controller
Figure 8-6 shows what you can expect from the Maven reactor summary at the end of the Activiti build.
export TOMCAT_HOME=/usr/local/Cellar/tomcat/8.5.3
cd $ACTIVITI_HOME/scripts
./start-rest-no-jrebel.sh
A picture of the Activiti Explorer dashboard being run successfully is shown in Figure 8-8.
Figure 8-9. An initial Lucene-oriented system design, including user interactions and document processing
Figure 8-10. A Lucene-oriented system design, including user interactions and document processing, step 2
Figure 8-11. A Lucene-oriented system design, including user interactions and document processing, step 3
Figure 8-12. An integrated system architecture with lifecycle, including the technology components used
■■Note Please note you are under no obligation to use the technology components shown in Figure 8-12. You
could use an alternative messaging component such as RabbitMQ, for example, instead of Apache Kafka, or
MongoDB instead of Cassandra, depending upon your application requirements.
8.5 Summary
In this chapter, we discussed using rule-based controllers with other distributed components, especially
with Hadoop and Spark ecosystem components. We have seen that a rule-based strategy can add a key
ingredient to distributed data analytics: the ability to organize and control data flow in a flexible and logically
organized manner. Scheduling and prioritization is a natural consequence of these rule-based techniques,
and we looked at some examples of rule-based schedulers throughout the chapter.
In the next chapter, we will talk about combining the techniques we have learned so far into one integrated
analytical component which is applicable to a variety of use cases and problem domains.
8.6 References
Amador, Lucas. Drools Developer Cookbook. Birmingham, UK: PACKT Publishing, 2012.
Bali, Michal. Drools JBoss Rules 5.0 Developers Guide. Birmingham, UK: PACKT Publishing, 2009.
Browne, Paul. JBoss Drools Business Rules. Birmingham, UK: PACKT Publishing, 2009.
Norvig, Peter. Paradigms of Artificial Intelligence: Case Studies in Common Lisp. San Mateo, CA:
Morgan-Kaufman Publishing, 1992.
CHAPTER 9
In this chapter, we describe an end-to-end design example, using many of the components discussed so
far. We also discuss “best practices” to use during the requirements acquisition, planning, architecting,
development, testing, and deployment phases of the system development project.
■■Note This chapter makes use of many of the software components discussed elsewhere throughout the
book, including Hadoop, Spark, Splunk, Mahout, Spring Data, Spring XD, Samza, and Kafka. Check Appendix A
for a summary of the components and insure that you have them available when trying out the examples from
this chapter.
Building a complete distributed analytical system is easier than it sounds. We have already discussed
many of the important ingredients for such a system in earlier chapters. Once you understand what your
data sources and sinks are going to be, and you have a reasonably clear idea of the technology stack to be
used and the “glueware” to be leveraged, writing the business logic and other processing code can become a
relatively straightforward task.
A simple end-to-end architecture is shown in Figure 9-1. Many of the components shown allow some
leeway as to what technology you actually use for data source, processors, data sinks and repositories, and
output modules, which include the familiar dashboards, reports, visualizations, and the like that we will see
in other chapters. In this example, we will use the familiar importing tool Splunk to provide an input source.
In the following section we will describe how to set up and integrate Splunk with the other components
of our example system.
Splunk (https://www.splunk.com) is a platform for collecting, indexing, and searching log and machine data, and it is very easy to download, install, and
use. It comes with a number of very useful features for the kind of example analytics systems we’re
demonstrating here, including a built-in search facility.
To install Splunk, go to the download web page, create a user account, and download Splunk Enterprise
for your appropriate platform. All the examples shown here are using the MacOS platform.
Install the Splunk Enterprise appropriately for your chosen platform. On the Mac platform, if the
installation is successful, you will see Splunk represented in your Applications directories as shown in
Figure 9-2.
Refer to http://docs.splunk.com/Documentation/Splunk/6.4.2/SearchTutorial/StartSplunk
on how to start Splunk. Please note that the Splunk Web Interface can be found at
http://localhost:8000 when started correctly.
When you point your browser at localhost:8000, you’ll initially see the Splunk login page. Use the
default user name and password to begin with, change them as instructed, and make sure the Java code you use
for connectivity uses your updated username (‘admin’) and password (‘changename’).
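A quick way to verify that connectivity from Java is the Splunk SDK for Java; note that this is a separate library from splunk-library-javalogging and is used here only as an illustrative assumption. The credentials match those mentioned above, and 8089 is Splunk's default management port:

import com.splunk.Service;
import com.splunk.ServiceArgs;

public class SplunkConnectivityCheck {
    public static void main(String[] args) {
        ServiceArgs loginArgs = new ServiceArgs();
        loginArgs.setUsername("admin");
        loginArgs.setPassword("changename");
        loginArgs.setHost("localhost");
        loginArgs.setPort(8089);
        // Connect and print the server version as a simple sanity check.
        Service service = Service.connect(loginArgs);
        System.out.println("Connected to Splunk version: " + service.getInfo().getVersion());
    }
}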
cd splunk-library-javalogging
In your Eclipse IDE, import the existing Maven project as shown in Figure 9-5.
Figure 9-5 shows a dialog for importing the existing Maven project to use splunk-library-javalogging.
Figure 9-7. Select the appropriate root directory for Maven construction
As shown in Figure 9-7, selection of the appropriate pom.xml is all you need to do in this step.
As shown in Figure 9-8, modification to include appropriate username and password values is typically
all that is necessary for this step of installation.
Figure 9-10. Searching for Pro Data Analytics events in the Splunk dashboard
Textual search in the Splunk dashboard can be accomplished as in Figure 9-10. We can also select an
appropriate timestamped interval to perform queries over our data set.
Visualization is an important part of this integration process. Check out some of the D3 references at the
bottom of this chapter to get a sense of some of the techniques you can use in combination with the other
components of the data pipeline.
9.1 Summary
In this chapter, we discussed building a complete analytical system and some of the challenges architects
and developers encounter upon the way. We constructed a complete end-to-end analytics pipeline using
the now-familiar technology components discussed in earlier chapters. In particular, we talked about how to
use Splunk as an input data source. Splunk is a particularly versatile and flexible tool for all kinds of generic
logging events.
9.2 References
Mock, Derek, Johnson, Paul R., Diakun, Josh. Splunk Operational Intelligence Cookbook. Birmingham, UK:
PACKT Publishing, 2014.
Zhu, Nick Qi. Data Visualization with d3.js Cookbook. Birmingham, UK: PACKT Publishing, 2014.
PART III
The third part of our book describes the component parts and associated libraries which can assist
us in building distributed analytic systems. This includes components based on a variety of different
programming languages, architectures, and data models.
CHAPTER 10
In this chapter, we will talk about how to look at—to visualize—our analytical results. This is actually quite
a complex process, or it can be. It’s all a matter of choosing an appropriate technology stack for the kind
of visualizing you need to do for your application. The visualization task in an analytics application can
range from creating simple reports to full-fledged interactive systems. In this chapter we will primarily be
discussing Angular JS and its ecosystem, including the ElasticUI visualization tool Kibana, as well as other
visualization components for graphs, charts, and tables, including some JavaScript-based tools like D3.js
and sigma.js.
10.1 Simple Visualizations
One of the simplest visualization architectures is shown in Figure 10-1. The front-end control interface
may be web-based, or a stand-alone application. The control UI may be based on a single web page, or a
more developed software plug-in or multiple page components. “Glueware” on the front end might involve
visualization frameworks such as Angular JS, which we will discuss in detail in the following sections. On the
back end, glueware such as Spring XD can make interfacing to a visualizer much simpler.
Let’s talk briefly about the different components in Figure 10-1. Each circle represents different facets of
typical use cases when using an analytics software component. You might think of the circles as individual
sub-problems or issues we are trying to solve. For example, grouping, sorting, merging, and collating might
be handled by a standard tabular structure, such as the one shown in Figure 10-2. Most of the sorting and
grouping problems are solved with built-in table functionality like clicking on a column to sort rows, or to
group items.
Providing effective display capabilities can be as simple as selecting an appropriate tabular component
to use for row-oriented data. A good example of a tabular component which provides data import, sorting,
pagination, and easily programmable features is the one shown in Figure 10-2. This component is available
at https://github.com/wenzhixin/bootstrap-table. The control shown here leverages a helper library
called Bootstrap.js (http://getbootstrap.com/javascript/) to provide the advanced functionality. Being
able to import JSON data sets into a visualization component is a key feature which enables seamless
integration with other UI and back-end components.
Figure 10-2. One tabular control can solve several visualization concerns
Many of the concerns found in Figure 10-1 can be controlled by front-end controls we embed in a web
page. For example, we are all familiar with the “Google-style” text search mechanism, which consists of
just a text field and a button. We can implement a visualization tool using d3 that does simple analytics on
Facebook tweets as an introduction to data visualization. As shown in Figure 10-2 and Figure 10-3, we can
control the “what” of the display as well as the “how”: we can see a pie chart, bar chart, and bubble chart
version of the sample data set, which is coming from a Spring XD data stream.
Figure 10-3. Simple data visualization example of Twitter tweets using Spring XD showing trending topics
and languages
Most of the concerns we see in Figure 10-1 (data set selection, presentation type selection, and the
rest) are represented in Figure 10-3 and Figure 10-4. Standard controls, such as drop-down boxes, are used
to select data sets and presentation types. Presentation types may include a wide range of graph and chart
types, two- and three-dimensional display, and other types of presentation and report formats. Components
such as Apache POI (https://poi.apache.org) may be used to write report files in Microsoft formats
compatible with Excel.
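As a quick illustration (a sketch, not code from the book's repository), writing a small Excel-compatible report with POI might look like this; the file name and the sample values are placeholders:

import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class TrendReportWriter {
    public static void main(String[] args) throws Exception {
        XSSFWorkbook workbook = new XSSFWorkbook();
        Sheet sheet = workbook.createSheet("trending");
        // Header row followed by one sample data row; real values would come from the analytics stream.
        Row header = sheet.createRow(0);
        header.createCell(0).setCellValue("topic");
        header.createCell(1).setCellValue("count");
        Row row = sheet.createRow(1);
        row.createCell(0).setCellValue("hadoop");
        row.createCell(1).setCellValue(42);
        FileOutputStream out = new FileOutputStream("trending-report.xlsx");
        workbook.write(out);
        out.close();
        workbook.close();
    }
}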
The display shown here dynamically updates as new tweet data arrives through the Spring XD data
streams. Figure 10-4 shows a slightly different visualization of the tweet data, in which we can see how some
circles grow in size, representing the data “trending” in Twitter.
Figure 10-4. An additional simple data visualization example of Twitter tweets using Spring XD
We’ll discuss Spring XD in the next section, as it is particularly useful as glueware when building visualizers.
Setting up the Spring XD component, like all the Spring Framework components, is basically straightforward.
After installing Spring XD, start Spring XD in “single node mode” with
bin/xd-singlenode
cd bin
./xd-shell
Figure 10-7. Using Spring XD to implement a Twitter tweet stream and then deploy the stream
In the next section we will go into some comprehensive examples of a particularly useful toolkit,
Angular JS.
This will create the directories and files shown in Listing 10-2. cd to the directory and make sure they
are really there.
./pom.xml
./src
./src/main
./src/main/resources
./src/main/webapp
./src/main/webapp/index.jsp
./src/main/webapp/WEB-INF
./src/main/webapp/WEB-INF/web.xml
Construct the new files and directories to configure the project, as shown in Listing 10-3.
mkdir -p src/main/java
mkdir -p src/test/java
mkdir -p src/test/javascript/unit
mkdir -p src/test/javascript/e2e
mkdir -p src/test/resources
rm -f ./src/main/webapp/WEB-INF/web.xml
rm -f ./src/main/webapp/index.jsp
mkdir -p ./src/main/webapp/css
touch ./src/main/webapp/css/specific.css
mkdir -p ./src/main/webapp/js
touch ./src/main/webapp/js/app.js
touch ./src/main/webapp/js/controllers.js
touch ./src/main/webapp/js/routes.js
touch ./src/main/webapp/js/services.js
touch ./src/main/webapp/js/filters.js
touch ./src/main/webapp/js/services.js
mkdir -p ./src/main/webapp/vendor
mkdir -p ./src/main/webapp/partials
mkdir -p ./src/main/webapp/img
touch README.md
touch .bowerrc
Run the npm initialization to interactively build the program. ‘npm init’ provides a step-by-step
question-and-answer approach to creating the project, as shown in the listing below.
npm init
{
"name": "java-angularjs-seed",
"version": "0.0.0",
"description": "A starter project for AngularJS combined with java and maven",
"main": "index.js",
"scripts": {
"test": "karma start test/resources/karma.conf.js"
},
"repository": {
"type": "git",
"url": "https://github.com/ivonet/java-angular-seed"
},
"author": "Ivo Woltring",
"license": "Apache 2.0",
"bugs": {
"url": "https://github.com/ivonet/java-angular-seed/issues"
},
"homepage": "https://github.com/ivonet/java-angular-seed"
}
{
"name": "java-angular-seed",
"private": true,
"version": "0.0.0",
"description": "A starter project for AngularJS combined with java and maven",
"repository": "https://github.com/ivonet/java-angular-seed",
"license": "Apache 2.0",
"devDependencies": {
"bower": "^1.3.1",
"http-server": "^0.6.1",
"karma": "~0.12",
"karma-chrome-launcher": "^0.1.4",
"karma-firefox-launcher": "^0.1.3",
"karma-jasmine": "^0.1.5",
"karma-junit-reporter": "^0.2.2",
"protractor": "~0.20.1",
"shelljs": "^0.2.6"
},
"scripts": {
"postinstall": "bower install",
"prestart": "npm install",
"start": "http-server src/main/webapp -a localhost -p 8000",
"pretest": "npm install",
"test": "karma start src/test/javascript/karma.conf.js",
"test-single-run": "karma start src/test/javascript/karma.conf.js --single-run",
"preupdate-webdriver": "npm install",
"update-webdriver": "webdriver-manager update",
Figure 10-9. Building the Maven stub for the Angular JS project successfully on the command line
Figure 10-11. Additional configuration file for the Angular JS example application
{
"directory": "src/main/webapp/vendor"
}
bower install angular#1.3.0-beta.14
bower install angular-route#1.3.0-beta.14
bower install angular-animate#1.3.0-beta.14
bower install angular-mocks#1.3.0-beta.14
bower install angular-loader#1.3.0-beta.14
bower install bootstrap
bower init
[?] name: java-angularjs-seed
[?] version: 0.0.0
[?] description: A java / maven / angularjs seed project
[?] main file: src/main/webapp/index.html
[?] what types of modules does this package expose?
[?] keywords: java,maven,angularjs,seed
[?] authors: IvoNet
[?] license: Apache 2.0
[?] homepage: http://ivonet.nl
[?] set currently installed components as dependencies? Yes
[?] add commonly ignored files to ignore list? Yes
[?] would you like to mark this package as private which prevents it from being accidentally published to the registry? Yes
...
{
"name": "java-angularjs-seed",
"version": "0.0.0",
"authors": [
"IvoNet <[email protected]>"
],
"description": "A java / maven / angularjs seed project",
"keywords": [
"java",
"maven",
"angularjs",
"seed"
],
"license": "Apache 2.0",
"homepage": "http://ivonet.nl",
"private": true,
"ignore": [
"**/.*",
"node_modules",
"bower_components",
"src/main/webapp/vendor",
"test",
"tests"
],
"dependencies": {
"angular": "1.3.0-beta.14",
"angular-loader": "1.3.0-beta.14",
"angular-mocks": "1.3.0-beta.14",
"angular-route": "1.3.0-beta.14",
"bootstrap": "3.2.0"
},
"main": "src/main/webapp/index.html"
}
rm -rf ./src/main/webapp/vendor
npm install
module.exports = function(config){
config.set({
basePath : '../../../',
files : [
'src/main/webapp/vendor/angular**/**.min.js',
'src/main/webapp/vendor/angular-mocks/angular-mocks.js',
'src/main/webapp/js/**/*.js',
'src/test/javascript/unit/**/*.js'
],
autoWatch : true,
frameworks: ['jasmine'],
browsers : ['Chrome'],
plugins : [
'karma-chrome-launcher',
'karma-firefox-launcher',
'karma-jasmine',
'karma-junit-reporter'
],
junitReporter : {
outputFile: 'target/test_out/unit.xml',
suite: 'src/test/javascript/unit'
}
});
};
<url>http://ivonet.nl</url>
<properties>
<artifact.name>app</artifact.name>
<endorsed.dir>${project.build.directory}/endorsed</endorsed.dir>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-all</artifactId>
<version>1.9.5</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>javax</groupId>
<artifactId>javaee-api</artifactId>
<version>7.0</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<finalName>${artifact.name}</finalName>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<compilerArguments>
<endorseddirs>${endorsed.dir}</endorseddirs>
</compilerArguments>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-war-plugin</artifactId>
<version>2.4</version>
<configuration>
<failOnMissingWebXml>false</failOnMissingWebXml>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<version>2.6</version>
<executions>
<execution>
<phase>validate</phase>
<goals>
<goal>copy</goal>
</goals>
<configuration>
<outputDirectory>${endorsed.dir}</outputDirectory>
<silent>true</silent>
<artifactItems>
<artifactItem>
<groupId>javax</groupId>
<artifactId>javaee-endorsed-api</artifactId>
<version>7.0</version>
<type>jar</type>
</artifactItem>
</artifactItems>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
We can handcraft user interfaces to suit our application, or we have the option to use some of the
sophisticated visualization tools already available as stand-alone libraries, plug-ins, and toolkits.
Recall that we can visualize data sets directly from graph databases as well. For example, in Neo4j,
we can browse through the crime statistics of Sacramento after loading the CSV data set. Clicking on the
individual nodes causes a summary of the fields to appear at the bottom of the graph display, as shown in
Figure 10-16.
Figure 10-16. Browsing crime statistics as individual nodes from a query in a Neo4j graph database
10.5 Summary
In this chapter we looked at the visual side of the analytics problem: how to see and understand the results of
our analytical processes. The solution to the visualization challenge can be as simple as a CSV report in Excel
all the way up to a sophisticated interactive dashboard. We emphasized the use of Angular JS, a JavaScript application
framework based on the model-view-controller (MVC) paradigm that pairs well with visualization libraries such as D3.js.
In the next chapter, we discuss rule-based control and orchestration module design and implementation.
Rule systems are a type of control system with a venerable history in computer software, and have proven
their effectiveness in a wide range of control and scheduling applications over the course of time.
We will discover that rule-based modules can be a useful component in distributed analytics systems,
especially for scheduling and orchestrating individual processes within the overall application execution.
10.6 References
Ford, Brian and Ruebbelke, Lukas. Angular JS in Action. Boston, MA: O’Reilly Publishing, 2015.
Freeman, Adam. Pro AngularJS. New York, NY: Apress Publishing, 2014.
Frisbie, Matt. AngularJS Web Application Development Cookbook. Birmingham, UK: PACKT Publishing, 2013.
Murray, Scott. Interactive Data Visualization for the Web. Boston, MA: O’Reilly Publishing, 2013.
Pickover, Clifford A. and Tewksbury, Stuart K. (eds.). Frontiers of Scientific Visualization. New York, NY: Wiley-Interscience, 1994.
Teller, Swizec. Data Visualization with d3.js. Birmingham, UK: PACKT Publishing, 2013.
Wolff, Robert S. and Yaeger, Larry. The Visualization of Natural Phenomena. New York, NY: Telos/Springer-Verlag Publishing, 1993.
Zhu, Nick Qi. Data Visualization with D3.js Cookbook. Birmingham, UK: PACKT Publishing, 2013.
PART IV
In the final part of our book, we examine case studies and applications of the kind of distributed
systems we have discussed. We end the book with some thoughts about the future of Hadoop and
distributed analytic systems in general.
CHAPTER 11
In this chapter, we describe an application to analyze microscopic slide data, such as might be found in
medical examinations of patient samples or forensic evidence from a crime scene. We illustrate how a
Hadoop system might be used to organize, analyze, and correlate bioinformatics data.
■■Note This chapter uses a freely available set of fruit fly images to show how microscope images can be
analyzed. Strictly speaking, these images are coming from an electron microscope, which enables a much higher
magnification and resolution of the images than the ordinary optical microscope you probably first encountered
in high school biology. The principles of distributed analytics on a sensor’s data output are the same, however. You
might, for example, use images from a small drone aircraft and perform analytics on the images output from the
drone camera. The software components and many of the analytical operations remain the same.
11.1 Introduction to Bioinformatics
Biology has had a long history as a science, spanning many centuries. Yet only in the last fifty years or so has
the treatment of biological data as computer data come into its own as a way of understanding that information.
Bioinformatics is the understanding of biological data as computer data, and the disciplined analysis of
that computer data. We perform bioinformatics by leveraging specialized libraries to translate and validate
the information contained in biological and medical data sets, such as x-rays, images of microscope slides,
chemical and DNA analysis, sensor information such as cardiograms, MRI data, and many other kinds of
data sources.
The optical microscope has been around for hundreds of years, but it is only relatively recently that
microscope slide images have been analyzed by image processing software. Initially, these analyses were
performed in a very ad-hoc fashion. Now, however, microscope slide images have become “big data” sets in
their own right, and can be analyzed by using a data analytics pipeline as we’ve been describing throughout
the book.
In this chapter, we examine a distributed analytics system specifically designed to perform the
automated microscope slide analysis we saw diagrammed in Figure 8-1. As in our other examples, we
will use standard third-party libraries to build our analytical system on top of Apache Hadoop and Spark
infrastructure.
For an in-depth description of techniques and algorithms for medical bioinformatics, see Kalet (2009).
Before we dive into the example, we should re-emphasize the point made in the note earlier in the
introduction: the processing principles remain the same whether we use electron microscopy images, optical
images of a microscope slide, or even more complex images such as the DICOM images that typically represent X-rays.
■■Note Several domain-specific software components are required in this case study, and include some
packages specifically designed to integrate microscopes and their cameras into a standard image-processing
application.
Figure 11-1. A microscope slide analytics example with software and hardware components
The code example we discuss in this chapter is based on the architecture shown in Figure 11-1. For the
most part we are not concerned with the physical mechanics of the microscope hardware, unless we want
fine control over the microscope's settings. The analytics system begins where the image acquisition part
of the process ends. As with all of our sample applications, we go through a simple technology
stack–assembling phase before we begin to work on our customized code. Working with microscopes is a special
case of image processing, that is, "images as big data," which we will discuss in more detail in Chapter 14.
As we select software components for our technology stack, we also evolve the high-level diagram of
what we want to accomplish in software. One result of this thinking might look like Figure 11-2. We have data
sources (which essentially come from the microscope camera or cameras), processing elements, analytics
elements, and result persistence. Some other components, such as a cache repository to hold intermediate
results, are also necessary.
Figure 11-2. A microscope slide software architecture: high-level software component diagram
Figure 11-3. Original electron microscope slide image, showing a fruit fly tissue slice
We can use three-dimensional visualization tools to analyze a stack of neural tissue slices, as shown in
the examples in Figures 11-7 and 11-8.
package com.apress.probda.image;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
package com.apress.probda.image;
import org.hipi.image.FloatImage;
import org.hipi.image.HipiImageHeader;
import org.hipi.imagebundle.mapreduce.HibInputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class ImageProcess extends Configured implements Tool {
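  // -------------------------------------------------------------------------
  // The rest of this class is a sketch only, following the published HIPI
  // "HelloWorld" pattern; the mapper and job settings below are assumptions,
  // not the author's original code, so adapt them to your own analysis.
  // -------------------------------------------------------------------------
  public static class ImageProcessMapper
      extends Mapper<HipiImageHeader, FloatImage, IntWritable, Text> {
    @Override
    protected void map(HipiImageHeader header, FloatImage image, Context context)
        throws IOException, InterruptedException {
      // Emit the image dimensions as a simple placeholder result.
      context.write(new IntWritable(1), new Text(image.getWidth() + "x" + image.getHeight()));
    }
  }

  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "probda-image-process");
    job.setJarByClass(ImageProcess.class);
    job.setInputFormatClass(HibInputFormat.class); // read images from a HIPI image bundle (HIB)
    job.setMapperClass(ImageProcessMapper.class);
    job.setNumReduceTasks(0);                      // map-only job for this sketch
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new ImageProcess(), args));
  }
}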
Figure 11-9. Successful population of HDFS with University of Virginia’s HIPI system
Check that the images have been loaded successfully with the HibInfo.sh tool by typing the following on
the command line:
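A typical invocation looks something like the following; the HIB name here is a placeholder, and the --show-meta switch (which prints the stored metadata) is an assumption taken from the HIPI tool documentation, so check the tool's usage output if the option name differs in your HIPI version:
tools/hibInfo.sh flyimages.hib --show-meta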
Figure 11-10. Successful description of HDFS images (with metadata information included)
11.4 Summary
In this chapter, we described an example application which uses distributed bioinformatics techniques to
analyze microscope slide data.
In the next chapter, we will talk about a software component based on a Bayesian approach to
classification and data modeling. This turns out to be a very useful technique to supplement our distributed
data analytics system, and has been used in a variety of domains including finance, forensics, and medical
applications.
11.5 References
Gerhard, Stephan, Funke, Jan, Martel, Julien, Cardona, Albert, and Fetter, Richard. “Segmented anisotropic
ssTEM dataset of neural tissue.” Retrieved 16:09, Nov 20, 2013 (GMT) http://dx.doi.org/10.6084/
m9.figshare.856713
Kalet, Ira J. Principles of Biomedical Informatics. London, UK: Academic Press Elsevier, 2009.
Nixon, Mark S., and Aguado, Alberto S. Feature Extraction & Image Processing for Computer Vision,
Third Edition. London, UK: Academic Press Elsevier, 2008.
CHAPTER 12
A Bayesian Analysis Component: Identifying Credit Card Fraud
In this chapter, we describe a Bayesian analysis software component plug-in which may be used to analyze
streams of credit card transactions in order to identify fraudulent use of the credit card by illicit users.
■■Note We will primarily use the Naïve Bayes implementation provided by Apache Mahout, but we will
discuss several potential solutions to using Bayesian analysis in general.
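For reference, the Mahout command-line workflow for a Naïve Bayes classifier looks roughly like the sketch below. The HDFS paths are hypothetical placeholders, and the option names follow Mahout's standard classification example; run mahout trainnb --help to confirm the flags in your Mahout version.
# convert prepared training data (sequence files) into TF-IDF vectors
mahout seq2sparse -i /probda/creditcard/seqfiles -o /probda/creditcard/vectors -lnorm -nv -wt tfidf
# train the Naive Bayes model and write out the label index
mahout trainnb -i /probda/creditcard/vectors/tfidf-vectors -o /probda/creditcard/model -li /probda/creditcard/labelindex -ow -c
# test the model against held-out data
mahout testnb -i /probda/creditcard/test-vectors -m /probda/creditcard/model -l /probda/creditcard/labelindex -ow -o /probda/creditcard/results -c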
■■Note Bayesian analysis is a gigantic area of continually evolving concepts and technologies, which now
include deep learning and machine learning aspects. Some of the references at the end of the chapter provide
an overview of concepts, algorithms, and techniques, which have been used so far in Bayesian analysis.
Bayesian techniques are particularly relevant to an ongoing financial problem: the identification of
credit card fraud. Let's take a look at a simple credit card fraud algorithm, as shown in Figure 12-1. The
implementation and algorithm shown are based on the work of Tripathi and Ragha (2013).
We will describe how to build a distributed credit card fraud detector based on the algorithm shown in
Figure 12-1, using some of the by now familiar strategies and techniques described in previous chapters.
Figure 12-1. A credit card fraud detection algorithm, following Tripathi and Ragha (2013)
First things first: add an environment variable to your .bash_profile file for this application:
export CREDIT_CARD_HOME=/Users/kkoitzsch/probda/src/main/resources/creditcard
Next, let's get some credit card test data. We start with the data sets found at
https://www.cs.purdue.edu/commugrate/data/credit_card/. This data set was the basis for one of the
data mining code challenges of 2009. We are only interested in these files:
DataminingContest2009.Task2.Test.Inputs
DataminingContest2009.Task2.Train.Inputs
DataminingContest2009.Task2.Train.Targets
Let's look at the structure of the credit card transaction record. Each line in the CSV file is a transaction
record consisting of the following fields:
amount,hour1,state1,zip1,custAttr1,field1,custAttr2,field2,hour2,flag1,total,field3,field4,indicator1,indicator2,flag2,flag3,flag4,flag5
000000000025.90,00,CA,945,1234567890197185,3,[email protected],0,00,0,000000000025.90,2525,8,0,0,1,0,0,2
000000000025.90,00,CA,940,1234567890197186,0,[email protected],0,00,0,000000000025.90,3393,17,0,0,1,1,0,1
000000000049.95,00,CA,910,1234567890197187,3,[email protected],1,00,0,000000000049.95,-737,26,0,0,1,0,0,1
000000000010.36,01,CA,926,1234567890197202,2,[email protected],0,01,1,000000000010.36,483,23,0,0,1,1,0,1
000000000049.95,01,CA,913,1234567890197203,3,[email protected],0,01,0,000000000049.95,2123,23,1,0,1,1,0,1
…and more.
Looking at the standard structure for the CSV line in this data set, we notice something about field
number 4: while it has a 16-digit credit-card-like code, it doesn’t conform to a standard valid credit card
number that would pass the Luhn test.
We write a program that will modify the event file to something more suitable: the fourth field of each
record will now contain a “valid” Visa or Mastercard randomly generated credit card number, as shown in
Figure 12-2. We want to introduce a few “bad” credit card numbers just to make sure our detector can spot them.
Figure 12-2. Merging valid and invalid “real” credit card numbers with test data
The Luhn credit card number verification algorithm is shown in the flowchart in Figure 12-3.
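The flowchart translates directly into a few lines of Java. The class and method names below are illustrative rather than taken from the book's code base; the check itself is the standard Luhn algorithm.
public class LuhnValidator {

    /** Returns true if the card number string (digits only) passes the Luhn test. */
    public static boolean passesLuhnTest(String cardNumber) {
        int sum = 0;
        boolean doubleDigit = false;
        // Walk the digits from right to left, doubling every second digit.
        for (int i = cardNumber.length() - 1; i >= 0; i--) {
            int digit = Character.getNumericValue(cardNumber.charAt(i));
            if (doubleDigit) {
                digit *= 2;
                if (digit > 9) {
                    digit -= 9;   // equivalent to summing the two digits of the product
                }
            }
            sum += digit;
            doubleDigit = !doubleDigit;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(passesLuhnTest("4012888888881881"));  // standard Visa test number: true
        System.out.println(passesLuhnTest("1234567890197185"));  // value from the raw data set: false
    }
}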
Take a look at the algorithm flowchart in Figure 12-4. The process involves a training phase and a
detection phase.
Figure 12-4. Training and detection phases of a credit card fraud detection algorithm
Figure 12-5. Starting Zookeeper from the command line or script is straightforward
Figure 12-6. Starting the Apache Storm supervisor from the command line
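On a typical installation, the commands behind Figures 12-5 and 12-6 are just the stock ZooKeeper and Storm launch scripts; the exact paths depend on where each package is installed, so treat these as representative rather than exact:
# start ZooKeeper from its installation directory
bin/zkServer.sh start
# start the Apache Storm supervisor (nimbus and the UI are started the same way)
bin/storm supervisor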
You can run the complete examples from the code contribution.
12.3 Summary
In this chapter, we discussed a software component developed around a Bayesian classifier, specifically
designed to identify credit card fraud in a data set. This kind of application has been re-thought and
re-implemented many times; here we wanted to showcase an implementation that reuses some of the
software techniques we've already developed throughout the book to motivate our discussion.
In the next chapter, we will talk about a real-world application: looking for mineral resources with
a computer simulation. “Resource finding” applications are a common type of program in which real-
world data sets are mined, correlated, and analyzed to identify likely locations of a “resource,” which might
be anything from oil in the ground to clusters of trees in a drone image, or a particular type of cell in a
microscopic slide.
12.4 References
Bolstad, William M. Introduction to Bayesian Statistics. New York, NY: John Wiley and Sons, Inc., 2004.
Castillo, Enrique, Gutierrez, Jose Manuel, and Hadi, Ali S. Expert Systems and Probabilistic Network
Models. New York, NY: Springer-Verlag, 1997.
Darwiche, Adnan. Modeling and Reasoning with Bayesian Networks. New York, NY: Cambridge
University Press, 2009.
Kuncheva, Ludmila. Combining Pattern Classifiers: Methods and Algorithms. Hoboken, NJ: Wiley Inter-
Science, 2004.
Neapolitan, Richard E. Probabilistic Reasoning in Expert Systems: Theory and Algorithms. New York, NY:
John Wiley and Sons, Inc., 1990.
Schank, Roger, and Riesbeck, Christopher. Inside Computer Understanding: Five Programs Plus Miniatures.
Hillsdale, NJ: Lawrence Erlbaum Associates, 1981.
Tripathi, Krishna Kumar and Ragha, Lata. “Hybrid Approach for Credit Card Fraud Detection” in
International Journal of Soft Computing and Engineering (IJSCE) ISSN: 2231-2307, Volume-3, Issue-4,
September 2013.
CHAPTER 13
Searching for Oil: Geographical Data Analysis with Apache Mahout
In this chapter, we discuss a particularly interesting application for distributed big data analytics: using
a domain model to look for likely geographic locations for valuable minerals, such as petroleum, bauxite
(aluminum ore), or natural gas. We touch on a number of convenient technology packages to ingest,
analyze, and visualize the resulting data, especially those well-suited for processing geolocations and other
geography-related data types.
■■Note In this chapter we use Elasticsearch version 2.3. This version also provides the facility to use the
MapQuest map visualizations you will see throughout this chapter and elsewhere in the book.
In this type of system, several types of knowledge source are typically used, according to Khan's "Prospector
Expert System" (https://www.scribd.com/doc/44131016/Prospector-Expert-System): rules (similar to
those found in the JBoss Drools system), semantic nets, and frames (a somewhat hybrid approach which
is discussed thoroughly in Schank and Riesbeck (1981)). Like other object-oriented systems, frames support
inheritance, persistence, and the like.
In Figure 13-1, we show an abstracted view of a "hypothesis generator," one way in which we can predict
resource locations, such as petroleum. The hypothesis generator for this example is based on JBoss Drools,
which we discussed in Chapter 8.
Figure 13-2. A Mahout-based software component architecture for geographical data analysis
In the example program, we use a DBF importer program, such as the one shown in Listing 13-1, to
import data from DBF files.
Elasticsearch is a very flexible data repository, and a wide variety of data formats may be imported into it.
To get used to the Elasticsearch mechanisms, download and load a few of the standard sample data sets
provided for testing Elasticsearch and Kibana, such as the log sample at
https://www.elastic.co/guide/en/kibana/3.0/snippets/logs.jsonl.
■■Note In a previous chapter we used Apache Tika to read DBF files. In this chapter, we will use an
alternative DBF reader by Sergey Polovko (Jamel). You can download this DBF reader from GitHub at
https://github.com/jamel/dbf.
Listing 13-1. A simple DBF reader for geological data source information
package com.apress.probda.applications.oilfinder;

import java.io.File;
import java.util.Date;
import java.util.List;

// The DbfProcessor / DbfRowMapper usage below is a reconstruction based on the
// jamel/dbf library's row-mapper API; treat the importer class name and the exact
// library method names as assumptions and check them against the version you download.
import org.jamel.dbf.processor.DbfProcessor;
import org.jamel.dbf.processor.DbfRowMapper;

public class OilDataImporter {

    public static void main(String[] args) {
        File dbf = new File(args[0]); // path to the geological DBF data source
        List<OilData> rows = DbfProcessor.loadData(dbf, new DbfRowMapper<OilData>() {
            int rownum = 0;

            @Override
            public OilData mapRow(Object[] row) {
                // ... index the row into Elasticsearch here ...
                System.out.println("....Reading row: " + rownum + " into elasticsearch....");
                rownum++;
                System.out.println("------------------------");
                return new OilData(); // customize your constructor here
            }
        });
    }
}

/** We will flesh out this information class as we develop the example.
 *
 * @author kkoitzsch
 *
 */
class OilData {
    String _name;
    int _value;
    Date _createdAt;

    public OilData() {
    }
}
Of course, reading the geographical data (including the DBF file) is really only the first step in the
analytical process.
Figure 13-3. A test query to verify Elasticsearch has been populated correctly with test data sets
Figure 13-4. The Elasticsearch-Hadoop connector and its relationship to the Hadoop ecosystem and HDFS
We can threshold our values and supply constraints on “points of interest” (spacing, how many
points of interest per category, and other factors), to produce visualizations showing likelihood of desired
outcomes.
Evidence and probabilities of certain desired outcomes can be stored in the same data structure, as
shown in Figure 13-5. The blue regions are indicative of a likelihood that there is supporting evidence for
the desired outcome, in this case, the presence of petroleum or petroleum-related products. Red and yellow
circles indicate high and moderate points of interest in the hypothesis space. If the grid coordinates happen
to be geolocations, one can plot the resulting hypotheses on a map similar to those shown in Figure 13-6 and
Figure 13-7.
Figure 13-6. Using Kibana and Elasticsearch for map visualization in a Texas example using latitude and
longitude, and simple counts of an attribute
We can run simple tests to ensure Kibana and Elasticsearch are displaying our geolocation data correctly.
Now it is time to describe our Mahout analytical component. For this example, we will keep the
analytics very simple in order to outline our thought process. Needless to say, the mathematical models of
real-world resource finders would need to be much more complex, adaptable, and allow for more variables
within the mathematical model.
We can use another very useful tool to prototype and view some of our data content residing in Solr using
the Spatial Solr Sandbox tool by Ryan McKinley (https://github.com/ryantxu/spatial-solr-sandbox).
Figure 13-7. Using the Spatial Solr Sandbox tool to query a Solr repository for geolocation data
SC systems can use a variety of sensor types, image formats, image resolutions, and data ingestion rates,
and may use machine learning techniques, rule-based techniques, or inference processes to refine and
adapt feature identification for more accurate and efficient matching between satellite image features, such
as locations (latitude/longitude information), image features (such as lakes, roads, airstrips, or rivers), and
man-made objects (such as buildings, shopping centers, or airports).
Users of an SC system may provide feedback as to the accuracy of the computed match, which in turn
allows the matching process to become more accurate over time as refinement takes place. The system may
operate specifically on features selected by the user, such as the road network or man-made features such as
buildings.
Finally, the SC matching process provides accuracy measures of the matches between images and
ground truth data, as well as complete error and outlier information to the user in the form of reports or
dashboard displays.
SC systems can provide an efficient and cost-effective way to evaluate satellite imagery for quality,
accuracy, and consistency within an image sequence, and can address issues of high-resolution accuracy,
task time to completion, scalability, and near real-time processing of satellite imagery, as well as providing a
high-performance software solution for a variety of satellite image evaluation tasks.
One useful component to include in geolocation-centric systems is Spatial4j (https://github.com/
locationtech/spatial4j), a helper library which provides spatial and geolocation functionality for Java
programs, and which evolved from earlier work such as the Spatial Solr Sandbox toolkit discussed above.
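As a quick illustration of the kind of helper functionality Spatial4j provides, the sketch below computes a great-circle distance between two points. The package prefix (org.locationtech.spatial4j) is the one used by newer releases; older releases used com.spatial4j.core, so adjust the imports to match the version you actually build against, and treat the coordinates as arbitrary sample values.
import org.locationtech.spatial4j.context.SpatialContext;
import org.locationtech.spatial4j.distance.DistanceUtils;
import org.locationtech.spatial4j.shape.Point;

public class GeoDistanceExample {
    public static void main(String[] args) {
        SpatialContext ctx = SpatialContext.GEO;            // geodetic (lat/lon) context
        Point austin = ctx.makePoint(-97.74, 30.27);        // x = longitude, y = latitude
        Point dallas = ctx.makePoint(-96.80, 32.78);
        double degrees = ctx.calcDistance(austin, dallas);  // great-circle distance in degrees
        double km = DistanceUtils.degrees2Dist(degrees, DistanceUtils.EARTH_MEAN_RADIUS_KM);
        System.out.println("Austin to Dallas: " + km + " km");
    }
}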
Figure 13-8. Running the tests for Spatial4j, a commonly used geolocation Java toolkit library
13.3 Summary
In this chapter, we talked about the theory and practice of searching for oil and other natural resources
using big data analytics as a tool. We were able to load DBF data, manipulate and analyze the data with
Mahout-based code, and output the results to a simple visualizer. We also talked about some helpful
libraries to include in any geolocation-centric application, such as Spatial4j and SpatialHadoop.
In the next chapter, we will talk about a particularly interesting area of big data analytics: using images
and their metadata as a data source for our analytical pipeline.
13.4 References
Gheorghe, Radu, Hinman, Matthew Lee, and Russo, Roy. Elasticsearch in Action. Sebastopol, CA: O’Reilly
Publishing, 2015.
Giacomelli, Piero. Apache Mahout Cookbook. Birmingham, UK: PACKT Publishing, 2013.
Owen, Sean, Anil, Robin, Dunning, Ted, and Friedman, Ellen. Mahout in Action. Shelter Island, NY:
Manning Publications, 2011.
CHAPTER 14
"Image As Big Data" Systems: Some Case Studies
In this chapter, we will provide a brief introduction to an example toolkit, the Image as Big Data Toolkit
(IABDT), a Java-based open source framework for performing a wide variety of distributed image processing
and analysis tasks in a scalable, highly available, and reliable manner. IABDT is an image processing
framework developed over the last several years in response to the rapid evolution of big data technologies
in general, but in particular distributed image processing technologies. IABDT is designed to accept many
formats of imagery, signals, sensor data, metadata, and video as data input.
A general architecture for image analytics, big data storage, and compression methods for imagery
and image-derived data is discussed, as well as standard techniques for image-as-big-data analytics.
IABDT, a sample implementation of our image analytics architecture, addresses some of the more frequently
encountered challenges experienced by the image analytics developer, including importing images into
a distributed file system or cache, image preprocessing and feature extraction, analysis, and result
visualization. Finally, we showcase some of the features of IABDT, with special emphasis on display,
presentation, reporting, dashboard building, and user interaction case studies to motivate and explain our
design and methodology stack choices.
Our example toolkit IABDT provides a flexible, modular architecture which is plug-in-oriented. This
makes it possible to combine many different software libraries, toolkits, systems, and data sources within
one integrated, distributed computational framework. IABDT is a Java- and Scala-centric framework, as it
uses both Hadoop and its ecosystem as well as the Apache Spark framework with its ecosystem to perform
the image processing and image analytics functionality.
IABDT may be used with NoSQL databases such as MongoDB, Neo4j, Giraph, or Cassandra, as well
as with more traditional relational database systems such as MySQL or Postgres, to store computational
results and serve as data repositories for intermediate data generated by pre- and post-processing stages in
the image processing pipeline. This intermediate data might consist of feature descriptors, image pyramids,
boundaries, video frames, ancillary sensor data such as LIDAR, or metadata. Software libraries such as
Apache Camel and Spring Framework may be used as “glue” to integrate components with one another.
One of the motivations for creating IABDT is to provide a modular, extensible infrastructure for
performing preprocessing, analysis, and visualization and reporting of analysis results—specifically
for images and signals. These modules leverage the power of distributed processing (as with the Apache Hadoop and
Apache Spark frameworks) and are inspired by such toolkits as OpenCV, BoofCV, HIPI, Lire, Caliph, Emir,
ImageTerrier, Apache Mahout, and many others. The features and characteristics of these image toolkits
are summarized in Table 14-1. IABDT provides frameworks, modular libraries, and extensible examples
to perform big data analysis on images using efficient, configurable, and distributed data pipelining
techniques.
Image as Big Data toolkits and components are becoming resources in an arsenal of other distributed
software packages based on Apache Hadoop and Apache Spark, as shown in Figure 14-1.
Some of the distributed implementations of the module types in Figure 14-1 which are implemented in
IABDT include:
Genetic Systems. There are many genetic algorithms particularly suited to image analytics, including
techniques for sampling a large solution space, feature extraction, and classification. The first two categories
of technique are most applicable to the image pre-processing and feature extraction phases of the analytical
process, while distributed classification techniques—even those using multiple classifiers—are applied in the
analysis phase itself.
Bayesian Techniques. Bayesian techniques include the naïve Bayes algorithm found in most
machine learning toolkits, but also much more, such as Bayesian network modeling and inference.
Hadoop Ecosystem Extensions. New extensions can be built on top of existing Hadoop components to
provide customized “image as big data” functionality.
Clustering, Classification, and Recommendation. These three types of analytical algorithms are present
in most standard libraries, including Mahout, MLlib, and H2O, and they form the basis for more complex
analytical systems.
Hybrid systems combine many disparate component types into one integrated whole to perform
a single function. Typically, hybrid systems contain a control component, which might be a rule-based
system such as Drools, or another standard control component such as Oozie, used for scheduling tasks or
other purposes. Another example is Luigi for Python (https://github.com/spotify/luigi), which
comes with built-in Hadoop support. If you want to try Luigi out, install it using Git by cloning it into a
convenient subdirectory, and then start the Luigi daemon:
git clone https://github.com/spotify/luigi.git
./luigid
Figure 14-3. A HIPI image data flow, consisting of bundling, culling, map/shuffle, and reduce to the end result
The HIPI image bundle, or "HIB," is the structured storage method used by HIPI to group images into one
physical unit. The cull phase allows each HIB to be filtered based on appropriate programmatic criteria.
Images that are culled out are not fully decoded, making the HIPI pipeline much more efficient. The output
of the cull phase results in image sets as shown in the diagram. Each image set has its own map phase,
followed by a shuffle phase and corresponding reduce steps to create the final result. So, as you can see, the
HIPI data flow is similar to the standard map-reduce data flow process. We reproduce the Hadoop data flow
process in Figure 14-4 for your reference.
Figure 14-4. A reference diagram of the classic map-reduce data flow, for comparison with 14-3
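The HIPI source itself is obtained by cloning the project repository from GitHub; the uvagfx/hipi location below is where the University of Virginia project was hosted at the time of writing, so check the HIPI site if it has since moved:
git clone https://github.com/uvagfx/hipi.git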
This will install the source code into a “hipi” directory. Cd to this “hipi” directory and “ls” the contents
to review. You will need a Gradle build tool installation to install from the source. The resulting build will
appear similar to Figure 14-5.
Gradle is another useful installation and build tool which is similar to Maven. Some systems, such as
HIPI, are much easier to install using Gradle than with other techniques such as Maven.
Figure 14-6. Example using the HIPI info utility: image info about a 10-image HIB in the HIPI system
Installation of HIPI is only the first step, however! We have to integrate our HIPI processor with the
analytical components to produce our results.
Table 14-2. A selection of methods from the Image as Big Data Toolkit
Table 14-3. Display methods for visualization provided by the IABDT. Most object types in the IABDT may be
displayed using similar methods
The image data source processor is the component responsible for data acquisition, image cleansing,
format conversion, and other operations to “massage” the data into formats acceptable to the other pipeline
components.
The analytical engine components can be support libraries such as R and Weka.
Intermediate data sources are the outputs of initial analytical computation.
The user control dashboard is an interactive, event-handling component.
The control and configuration modules consist of rule components such as Drools or other rule-engine
or control components, and may contain other "helper" libraries for tasks such as scheduling, data filtering
and refinement, and overall system control and configuration. Typically, ZooKeeper and/or Curator
may be used to coordinate and orchestrate the control and configuration tasks.
The distributed system infrastructure module contains underlying support and “helper” libraries.
The persistent result repositories can be any of a variety of types of data sink, including relational,
graph, or NoSQL type databases. In-memory key-value data stores may also be used if appropriate.
The reporting modules typically consist of old-school tabular or chart presentations of analytical results.
User interaction, control, and feedback is supplied by the IABDT interaction modules, which include
default dashboards for common use cases.
Visualization modules consist of support libraries for displaying images, overlays, feature locations, and
other visual information which make interacting with and understanding the data set easier.
Consolidated views of the same objects, image displays which process image sequences, and image
overlay capability are all provided by the IABD toolkit.
Dashboard, display, and interactive interfaces—both standalone application and web based—may
be built with the IABDT user interface building module. Support for standard types of display, including
overlays, and geolocation data, are provided in the prototype IABDT.
The raw image moment m_pq may be written as

m_{pq} = \sum_{x=-n}^{n} \sum_{y=-n}^{n} x^{p}\, y^{q}\, g(x, y)

where g(x, y) is a two-dimensional index into the image g. A so-called central moment may be defined, with (x', y') the image centroid, as

\mu_{pq} = \sum_{x=-n}^{n} \sum_{y=-n}^{n} (x - x')^{p}\, (y - y')^{q}\, g(x, y)

and the normalized central moment as

\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \qquad \gamma = \frac{p + q}{2} + 1
Rotation and scale invariant central moments can be characterized, following Hu:

\phi_1 = \eta_{20} + \eta_{02}

\phi_6 = (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\,\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})

\phi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
A map/reduce task in Hadoop can be coded explicitly from the moment equations, first in Java
for experimental purposes — to test the program logic and make sure the computed values conform
to expectations — and then converted to the appropriate map/reduce constructs. A sketch of the
Java implementation is shown in Listing 14-1. We use a standard Java class, com.apress.probda.core.
ComputationalResult, to hold the answers and the "centroid" (which is also computed by our algorithm):
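As a point of reference, a minimal single-machine sketch of the raw-moment and centroid computation might look like the following; the class and method names here are illustrative only, not the book's ComputationalResult API.
public class MomentFeaturesSketch {

    /** Raw moment m_pq of a grayscale image supplied as a 2D array g[y][x]. */
    public static double rawMoment(double[][] g, int p, int q) {
        double sum = 0.0;
        for (int y = 0; y < g.length; y++) {
            for (int x = 0; x < g[y].length; x++) {
                sum += Math.pow(x, p) * Math.pow(y, q) * g[y][x];
            }
        }
        return sum;
    }

    /** The centroid is (m10 / m00, m01 / m00). */
    public static double[] centroid(double[][] g) {
        double m00 = rawMoment(g, 0, 0);
        return new double[] { rawMoment(g, 1, 0) / m00, rawMoment(g, 0, 1) / m00 };
    }

    public static void main(String[] args) {
        double[][] image = { { 0, 1, 0 }, { 1, 4, 1 }, { 0, 1, 0 } };
        double[] c = centroid(image);
        System.out.println("m00 = " + rawMoment(image, 0, 0)
                + ", centroid = (" + c[0] + ", " + c[1] + ")");
    }
}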
From this simple Java implementation, we can then implement map, reduce, and combine methods
with signatures such as those shown in Listing 14-2.
Listing 14-2. HIPI map/reduce method signatures for moment feature extraction computation
// Method signatures for the map() and reduce() methods for
// the moment feature extraction module
public void map(HipiImageHeader header, FloatImage image, Context context)
        throws IOException, InterruptedException
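The corresponding reduce() signature, assuming IntWritable keys and FloatImage values emitted by the mapper as in the standard HIPI examples, would be:
public void reduce(IntWritable key, Iterable<FloatImage> values, Context context)
        throws IOException, InterruptedException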
Let's recall the microscopy example from Chapter 11. It's a fairly typical unstructured data pipeline
processing and analysis problem in some ways. As you recall, image sequences start out as an ordered list of
images — they may be arranged by timestamp or in more complex arrangements such as geolocation, stereo
pairing, or order of importance. You can imagine that in a medical application with dozens of
medical images of the same patient, those showing life-threatening anomalies should be brought to the front of
the queue as soon as possible.
Other image operations might be good candidates for distributed processing, such as the Canny edge
operation, coded up in BoofCV in Listing 14-3.
import java.awt.image.BufferedImage;
import java.util.List;
import com.kildane.iabdt.model.Camera;
import boofcv.alg.feature.detect.edge.CannyEdge;
import boofcv.alg.feature.detect.edge.EdgeContour;
import boofcv.alg.filter.binary.BinaryImageOps;
import boofcv.alg.filter.binary.Contour;
import boofcv.factory.feature.detect.edge.FactoryEdgeDetectors;
import boofcv.gui.ListDisplayPanel;
import boofcv.gui.binary.VisualizeBinaryData;
import boofcv.gui.image.ShowImages;
import boofcv.io.UtilIO;
import boofcv.io.image.ConvertBufferedImage;
import boofcv.io.image.UtilImageIO;
import boofcv.struct.ConnectRule;
import boofcv.struct.image.ImageSInt16;
import boofcv.struct.image.ImageUInt8;
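A sketch of how the detector itself is typically wired up, following BoofCV's published Canny example, is shown below; the factory and process signatures can differ slightly between BoofCV releases, the input path is a placeholder, and the display-related imports above would be used for the visualization step, which is omitted here.
public class ProbdaCannyEdgeExample {
    public static void main(String[] args) {
        // Load the slide image and convert it to a BoofCV grayscale image.
        BufferedImage input = UtilImageIO.loadImage(UtilIO.pathExample("slide0001.jpg"));
        ImageUInt8 gray = ConvertBufferedImage.convertFrom(input, (ImageUInt8) null);
        ImageUInt8 edgeImage = new ImageUInt8(gray.width, gray.height);

        // Create the Canny detector with a blur radius of 2, keeping a trace of the edge contours.
        CannyEdge<ImageUInt8, ImageSInt16> canny =
                FactoryEdgeDetectors.canny(2, true, true, ImageUInt8.class, ImageSInt16.class);
        canny.process(gray, 0.1f, 0.3f, edgeImage);

        // Contours of the detected edges, and contours of the resulting binary image.
        List<EdgeContour> edgeContours = canny.getContours();
        List<Contour> contours = BinaryImageOps.contour(edgeImage, ConnectRule.EIGHT, null);

        System.out.println("Found " + edgeContours.size() + " edge contours and "
                + contours.size() + " binary contours");
    }
}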
Interest points are well-defined, stable image-space locations which are of particular interest for an analysis.
For example, you might notice in Figure 14-9 that the points of interest occur at the junction points connecting
other structures in the image. Corners, junctions, contours, and templates may be used to identify what we
are looking for within images, and statistical analysis can be performed on the results we find.
Figure 14-9. Finding interest points in an image: the circled + signs are the interest points
Figure 14-10. Input process for IABD Toolkit, showing image preprocessing components
Data sources may be processed in "batch mode" or in "streaming mode" by the data flow pipeline. The
image data source preprocessor may perform image-centric preprocessing such as feature extraction,
region identification, image pyramid construction, and other tasks to make the image processing part of the
pipeline easier.
The two-dimensional discrete Fourier transform of an N × N image P is given by

FP_{u,v} = \frac{1}{N} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} P_{x,y}\, e^{-j\left(\frac{2\pi}{N}\right)(ux + vy)}
Canny Edge Operators. The Canny operator can be approximated by the steps of Gaussian smoothing,
the Sobel operator, a non-maximal suppression stage, and thresholding with hysteresis (a special kind of
thresholding) to connect edge points. The extracted two-dimensional shapes may be persisted to an
IABDT data source.
Line, Circle, and Ellipse Extraction Operators. There are feature extraction algorithms for extracting line,
circle, and ellipse shape primitives from two-dimensional image data. Several sample implementations are
included in the toolkit.
14.9 Terminology
Below is a brief summary of some of the terms associated with image processing and ‘image as big data’
concepts.
Agency-Based Systems: Cooperative multi-agent systems, or agencies, are an effective way to design
and implement IABD systems. Individual agent node processes cooperate in a programmed network
topology to achieve common goals.
Bayesian Image Processing: Array-based image processing using Bayesian techniques typically
involves constructing and computing with a Bayes network, a graph in which the nodes are considered as
random variables and the graph edges are conditional dependencies. Random variables and conditional
dependencies are standard concepts from fundamental Bayesian statistics. Following Opper
and Winther, we can characterize Bayesian optimal prediction as the prediction that minimizes the
expected loss under the posterior distribution.
Object hypotheses, prediction, and sensor fusion are typical problem areas for Bayesian image
processing.
Classification Algorithm: Distributed classification algorithms within the IABDT include large- and
small-margin classifiers (a margin is the confidence level of a classification). A variety of techniques
including genetic algorithms, neural nets, boosting, and support vector machines (SVMs) may be used
for classification. Related distributed clustering algorithms, such as standard k-means or fuzzy k-means
techniques, are included in standard support libraries such as Apache Mahout.
Deep Learning (DL): A branch of machine learning based on learned data representations and on
algorithms modeling high-level data abstractions. Deep learning uses multiple, complex processing levels
and multiple non-linear transformations.
Distributed System: Software systems based on a message-passing architecture over a networked
hardware topology. Distributed systems may be implemented in part by software frameworks such as
Apache Hadoop and Apache Spark.
Image As Big Data (IABD): The IABD concept entails treating signals, images, and video in some
ways, as any other source of “big data”, including the 4V conceptual basis of “variety, volume, velocity,
and veracity”. Special requirements for IABD include various kinds of automatic processing, such as
compression, format conversion, and feature extraction.
Machine learning (ML): Machine learning techniques may be used for a variety of image processing
tasks, including feature extraction, scene analysis, object detection, hypothesis generation, model building
and model instantiation.
Neural net: Neural nets are a kind of mathematical model inspired by biological neural systems and
models of high-level reasoning in humans. Many types of distributed neural net algorithm are useful for
image analysis, feature extraction, and two- and three-dimensional model building from images.
Ontology-driven modeling: Ontologies, as descriptions of the entities within a model and the
relationships between those entities, may be developed to drive and inform a modeling process, in which
model refinements, metadata, and even new ontological forms and schemas are evolved as an output of the
modeling process.
Sensor fusion: Combination of information from multiple sensors or data sources into an integrated,
consistent, and homogeneous data model. Sensor fusion may be accomplished by a number of
mathematical techniques, including some Bayesian techniques.
Taxonomy: A scheme of classification and naming which builds a catalog. Defining, generating, or
modeling a hierarchy of objects may be helped by leveraging taxonomies and related ontological data
structures and processing techniques.
14.10 Summary
In this chapter, we discussed the "image as big data" concept and why it is an important concept in the world
of big data analytics techniques. The current architecture, features, and use cases for a new image-as-big-
data toolkit (IABDT) were described. In it, the complementary technologies of Apache Hadoop and Apache
Spark, along with their respective ecosystems and support libraries, have been unified to provide low-level
image processing operations — as well as sophisticated image analysis algorithms which may be used to
develop distributed, customized image processing pipelines.
In the next chapter, we discuss how to build a general-purpose data processing pipeline using many of
the techniques and technology stacks we’ve learned from previous chapters in the book.
14.11 References
Akl, Selim G. (1989). The Design and Analysis of Parallel Algorithms. Englewood Cliffs, NJ: Prentice Hall.
Aloimonos, J., & Shulman, D. (1989). Integration of Visual Modules: An Extension of the Marr Paradigm.
San Diego, CA: Academic Press Professional Inc.
Ayache, N. (1991). Artificial Vision for Mobile Robots: Stereo Vision and Multisensory Perception.
Cambridge, MA: MIT Press.
Baggio, D., Emami, S., Escriva, D, Mahmood, M., Levgen, K., Saragih, J. (2011). Mastering OpenCV with
Practical Computer Vision Projects. Birmingham, UK: PACKT Publications.
Barbosa, Valmir. (1996). An Introduction to Distributed Algorithms. Cambridge, MA: MIT Press.
Berg, M., Cheong, O., van Kreveld, M., Overmars, M. (Ed.). (2008). Computational Geometry: Algorithms
and Applications. Berlin Heidelberg, Germany: Springer-Verlag.
Bezdek, J. C., Pal, S. K. (1992).(Ed.) Fuzzy Models for Pattern Recognition: Methods That Search for
Structures in Data. New York, NY: IEEE Press.
Blake, A., and Yuille, A. (Ed.). (1992). Active Vision. Cambridge, MA: MIT Press.
Blelloch, G. E. (1990). Vector Models for Data-Parallel Computing. Cambridge, MA: MIT Press.
Burger, W., & Burge, M. J. (Ed.). (2016). Digital Image Processing: An Algorithmic Introduction Using
Java, Second Edition. London, U.K. :Springer-Verlag London.
Davies, E.R. (Ed.). (2004). Machine Vision: Theory, Algorithms, Practicalities. Third Edition. London,
U.K: Morgan Kaufmann Publishers.
Faugeras, O. (1993). Three Dimensional Computer Vision: A Geometric Viewpoint. Cambridge, MA: MIT
Press.
Freeman, H. (Ed.) (1988). Machine Vision: Algorithms, Architectures, and Systems. Boston, MA:
Academic Press, Inc.
Giacomelli, Piero. (2013). Apache Mahout Cookbook. Birmingham, UK: PACKT Publishing.
Grimson, W. E. L.; Lozano-Pérez, T.; Huttenlocher, D. (1990). Object Recognition by Computer: The Role of
Geometric Constraints. Cambridge, MA: MIT Press.
Gupta, Ashish. (2015). Learning Apache Mahout Classification. Birmingham, UK: PACKT Publishing.
Hare, J., Samangooei, S. and Dupplaw, D. P. (2011). OpenIMAJ and ImageTerrier: Java libraries and
tools for scalable multimedia analysis and indexing of images. In Proceedings of the 19th ACM international
conference on Multimedia (MM '11). ACM, New York, NY, USA, 691-694. DOI=10.1145/2072298.2072421
http://doi.acm.org/10.1145/2072298.2072421
Kulkarni, A. D. (1994). Artificial Neural Networks for Image Understanding. New York, NY: Van Nostrand
Reinhold.
Kumar, V., Gopalakrishnan, P.S., Kanal, L., (Ed.). (1990). Parallel Algorithms for Machine Intelligence and
Vision. New York NY: Springer-Verlag New York Inc.
Kuncheva, Ludmilla I. (2004). Combining Pattern Classifiers: Methods and Algorithms. Hoboken, New
Jersey, USA: John Wiley & Sons.
CHAPTER 15
Building a General Purpose Data Pipeline
In this chapter, we detail an end-to-end analytical system using many of the techniques we discussed
throughout the book to provide an evaluation system the user may extend and edit to create their own
Hadoop data analysis system. Five basic strategies to use when developing data pipelines are discussed.
Then, we see how these strategies may be applied to build a general purpose data pipeline component.
Let’s look at a more real-world example of a general purpose data pipeline. One of the simplest useful
configurations is shown in Figure 15-2. It consists of a data source (in this case HDFS), a processing element
(in this case Mahout), and an output stage (in this case a D3 visualizer which is part of the accompanying Big
Data Toolkit).
Figure 15-2. A real-world distributed pipeline can consist of three basic elements
Our first example imports a data set into HDFS, performs some simple analytics processing using
Mahout, and passes the results of the analysis to a simple visualization component.
• Treat the "business logic" as a black box. Initially concentrate on data input
and output as well as the supporting technology stack. If the business logic is
relatively simple, already packaged as a library, and well-defined and straightforward to
implement, we can treat the business logic component as a self-contained module or
"plug-in." If the business logic requires hand-coding or is more complex, it deserves its
own incremental development and testing cycle before being integrated into the pipeline.
15.3.2 Middle-Out Development
Middle-out development means what it says: starting in the “middle” of the application construct and
working towards either end, which in our examples will always be the data sources at the beginning of the
process and the data sinks or final result repository at the end of the data pipeline. The “middle” we’re
developing first is essentially the “business logic” or “target algorithms” to be developed. We can start with
general technology stack considerations (such as the choice to use Hadoop, Spark, or Flink, for example, or a
hybrid approach using one or more of these).
We can use any of several readily available EIP diagram editors, such as the draw.io tool (draw.io) or
Omnigraffle (omnigraffle.com), to draw EIP diagrams. We can then use Spring Integration or Apache Camel
to implement the pipelines.
A full description of the EIP notation can be found in Hohpe and Woolf (2004).
The components shown in the abstract diagram Figure 15-4 can be implemented using Apache Camel
or Spring Integration. The two endpoints are data ingestion and data persistence, respectively. The small TV
screen–like symbol indicates a data visualization component and/or management console.
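As a minimal sketch of what such a Camel-based pipeline can look like in the Java DSL, consider the route below; the endpoint URIs are placeholders standing in for the real ingestion and persistence endpoints, and the processing steps in the middle would be replaced by your own transforms.
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class SimplePipelineRoute {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // ingestion endpoint -> (processing steps would go here) -> persistence endpoint
                from("file:data/input?noop=true")
                    .log("ingesting ${file:name}")
                    .to("file:data/output");
            }
        });
        context.start();
        Thread.sleep(10000);   // let the route run briefly for this demonstration
        context.stop();
    }
}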
Figure 15-5 shows a typical architecture for a rule-based data pipeline in which all the processing
components in the pipe are controlled by the rule-based workflow/data management component. Let’s look
at how such an architecture might be implemented.
Figure 15-6. An EIP diagram showing a different incarnation of the data pipeline
15.4 Summary
In this chapter, we discussed construction of a general purpose data pipeline. General purpose data
pipelines are an important starting point in big data analytical systems: both conceptually and in real world
application building. These general purpose pipelines serve as a staging area for more application-specific
extensions, as well as experimental proof-of-concept systems which may require more modification and
testing before they are developed further. Starting on a strong general-purpose technology base makes it
easier to perform re-work efficiently, and to “take a step back” if application requirements change.
Five basic pipeline building strategies were discussed: working from sources and sinks, middle-out
development (analytical stack-centric development), enterprise integration pattern (EIP) pipeline
development, rule-based messaging pipelines, and control + data (control flow) pipelining. Support
libraries, techniques, and code which supports these five general purpose pipelining strategies were also
discussed.
In the next and final chapter, we discuss directions for the future of big data analytics and what the
future evolution of this type of system might look like.
15.5 References
Hohpe, Gregor, and Woolf, Bobby. Enterprise Integration Patterns: Designing, Building, and Deploying
Messaging Solutions. Boston, MA: Addison-Wesley Publishing, 2004.
Ibsen, Claus, and Anstey, Jonathan. Camel in Action. Stamford, CT: Manning Publications, 2011.
Kavis, Michael. Architecting the Cloud: Design Decisions for Cloud Computing Service Models. Hoboken, NJ:
John Wiley and Sons, Inc., 2014.
Mak, Gary. Spring Recipes: A Problem-Solution Approach. New York, NY: Springer-Verlag, Apress
Publishing, 2008.
CHAPTER 16
Conclusions and the Future of Big Data Analysis
In this final chapter, we sum up what we have learned in the previous chapters and discuss some of the
developing trends in big data analytics, including “incubator” projects and “young” projects for data
analysis. We also speculate on what the future holds for big data analysis and the Hadoop ecosystem—
“Future Hadoop” (which may also include Apache Spark and others).
Please keep in mind that the 4Vs of big data (velocity, veracity, volume, and variety) will only become
larger and more complex over time. Our main conclusion: the scope, range, and effectiveness of big data
analytics solutions must also continue to grow accordingly in order to keep pace with the data available!
3. Be able to accommodate different programming languages appropriately, in as seamless a manner as possible. As a consequence of the need to choose a technology stack selectively, even some of the simplest applications are multi-language applications these days, and may contain Java, JavaScript, HTML, Scala, and Python components within one framework.
4. Select appropriate "glueware" for component integration, testing, and optimization/deployment. As we have seen in the examples throughout this book, "glueware" is almost as important as the components being glued! Fortunately for the developer, many components and frameworks exist for this purpose, including Spring Framework, Spring Data, Apache Camel, Apache Tika, and specialized packages such as Commons Imaging and others.
5. Last but not least, maintain a flexible and agile methodology to adapt systems to newly discovered requirements, data sets, changing technologies, and volume/complexity/quantity of data sources and sinks. Requirements will constantly change, as will support technologies. An adaptive approach saves time and rework in the long run.
In conclusion, we have come to believe that following the strategic approach to system building
outlined above will assist architects, developers, and managers achieve functional business analytics
systems which are flexible, scalable, and adaptive enough to accommodate changing technologies, as well as
being able to process challenging data sets, build data pipelines, and provide useful and eloquent reporting
capabilities, including the right data visualizations to express your results in sufficient detail.
1. In The Scholar and the Future of the Research Library, Fremont Rider describes his solution to the information explosion of the times. It's good reading for anyone interested in how fundamental technical problems reassert themselves in different forms over time.
We’ve come a long way from the perforated card and mechanical calculator, through microfilm
solutions like Rider’s and on to the electronic computer; but keep in mind that many computational and
analytical problems remain the same. As computational power increases, data volume and availability
(sensors of all kinds in great number putting out data) will require not only big data analytics, but a process
of so-called “sensor fusion,” in which different kinds of structured, semi-structured, and unstructured data
(signals, images, and streams of all shapes and sizes) must be integrated into a common analytical picture.
Drones and robot technology are two areas in which “future Hadoop” may shine, and robust sensor fusion
projects are already well underway.
Statistical analysis still has its place in the world of big data analysis, no matter how advanced software
and hardware components become. There will always be a place for “old school” visualization of statistics,
as shown in Figure 16-1 and Figure 16-2. As for the fundamental elements of classification, clustering,
feature analysis, identification of trends, commonalities, matching, etc., we can expect to see all these
basic techniques recast into more and more powerful libraries. Data and metadata formats—and, most
importantly, their standardization and adoption throughout the big data community—will allow us to evolve
the software programming paradigms for BDAs over the next few decades.
Figure 16-1. Different kinds of “old school” bar graphs can be used to summarize grouped data
Figure 16-2. “Old school” candlestick graphs can still be used to summarize chronological data
When we think about the current state of big data analysis, many questions immediately come to mind. One
immediate question is, when we solve a data analytics problem, how much ground do we have to cover? What
is the limit of business analytics as far as components go (keeping in mind our problem definition and scope)?
Where does business analytics end and other aspects of information technology and computer science begin?
Let's take a quick look at what "business analytics" really is, as far as components go. We might start
with a laundry list of components and functionalities like this:
1. Data Warehouse Components. Apache Hive started out as the go-to data warehousing technology for use with Hadoop, and is still intensively used by a vast number of software applications.
2. Business Intelligence (BI) Functionalities. The traditional definition of "business intelligence" (BI) includes data and process mining, predictive analytics, and event processing components, but in the era of distributed BI, may also include components involving simulation, deep learning, and complex model building. BI may offer a historical, current, or predictive view of data sets, and may assist in the domain of "operational analytics," the improvement of existing operations by application of BI solutions.
3. Enterprise Integration Management (EIM). EIM is assisted by the whole area of Enterprise Integration Patterns (EIPs). Many software components, including "glueware" such as Apache Camel, are based on implementation of all or most of the EIPs found in the classic book by Hohpe and Woolf, Enterprise Integration Patterns.2
2. Gregor Hohpe, Bobby Woolf. (2003). Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley. ISBN 978-0321200686.
4. Enterprise Performance Management (EPM). EPM is an area of great interest for some vendors, particularly Cloudera. One interesting and perceptive article about this is "3 Ways 'Big Data Analytics' Will Change Enterprise Performance Management," by Bernard Marr.3
5. Analytic Applications (Individual Components and Functionality). Many incubating and completely new libraries and frameworks await!
6. Key Functional Requirements: Governance, Risk, and Compliance Management with Auditing.
7. Security and integrated security consistently provided throughout the core, support ecosystem, and distributed analytics application. In the early days of Hadoop development, many components within the Hadoop ecosystem had inadequate security considerations. Data provenance and monitoring distributed systems in real time are only two of the challenges facing "future Hadoop," but they are important examples of the need for improved security measures throughout Hadoop- and Apache Spark-distributed systems.
Big data analytics capabilities will only continue to grow and prosper. Hardware and software
technologies, including a new renaissance of Artificial Intelligence research and innovation, contribute to
the machine learning and deep learning technologies so necessary to the further evolution of big data
analytical techniques. Open source libraries and thriving software communities make development of new
systems much easier, even when using off-the-shelf components.
3. http://www.smartdatacollective.com/bernardmarr/47669/3-ways-big-data-analytics-will-change-enterprise-performance-management
1. Workflow and Scheduling: Workflow and scheduling may be processed by Hadoop components like Oozie.
2. Query and Reporting Capabilities: Query and reporting capabilities could also include visualization and dashboard capabilities.
3. Security, Auditing, and Compliance: New incubating projects under the Apache umbrella address security, auditing, and compliance challenges within a Hadoop ecosystem. Examples of some of these security components include Apache Ranger (http://hortonworks.com/apache/ranger/), a Hadoop cluster security management tool.
4. Cluster Coordination: Cluster coordination is usually provided by frameworks such as ZooKeeper and library support for Apache Curator.
5. Distributed Storage: HDFS is not the only answer to distributed storage. Vendors like NetApp already use Hadoop connectors to the NFS storage system.4
6. NoSQL Databases: As we saw in Chapter 4, there are a wide variety of NoSQL database technologies to choose from, including MongoDB and Cassandra. Graph databases such as Neo4j and Giraph are also popular NoSQL frameworks with their own libraries for data transformation, computation, and visualization.
7. Data Integration Capabilities: Data integration and glueware also continue to evolve to keep pace with different data formats, legacy programs and data, relational and NoSQL databases, and data stores such as Solr/Lucene.
8. Machine Learning: Machine learning and deep learning techniques have become an important part of the computation module of any BDA.
9. Scripting Capabilities: Scripting capabilities in advanced languages such as Python are developing at a rapid rate, as are interactive shells or REPLs (read-eval-print loops). Even the venerable Java language includes a REPL in version 9.
10. Monitoring and System Management: The basic capabilities found in Ganglia, Nagios, and Ambari for monitoring and managing systems will continue to evolve. Some of the newer entries for system monitoring and management include Cloudera Manager (http://www.cloudera.com/products/cloudera-manager.html).
4. See http://www.netapp.com/us/solutions/big-data/nfs-connector-hadoop.aspx for more information about the NetApp NFS Connector for Hadoop.
* For the latest information on Apache Horn, see the incubation site at https://horn.incubator.apache.org.
16.8 Final Words
While we're considering the fate of "future Hadoop," let's keep in mind the issues and challenges facing today's big data technologies.
Some of these challenges are:
1. Availability of Mature Predictive Analytics: Being able to predict future data from existing data has always been a goal of business analytics, but much research and system building remains to be done.
2. Images and Signals as Big Data Analytics: We dived into the "images as big data" concept in Chapter 14, and, as noted there, work is just beginning on these complex data sources, which of course include time series data and "signals" from a variety of different sensors: LIDAR, chemical sensors for forensic analysis, sensors for medical and industrial applications, accelerometer and tilt sensor data from vehicles, and many others.
3. Even Bigger Velocity, Variety, and Volume of Input Source Data: The required speed of data processing, the variety and level of structure and complexity (or the lack of it!) of the data, and the raw volume of data will all become more and more demanding as hardware and software become able to deal with the increased architectural challenges and the more demanding problem sets that "future data analysis" will bring.
4. Combining Disparate Types of Data Sources into a Unified Analysis: "Sensor fusion" is only one aspect of combining data into one "unified picture" of the data landscape being measured and mapped by the sensors. The evolution of distributed AI and machine learning, and the relatively new area of "deep learning," provide potential paths to moving beyond simple aggregation and fusion of different data sources by supplying meaning, context, and prediction along with the raw statistical analyses of the data. This will enable sophisticated model building, data understanding systems, and advanced decision system applications.
5. The Merging of Artificial Intelligence (AI) and Big Data Analytics: AI, big data, and data analytics have always co-existed, even from the earliest history of AI systems. Advances in distributed machine learning (ML) and deep learning (DL) have blurred the lines between these areas even more in recent years. Deep learning libraries, such as Deeplearning4j, are routinely used in BDA applications these days, and many useful application solutions have been proposed in which AI components are integrated seamlessly with BDAs.
6. Infrastructure and Low-Level Support Library Evolution (Including Security): Infrastructure support toolkits for Hadoop-based applications typically include Oozie, Azkaban, Schedoscope, and Falcon. Low-level support and integration libraries include Apache Tika, Apache Camel, and Spring Data and the Spring Framework itself, among others. Specialized security components for Hadoop, Spark, and their ecosystems include Accumulo, Apache Sentry, Apache Knox Gateway, and many other recent contributions.
It's a good time to be in the big data analysis arena, whether you are a programmer, architect, manager, or analyst. Many interesting and game-changing future developments await. Hadoop is often seen as one stage in the evolution toward ever more powerful distributed analytic systems. Whether this evolution moves on to something other than "Hadoop as we know it," or the Hadoop system we already know evolves its ecosystem to process more data in better ways, distributed big data analytics is here to stay, and Hadoop is a major player in the current computing scene. We hope you have enjoyed this survey of big data analysis techniques using Hadoop as much as we have enjoyed bringing it to you.
APPENDIX A
Setting Up the Distributed Analytics Environment
This appendix is a step-by-step guide to setting up a single machine for stand-alone distributed analytics experimentation and development, using the Hadoop ecosystem and associated tools and libraries.
Of course, in a production distributed environment, a cluster of server resources might be available to provide support for the Hadoop ecosystem. Databases, data sources and sinks, and messaging software might be spread across the hardware installation, especially those components that have a RESTful interface and may be accessed through URLs. Please see the references listed at the end of the appendix for a thorough explanation of how to configure Hadoop, Spark, Flink, and Phoenix, and be sure to refer to the appropriate info pages online for current information about these support components.
Most of the instructions given here are hardware-agnostic. The instructions are, however, especially suited to a MacOS environment.
A final note about running Hadoop-based programs in a Windows environment: while this is possible and is sometimes discussed in the literature and online documentation, it is recommended that most components be run in a Linux or MacOS-based environment.
Once the initial basic components, such as Java, Maven, and your favorite IDE, are installed, the other components may be gradually added to the system as you configure and test it, as discussed in the following sections.
Figure A-1. First step: validate Java is in place and has the correct version
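A quick way to perform this check from the terminal is sketched below. The exact version string will vary with the JDK installed; the assumption here (ours, not a requirement stated above) is a Java 8 (1.8.x) JDK.

java -version
# Expect output reporting the installed JDK, for example a version string beginning with "1.8.0"

If the command is not found, or reports an older version, install or upgrade the JDK before continuing.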
Next, download the Eclipse IDE from the Eclipse web site. Please note that we used the "Mars" version of the IDE for the development described in this book:
http://www.eclipse.org/downloads/packages/eclipse-ide-java-ee-developers/marsr
Finally, download the compressed Maven distribution from the Maven web site: https://maven.apache.org/download.cgi
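One possible way to unpack the downloaded archive and put Maven on the PATH is sketched below; the version number and install directory are examples only and should be adjusted to the release actually downloaded.

# Example only: unpack the Maven archive and add its bin directory to the PATH
tar xzf apache-maven-3.3.9-bin.tar.gz
sudo mv apache-maven-3.3.9 /usr/local/apache-maven
export PATH=$PATH:/usr/local/apache-maven/bin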
Validate correct Maven installation with
mvn --version
On the command line, you should see a result similar to the terminal output in Figure A-2.
# Check whether passwordless ssh to the local machine already works:
ssh localhost
# If a password is requested, generate an RSA key pair and authorize it for local logins:
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
There are many online documents, as well as several of the standard Hadoop references, with complete instructions on using ssh appropriately with Hadoop.
Add the following property to core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Also add the following properties to hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
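With core-site.xml and hdfs-site.xml in place, the namenode can be formatted and the HDFS daemons started. The following is only a minimal sketch using the standard Hadoop commands; it assumes the Hadoop bin and sbin directories are on the PATH and that the directories configured above exist and are writable.

# Format the namenode once, before the very first start (this erases any existing HDFS metadata)
hdfs namenode -format
# Start the HDFS daemons (namenode, datanode, secondary namenode)
start-dfs.sh
# Verify that the daemon processes are running
jps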
A sample configuration file for Zookeeper is provided with the download. Place the appropriate
configuration values in the file conf/zoo.cfg .
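For a stand-alone installation, a minimal conf/zoo.cfg might look like the sketch below; the dataDir path is an example only, and 2181 is simply ZooKeeper's conventional default client port.

# Example stand-alone conf/zoo.cfg (adjust dataDir to a writable location)
tickTime=2000
dataDir=/usr/local/zookeeper/data
clientPort=2181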
Start the Zookeeper server with the command
bin/zkServer.sh start
Run the Zookeeper CLI (REPL) to make sure you can do simple operations using Zookeeper, as in Figure A-3.
Try some simple commands in the Zookeeper CLI to ensure it's functioning properly. Executing
ls /
create /zk_test my_data
get /zk_test
mv metastore_db metastore_db.tmp
on the command line. The successful result will be similar to that shown in Figure A-7.
13. Send some messages from the console to test the message-sending functionality. Type:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic ProHadoopBDA0
14. Now type some messages into the console.
15. You can configure a multi-broker cluster by modifying the appropriate config files. Check the Apache Kafka documentation for step-by-step instructions on how to do this.
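To confirm that the messages were actually published, a console consumer can be started in a second terminal window. The command below is a sketch for the same single-broker setup and topic name used above; older Kafka releases use --zookeeper localhost:2181 in place of --bootstrap-server for the console consumer.

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic ProHadoopBDA0 --from-beginning

The messages typed into the producer console should appear in the consumer console.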
Figure A-9. Successful running of Hadoop configuration script and test with printenv
Use “printenv” on the command line to verify default environment variable settings on start-up of a
terminal window, as shown in Figure A-9.
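The environment variables themselves are typically set in a shell start-up file such as ~/.bash_profile. The snippet below is only an example; the installation paths shown are assumptions and should be replaced with the locations used on your machine.

# Example ~/.bash_profile entries (paths are placeholders, not required locations)
export HADOOP_HOME=/usr/local/hadoop
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin

After editing the file, run source ~/.bash_profile (or open a new terminal window) and use printenv to confirm the values.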
References
Liu, Henry H. Spring 4 for Developing Enterprise Applications: An End-to-End Approach. Self-published: PerfMath, http://www.perfmath.com, 2014.
Venner, Jason. Pro Hadoop. New York, NY: Apress, 2009.
APPENDIX B
The example system supplied with this book is a standard Maven project and may be used with a standard
Java development IDE, such as Eclipse, IntelliJ, or NetBeans. All the required dependencies are included
in the top-level pom.xml file. Download the compressed project from the URL indicated. Uncompress and
import the project into your favorite IDE. Refer to the README file included with the example system for
additional version and configuration information, as well as additional troubleshooting tips and up-to-date
URL pointers. The current version information of many of the software components can be found in the
VERSION text file accompanying the software.
Some standard infrastructure components, such as databases, build tools (such as Maven itself, an appropriate version of Java, and the like), and optional components (such as some of the computer vision–related "helper" libraries), must be installed on a new system before the project can be used successfully.
Components such as Hadoop, Spark, Flink, and ZooKeeper should run independently, and the environment
variables for these must be set correctly (HADOOP_HOME, SPARK_HOME, etc.). Please refer to some of the
references given below to install standard software components such as Hadoop.
In particular, check your environment variable PROBDA_HOME by doing a “printenv” command on
the command line, or its equivalent.
For required environment variable settings and their default values, please refer to Appendix A.
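For example, on MacOS or Linux the variable can be checked and, if necessary, set as sketched below; the path shown is purely illustrative and should point at wherever the example project was uncompressed.

# Check the current value (prints nothing if the variable is unset)
printenv PROBDA_HOME
# Example only: point PROBDA_HOME at the uncompressed example project
export PROBDA_HOME=$HOME/probda-example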
Run the system by executing the Maven command on the command line after cd’ing to the source
directory.
cd $PROBDA_HOME
mvn clean install -DskipTests