Streaming Data: Understanding the real-time pipeline
Ebook · 398 pages · 4 hours


About this ebook

Summary

Streaming Data introduces the concepts and requirements of streaming and real-time data systems. The book is an idea-rich tutorial that teaches you to think about how to efficiently interact with fast-flowing data.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the Technology

As humans, we're constantly filtering and deciphering the information streaming toward us. In the same way, streaming data applications can accomplish amazing tasks like reading live location data to recommend nearby services, tracking faults with machinery in real time, and sending digital receipts before your customers leave the shop. Recent advances in streaming data technology and techniques make it possible for any developer to build these applications if they have the right mindset. This book will let you join them.

About the Book

Streaming Data is an idea-rich tutorial that teaches you to think about efficiently interacting with fast-flowing data. Through relevant examples and illustrated use cases, you'll explore designs for applications that read, analyze, share, and store streaming data. Along the way, you'll discover the roles of key technologies like Spark, Storm, Kafka, Flink, RabbitMQ, and more. This book offers the perfect balance between big-picture thinking and implementation details.

What's Inside

  • The right way to collect real-time data
  • Architecting a streaming pipeline
  • Analyzing the data
  • Which technologies to use and when

About the Reader

Written for developers familiar with relational database concepts. No experience with streaming or real-time applications required.

About the Author

Andrew Psaltis is a software engineer focused on massively scalable real-time analytics.

Table of Contents

    PART 1 - A NEW HOLISTIC APPROACH
  1. Introducing streaming data
  2. Getting data from clients: data ingestion
  3. Transporting the data from collection tier: decoupling the data pipeline
  4. Analyzing streaming data
  5. Algorithms for data analysis
  6. Storing the analyzed or collected data
  7. Making the data available
  8. Consumer device capabilities and limitations accessing the data
    PART 2 - TAKING IT REAL WORLD
  9. Analyzing Meetup RSVPs in real time
Language: English
Publisher: Manning
Release date: May 31, 2017
ISBN: 9781638357247



    Streaming Data - Andrew Psaltis

    Copyright

    For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

           Special Sales Department

           Manning Publications Co.

           20 Baldwin Road

           PO Box 761

           Shelter Island, NY 11964

           Email: [email protected]

    ©2017 by Manning Publications Co. All rights reserved.

    No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

    Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

    Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

    Development editor: Karen Miller

    Technical development editor: Gregor Zurowski

    Project editor: Janet Vail

    Copyeditor: Corbin Collins

    Proofreader: Elizabeth Martin

    Technical proofreader: Al Krinker

    Typesetter: Dennis Dalinnik

    Cover designer: Marija Tudor

    ISBN: 9781617292286

    Printed in the United States of America

    1 2 3 4 5 6 7 8 9 10 – EBM – 22 21 20 19 18 17

    Brief Table of Contents

    Copyright

    Brief Table of Contents

    Table of Contents

    Preface

    Acknowledgments

    About this Book

    1. A new holistic approach

    Chapter 1. Introducing streaming data

    Chapter 2. Getting data from clients: data ingestion

    Chapter 3. Transporting the data from collection tier: decoupling the data pipeline

    Chapter 4. Analyzing streaming data

    Chapter 5. Algorithms for data analysis

    Chapter 6. Storing the analyzed or collected data

    Chapter 7. Making the data available

    Chapter 8. Consumer device capabilities and limitations accessing the data

    2. Taking it real world

    Chapter 9. Analyzing Meetup RSVPs in real time

     The streaming data architectural blueprint

    Index

    List of Figures

    List of Tables

    List of Listings

    Table of Contents

    Copyright

    Brief Table of Contents

    Table of Contents

    Preface

    Acknowledgments

    About this Book

    1. A new holistic approach

    Chapter 1. Introducing streaming data

    1.1. What is a real-time system?

    1.2. Differences between real-time and streaming systems

    1.3. The architectural blueprint

    1.4. Security for streaming systems

    1.5. How do we scale?

    1.6. Summary

    Chapter 2. Getting data from clients: data ingestion

    2.1. Common interaction patterns

    2.1.1. Request/response pattern

    2.1.2. Request/acknowledge pattern

    2.1.3. Publish/subscribe pattern

    2.1.4. One-way pattern

    2.1.5. Stream pattern

    2.2. Scaling the interaction patterns

    2.2.1. Request/response optional pattern

    2.2.2. Scaling the stream pattern

    2.3. Fault tolerance

    2.3.1. Receiver-based message logging

    2.3.2. Sender-based message logging

    2.3.3. Hybrid message logging

    2.4. A dose of reality

    2.5. Summary

    Chapter 3. Transporting the data from collection tier: decoupling the data pipeline

    3.1. Why we need a message queuing tier

    3.2. Core concepts

    3.2.1. The producer, the broker, and the consumer

    3.2.2. Isolating producers from consumers

    3.2.3. Durable messaging

    3.2.4. Message delivery semantics

    3.3. Security

    3.4. Fault tolerance

    3.5. Applying the core concepts to business problems

    Finance: fraud detection

    Internet of Things: a tweeting Coke machine

    E-commerce: product recommendations

    3.6. Summary

    Chapter 4. Analyzing streaming data

    4.1. Understanding in-flight data analysis

    4.2. Distributed stream-processing architecture

    A generalized architecture

    Apache Spark Streaming

    Apache Storm

    Apache Flink

    Apache Samza

    4.3. Key features of stream-processing frameworks

    4.3.1. Message delivery semantics

    State management

    Fault tolerance

    4.4. Summary

    Chapter 5. Algorithms for data analysis

    5.1. Accepting constraints and relaxing

    5.2. Thinking about time

    Stream time vs. event time

    Windows of time

    5.2.1. Sliding window

    Example usage

    Framework support

    5.2.2. Tumbling window

    Example use

    Framework support

    5.3. Summarization techniques

    5.3.1. Random sampling

    5.3.2. Counting distinct elements

    5.3.3. Frequency

    5.3.4. Membership

    5.4. Summary

    Chapter 6. Storing the analyzed or collected data

    6.1. When you need long-term storage

    Direct writing

    Indirect writing

    6.2. Keeping it in-memory

    6.2.1. Embedded in-memory/flash-optimized

    6.2.2. Caching system

    Read-through

    Refresh-ahead

    Write-through

    Write-around

    Write-back (write-behind)

    6.2.3. In-memory database and in-memory data grid

    6.3. Use case exercises

    6.3.1. In-session personalization

    Embedded in-memory/flash-optimized

    Caching system

    IMDB or IMDG

    Taking it to the next level

    6.3.2. Next-generation energy company

    6.4. Summary

    Chapter 7. Making the data available

    7.1. Communications patterns

    7.1.1. Data Sync

    Benefits

    Drawbacks

    7.1.2. Remote Method Invocation and Remote Procedure Call

    Benefits

    Drawbacks

    7.1.3. Simple Messaging

    Benefits

    Drawbacks

    7.1.4. Publish-Subscribe

    Benefits

    Drawbacks

    7.2. Protocols to use to send data to the client

    7.2.1. Webhooks

    7.2.2. HTTP Long Polling

    7.2.3. Server-sent events

    7.2.4. WebSockets

    7.3. Filtering the stream

    7.3.1. Where to filter

    7.3.2. Static vs. dynamic filtering

    7.4. Use case: building a Meetup RSVP streaming API

    7.5. Summary

    Chapter 8. Consumer device capabilities and limitations accessing the data

    8.1. The core concepts

    UI/end-user application

    Integration with third-party/stream processors

    8.1.1. Reading fast enough

    Third-party streaming API

    Your streaming API

    8.1.2. Maintaining state

    8.1.3. Mitigating data loss

    8.1.4. Exactly-once processing

    8.2. Making it real: SuperMediaMarkets

    8.3. Introducing the web client

    8.3.1. Integrating with the streaming API service

    8.4. The move toward a query language

    8.5. Summary

    2. Taking it real world

    Chapter 9. Analyzing Meetup RSVPs in real time

    9.1. The collection tier

    9.1.1. Collection service data flow

    9.2. Message queuing tier

    9.2.1. Installing and configuring Kafka

    9.2.2. Integrating the collection service and Kafka

    9.3. Analysis tier

    9.3.1. Installing Storm and preparing Kafka

    9.3.2. Building the top n Storm topology

    9.3.3. Integrating analysis

    9.4. In-memory data store

    9.5. Data access tier

    9.5.1. Taking it to production

    9.6. Summary

     The streaming data architectural blueprint

    Index

    List of Figures

    List of Tables

    List of Listings

    Preface

    For as long as I can remember, I have been fascinated with speed as it relates to computing and am always trying to find a way to do something faster. In the late 1990s, when I spent most of my time writing software in C++, my favorite keyword was __asm, which means the following block of code is in assembly language, and I understood what was happening at the machine level. I worked on mobile software in the early 2000s and again the story was how could we sync data faster or make things run faster on the PalmPilots and Windows CE devices we were using? At the time we had huge (by that day’s standards, anyway) medical databases (around 25–50 MB in size) that required external cards on a PalmPilot to store and several applications that needed to provide interactive speed when searching and browsing the data.

    As data volumes started to grow in the industries I was working in, I found myself at the perfect intersection of large data sets and speed to business insight. The data was growing in volume and being generated at faster and faster speeds, and business wanted answers to questions in shorter and shorter timeframes from the time data was being generated. To me, it was the perfect marriage: large data and a need for speed. Around 2001 I began to work on marketing analytics and e-commerce applications, where data was continuously being updated and we needed to provide insight into it in near real time. In 2009 I started working at Webtrends, where my love for speed and delivering insight at speed really matured. At Webtrends, analytics was our core business, and the idea of real-time analytics was just starting to catch the interest of our customers. The first project I worked on aimed to deliver key metrics in a dashboard within five minutes of a clickstream event happening anywhere in the world. At the time, that was pushing the envelope.

    In 2011 I was part of an emerging products team. Our mission was to continue to push the idea of real-time analytics and try to disrupt our industry. After spending time researching, prototyping, and thinking through our next step, a perfect storm occurred. We had been looking at Apache Kafka, and then in September 2011 Apache Storm was open sourced. We immediately started to run like crazy with it. By winter we had early-adopter customers looking at what we were building. At that point we never looked back and set our sights on delivering on a Service Level Agreement (SLA) that was, in essence: From click to dashboard in three seconds or less, globally! After many months and a lot of work by what became a much larger team, we delivered on our promise and won the Digital Analytics New Technology of the Year award (www.digitalanalyticsassociation.org/awards2013). I was deeply involved in building and architecting all aspects of this solution, from the data collection to the initial UI (which was affectionately called Bare Bones, due to my lack of UI skills).

    We continued our pursuit and began looking at Spark Streaming when it was still part of the Berkeley AMPLab. Since those days I have continued to pursue building more and more streaming systems that deliver on the ultimate goal of delivering insights at the speed of thought. Today I continue to speak internationally on the topic and work with companies across the globe in designing, building, and solving streaming problems.

    Even today I still see a widespread lack of understanding of all the pieces that go into building and delivering a streaming system. You can usually find references to pieces of the stack, but rarely do you find out how to think through the entire stack and understand each of the tiers.

    It is therefore with great pleasure that I have tried in this book to share and distill this real-world experience and knowledge. My goal has been to provide a solid foundation from which you can build and explore a complete streaming system.

    Acknowledgments

    First, I want to thank my family for their support during the writing of this book. There were many weekends and nights of Sorry, I can’t help with the garden (or play lacrosse or go to the get-together)—I need to write. I’m sure that wasn’t easy for my children to hear; nor was it always easy for my wife to buffer and pick up my slack. Through all the highs and lows that go into this process their support never wavered and they remained a constant source of encouragement and inspiration. For this I owe a tremendous debt of gratitude to my wife and children; a simple thank you cannot express it enough.

    Thanks to Karen, my development editor, for her endless patience, understanding, and willingness to always talk things through with me throughout this entire journey. To Robin, my acquisition editor, for believing in me, nurturing the idea of this book, and being a sounding board to make sure the train was staying on the tracks during some rough patches in the early days. To Bert, for his teachings on how to tell a story, how to find the right level of depth with a narrative, and pedagogical insight into the construction of a technical book. To my technical development editor Gregor, whose very thoughtful and insightful feedback helped craft this book into what it is today. Lastly, but certainly not least, thanks to the entire Manning team for the fantastic effort to finally get us to this point.

    Thanks also to all the people who bought and read early versions of the manuscript through the MEAP early access program, to those who contributed to the Author Online forum, and to the countless reviewers for their invaluable feedback, including Andrew Gibson, Dr. Tobias Bürger, Jake McCrary, Rodrigo Abreu, Andy Keffalas, John Guthrie, Kosmas Chatzimichalis, Giuliano Bertoti, Carlos Curotto, Andy Kirsch, Douglas Duncan, Jeff Smith, Sergio Fernández González, Jaromir D.B. Nemec, Jose Samonte, Jan Nonnen, Romit Singhai, Chris Allan, Jonathan Thoms, Steven Jenkins, Lee Gilbert, Amandeep Khurana, and Charlie Gaines. Without all of you, this book wouldn't be what it is today.

    Many others contributed in various different ways. I can’t mention everyone by name because the acknowledgments would just roll on and on, but a big thank you goes out to everyone else who had a hand in helping make this possible!

    About this Book

    The world of real-time systems has been around for a long time; for many years real-time and/or streaming was solely the domain of hardware real-time systems. Those are systems where if an SLA isn’t met, there is potential loss of life. Over the last decade near-real-time systems have emerged and grown at an amazing rate. Everywhere you look you can find examples of data streaming: social media, games, smart cities, smart meters, your new washing machine, and the list goes on. Consider the following: Today if a byte of data were a gallon of water, an average home would be filled within 10 seconds; by the year 2020, it will only take 2 seconds. Making sense of and using such a deluge of data means building streaming systems.

    Focusing on the big ideas of streaming and real-time data, the goals of this book are two-fold: The first objective is to teach you how to think about the entire pipeline so you’re equipped with the skills to not only build a streaming system but also understand the tradeoffs at every tier. Secondly, this book is meant to provide a solid launching point for you to delve deeper into each tier, as your business needs require or as your interest pulls you.

    How to use this book

    Although this book was designed to be read from start to finish, each chapter provides enough information that you can read and understand it on its own. Therefore, if you want to understand a particular tier, you should feel comfortable jumping straight to that chapter and then using what you learn there as your base for deeper exploration of the other chapters.

    Who should read this book

    This book is perfect for developers or architects and has been written to be easily accessible to technical managers and business decision makers; no prior experience with streaming or real-time data systems is required. The only technical requirement this book makes is that you should feel comfortable reading Java: the source code in the book, like the example code that accompanies chapter 9, is written in Java.

    Roadmap

    The roadmap of this book is represented in figure 1. A synopsis of each chapter follows.

    Figure 1. Architectural blueprint with chapter mappings

    Chapter 1 introduces the architectural blueprint of the book, which tells you where we are in the pipeline and serves as a great map if you need to jump from tier to tier. After laying out this blueprint, chapter 1 defines a real-time system, explores the differences between real-time and in-the-moment systems, and briefly touches on the importance of security (which could be its own book).

    Chapter 2 explores all aspects of collecting data for a streaming system, from the interaction patterns through scaling and fault-tolerance techniques. This chapter covers all the relevant aspects of the collection tier and prepares you to build a scalable and reliable tier.

    Chapter 3 is all about how to decouple the data being collected from the data being analyzed by using a message queuing tier in the middle. You will learn why you need a message queuing tier, how to understand message durability and different message delivery semantics, and how to choose the right technology for your business problem.
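The decoupling idea behind chapter 3 can be previewed with a small sketch that is not from the book: here a bounded in-process queue stands in for the message queuing tier (in the book's pipeline this role is played by a broker such as Kafka or RabbitMQ), and the class and method names are illustrative only. The collection side and the analysis side share only the queue, never each other, and back-pressure falls out of the bounded capacity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DecouplingSketch {
    // Run one producer and one consumer that share only the queue,
    // and return everything the consumer saw, in order.
    static List<String> runPipeline(int events) throws InterruptedException {
        // The bounded queue plays the role of the message queuing tier.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
        List<String> processed = new ArrayList<>();

        Thread producer = new Thread(() -> {
            for (int i = 0; i < events; i++) {
                try {
                    queue.put("event-" + i); // blocks when the consumer falls behind
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        producer.start();
        for (int i = 0; i < events; i++) {
            processed.add(queue.take()); // consumer side: pull at its own pace
        }
        producer.join();
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runPipeline(3)); // [event-0, event-1, event-2]
    }
}
```

A real broker adds what this sketch omits, and what chapter 3 covers: durability across restarts and a choice of delivery semantics.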

    Chapter 4 dives into the common architectural patterns of distributed stream-processing frameworks, covering topics such as what message delivery semantics mean for this tier, how state is commonly handled, and what fault tolerance is and why we need it.

    Chapter 5 jumps from discussing architecture to querying a stream, the problems with time, and the four popular summarization techniques. If chapter 4 is the what for distributed stream-processing engines, chapter 5 is the how.
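As a taste of the windowing material in chapter 5, the following sketch (not taken from the book; the names are illustrative) counts events per tumbling window, the fixed, non-overlapping kind: each event timestamp is bucketed by the start of the window that contains it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindowDemo {
    // Count events per fixed, non-overlapping window of the given width,
    // keyed by each window's start timestamp (event time, in millis).
    static Map<Long, Integer> tumblingCounts(List<Long> eventTimesMillis, long windowMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimesMillis) {
            long windowStart = (t / windowMillis) * windowMillis; // bucket the event
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three events in the first 10-second window, two in the second.
        List<Long> events = Arrays.asList(1_000L, 4_000L, 9_999L, 10_000L, 15_000L);
        System.out.println(tumblingCounts(events, 10_000L)); // {0=3, 10000=2}
    }
}
```

A sliding window differs only in that windows overlap, so one event contributes to several counts; chapter 5 walks through both, along with the stream-time versus event-time distinction this sketch glosses over.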

    Chapter 6 discusses options for storing data in-memory during and post analysis. It doesn’t spend much time discussing disk-based long-term storage solutions because they’re often used out of band of a streaming analysis and don’t offer the performance of the in-memory stores.

    Chapter 7 is where we start to discuss what to do with the data we have collected and analyzed. It talks about communications patterns and protocols used for sending data to a streaming client. Along the way we’ll find out how to match up our business requirements to the various protocols and how to choose the right one.

    Chapter 8 explores concepts to keep in mind when building a streaming client. This is not a chapter on just building an HTML web app; it goes much deeper into lower-level things to consider when designing the client side of a streaming system.

    Chapter 9 . . . at this point, if you have read all the way through, congrats! A lot of material is covered in the first eight chapters. Chapter 9 is where we make it all come to life. Here we build a complete streaming data pipeline and discuss taking our sample to production.

    About the code

    All the code shown in the final chapter of this book can be found in the sample source code that accompanies this book. You can download the sample code free of charge from the Manning website at www.manning.com/books/streaming-data. You may also find the code on GitHub at https://github.com/apsaltis/StreamingData-Book-Examples.

    The sample code is structured as separate Maven projects, one for each of the tiers we walk through in chapter 9. Instructions for building and running the software are provided during the walkthrough in chapter 9.

    All source code in listings or in the text is in a fixed-width font like this to separate it from ordinary text. In some listings, the code is annotated to point out the key concepts.

    About the author

    Andrew Psaltis is deeply entrenched in streaming systems and obsessed with delivering insight at the speed of thought. He spends most of his waking hours thinking about, writing about, and building streaming systems. He helps customers of all sizes build and/or fix complex streaming systems, speaks around the globe about streaming, and teaches others how to build streaming systems. When he’s not busy being busy, he’s spending time with his lovely wife, two kids, and watching as much lacrosse as possible.

    Author Online

    The purchase of Streaming Data includes free access to a private forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and other users. To access and subscribe to the forum, point your browser to www.manning.com/books/streaming-data. This page provides information on how to get on the forum once you’re registered, what kind of help is available, and the rules of conduct in the forum.

    Manning’s commitment to our readers is to provide a venue where meaningful dialogue between individual readers and between readers and the author can take place. It’s not a commitment to any specific amount of participation on the part of the author, whose contribution to the book’s forum remains voluntary (and unpaid). We suggest you try asking him challenging questions, lest his interest stray!

    The Author Online forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

    About the cover illustration

    The figure on the cover of Streaming Data is captioned Habit of a Moor of Morocco in winter in 1695. The illustration is taken from Thomas Jefferys' A Collection of the Dresses of Different Nations, Ancient and Modern (four volumes), London, published between 1757 and 1772. The title page states that these are hand-colored copperplate engravings, heightened with gum arabic. Thomas Jefferys (1719–1771) was called Geographer to King George III. He was an English cartographer who was the leading map supplier of his day. He engraved and printed maps for government and other official bodies and produced a wide range of commercial maps and atlases, especially of North America. His work as a mapmaker sparked an interest in local dress customs of the lands he surveyed and mapped, which are brilliantly displayed in this collection.

    Fascination with faraway lands and travel for pleasure were relatively new phenomena in the late 18th century and collections such as this one were popular, introducing both the tourist as well as the armchair traveler to the inhabitants of other countries. The diversity of the drawings in Jefferys’ volumes speaks vividly of the uniqueness and individuality of the world’s nations some 200 years ago. Dress codes have changed since then and the diversity by region and country, so rich at the time, has faded away. It is now often hard to tell the inhabitant of one continent from another. Perhaps, trying to view it optimistically, we have traded a cultural and visual diversity for a more varied personal life. Or a more varied and interesting intellectual and technical life.

    At a time when it is hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Jefferys' pictures.
