Streaming Data: Understanding the real-time pipeline
About this ebook
Streaming Data introduces the concepts and requirements of streaming and real-time data systems. The book is an idea-rich tutorial that teaches you to think about how to efficiently interact with fast-flowing data.
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the Technology
As humans, we're constantly filtering and deciphering the information streaming toward us. In the same way, streaming data applications can accomplish amazing tasks like reading live location data to recommend nearby services, tracking faults with machinery in real time, and sending digital receipts before your customers leave the shop. Recent advances in streaming data technology and techniques make it possible for any developer to build these applications if they have the right mindset. This book will let you join them.
About the Book
Streaming Data is an idea-rich tutorial that teaches you to think about efficiently interacting with fast-flowing data. Through relevant examples and illustrated use cases, you'll explore designs for applications that read, analyze, share, and store streaming data. Along the way, you'll discover the roles of key technologies like Spark, Storm, Kafka, Flink, RabbitMQ, and more. This book offers the perfect balance between big-picture thinking and implementation details.
What's Inside
- The right way to collect real-time data
- Architecting a streaming pipeline
- Analyzing the data
- Which technologies to use and when
About the Reader
Written for developers familiar with relational database concepts. No experience with streaming or real-time applications required.
About the Author
Andrew Psaltis is a software engineer focused on massively scalable real-time analytics.
Table of Contents
PART 1 - A NEW HOLISTIC APPROACH
- Introducing streaming data
- Getting data from clients: data ingestion
- Transporting the data from collection tier: decoupling the data pipeline
- Analyzing streaming data
- Algorithms for data analysis
- Storing the analyzed or collected data
- Making the data available
- Consumer device capabilities and limitations accessing the data
PART 2 - TAKING IT REAL WORLD
- Analyzing Meetup RSVPs in real time
Streaming Data - Andrew Psaltis
Copyright
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email:
©2017 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
Development editor: Karen Miller
Technical development editor: Gregor Zurowski
Project editor: Janet Vail
Copyeditor: Corbin Collins
Proofreader: Elizabeth Martin
Technical proofreader: Al Krinker
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor
ISBN: 9781617292286
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – EBM – 22 21 20 19 18 17
Brief Table of Contents
Copyright
Brief Table of Contents
Table of Contents
Preface
Acknowledgments
About this Book
1. A new holistic approach
Chapter 1. Introducing streaming data
Chapter 2. Getting data from clients: data ingestion
Chapter 3. Transporting the data from collection tier: decoupling the data pipeline
Chapter 4. Analyzing streaming data
Chapter 5. Algorithms for data analysis
Chapter 6. Storing the analyzed or collected data
Chapter 7. Making the data available
Chapter 8. Consumer device capabilities and limitations accessing the data
2. Taking it real world
Chapter 9. Analyzing Meetup RSVPs in real time
The streaming data architectural blueprint
Index
List of Figures
List of Tables
List of Listings
Table of Contents
Copyright
Brief Table of Contents
Table of Contents
Preface
Acknowledgments
About this Book
1. A new holistic approach
Chapter 1. Introducing streaming data
1.1. What is a real-time system?
1.2. Differences between real-time and streaming systems
1.3. The architectural blueprint
1.4. Security for streaming systems
1.5. How do we scale?
1.6. Summary
Chapter 2. Getting data from clients: data ingestion
2.1. Common interaction patterns
2.1.1. Request/response pattern
2.1.2. Request/acknowledge pattern
2.1.3. Publish/subscribe pattern
2.1.4. One-way pattern
2.1.5. Stream pattern
2.2. Scaling the interaction patterns
2.2.1. Request/response optional pattern
2.2.2. Scaling the stream pattern
2.3. Fault tolerance
2.3.1. Receiver-based message logging
2.3.2. Sender-based message logging
2.3.3. Hybrid message logging
2.4. A dose of reality
2.5. Summary
Chapter 3. Transporting the data from collection tier: decoupling the data pipeline
3.1. Why we need a message queuing tier
3.2. Core concepts
3.2.1. The producer, the broker, and the consumer
3.2.2. Isolating producers from consumers
3.2.3. Durable messaging
3.2.4. Message delivery semantics
3.3. Security
3.4. Fault tolerance
3.5. Applying the core concepts to business problems
Finance: fraud detection
Internet of Things: a tweeting Coke machine
E-commerce: product recommendations
3.6. Summary
Chapter 4. Analyzing streaming data
4.1. Understanding in-flight data analysis
4.2. Distributed stream-processing architecture
A generalized architecture
Apache Spark Streaming
Apache Storm
Apache Flink
Apache Samza
4.3. Key features of stream-processing frameworks
4.3.1. Message delivery semantics
State management
Fault tolerance
4.4. Summary
Chapter 5. Algorithms for data analysis
5.1. Accepting constraints and relaxing
5.2. Thinking about time
Stream time vs. event time
Windows of time
5.2.1. Sliding window
Example usage
Framework support
5.2.2. Tumbling window
Example use
Framework support
5.3. Summarization techniques
5.3.1. Random sampling
5.3.2. Counting distinct elements
5.3.3. Frequency
5.3.4. Membership
5.4. Summary
Chapter 6. Storing the analyzed or collected data
6.1. When you need long-term storage
Direct writing
Indirect writing
6.2. Keeping it in-memory
6.2.1. Embedded in-memory/flash-optimized
6.2.2. Caching system
Read-through
Refresh-ahead
Write-through
Write-around
Write-back (write-behind)
6.2.3. In-memory database and in-memory data grid
6.3. Use case exercises
6.3.1. In-session personalization
Embedded in-memory/flash-optimized
Caching system
IMDB or IMDG
Taking it to the next level
6.3.2. Next-generation energy company
6.4. Summary
Chapter 7. Making the data available
7.1. Communications patterns
7.1.1. Data Sync
Benefits
Drawbacks
7.1.2. Remote Method Invocation and Remote Procedure Call
Benefits
Drawbacks
7.1.3. Simple Messaging
Benefits
Drawbacks
7.1.4. Publish-Subscribe
Benefits
Drawbacks
7.2. Protocols to use to send data to the client
7.2.1. Webhooks
7.2.2. HTTP Long Polling
7.2.3. Server-sent events
7.2.4. WebSockets
7.3. Filtering the stream
7.3.1. Where to filter
7.3.2. Static vs. dynamic filtering
7.4. Use case: building a Meetup RSVP streaming API
7.5. Summary
Chapter 8. Consumer device capabilities and limitations accessing the data
8.1. The core concepts
UI/end-user application
Integration with third-party/stream processors
8.1.1. Reading fast enough
Third-party streaming API
Your streaming API
8.1.2. Maintaining state
8.1.3. Mitigating data loss
8.1.4. Exactly-once processing
8.2. Making it real: SuperMediaMarkets
8.3. Introducing the web client
8.3.1. Integrating with the streaming API service
8.4. The move toward a query language
8.5. Summary
2. Taking it real world
Chapter 9. Analyzing Meetup RSVPs in real time
9.1. The collection tier
9.1.1. Collection service data flow
9.2. Message queuing tier
9.2.1. Installing and configuring Kafka
9.2.2. Integrating the collection service and Kafka
9.3. Analysis tier
9.3.1. Installing Storm and preparing Kafka
9.3.2. Building the top n Storm topology
9.3.3. Integrating analysis
9.4. In-memory data store
9.5. Data access tier
9.5.1. Taking it to production
9.6. Summary
The streaming data architectural blueprint
Index
List of Figures
List of Tables
List of Listings
Preface
For as long as I can remember, I have been fascinated with speed as it relates to computing and am always trying to find a way to do something faster. In the late 1990s, when I spent most of my time writing software in C++, my favorite keyword was __asm, which means “the following block of code is in assembly language,” and I understood what was happening at the machine level. I worked on mobile software in the early 2000s and again the story was how could we sync data faster or make things run faster on the PalmPilots and Windows CE devices we were using? At the time we had huge (by that day’s standards, anyway) medical databases (around 25–50 MB in size) that required external cards on a PalmPilot to store and several applications that needed to provide interactive speed when searching and browsing the data.
As data volumes started to grow in the industries I was working in, I found myself at the perfect intersection of large data sets and speed to business insight. The data was growing in volume and being generated at faster and faster speeds, and business wanted answers to questions in shorter and shorter timeframes from the time data was being generated. To me, it was the perfect marriage: large data and a need for speed. Around 2001 I began to work on marketing analytics and e-commerce applications, where data was continuously being updated and we needed to provide insight into it in near real time. In 2009 I started working at Webtrends, where my love for speed and delivering insight at speed really matured. At Webtrends, analytics was our core business, and the idea of real-time analytics was just starting to catch the interest of our customers. The first project I worked on aimed to deliver key metrics in a dashboard within five minutes of a clickstream event happening anywhere in the world. At the time, that was pushing the envelope.
In 2011 I was part of an emerging products team. Our mission was to continue to push the idea of real-time analytics and try to disrupt our industry. After spending time researching, prototyping, and thinking through our next step, a perfect storm occurred. We had been looking at Apache Kafka, and then in September 2011 Apache Storm was open sourced. We immediately started to run like crazy with it. By winter we had early-adopter customers looking at what we were building. At that point we never looked back and set our sights on delivering on a Service Level Agreement (SLA) that was, in essence: From click to dashboard in three seconds or less, globally!
After many months and a lot of work by what became a much larger team, we delivered on our promise and won the Digital Analytics New Technology of the Year award (www.digitalanalyticsassociation.org/awards2013). I was deeply involved in building and architecting all aspects of this solution, from the data collection to the initial UI (which was affectionately called “Bare Bones,” due to my lack of UI skills).
We continued our pursuit and began looking at Spark Streaming when it was still part of the Berkeley AMPLab. Since those days I have continued to pursue building more and more streaming systems that deliver on the ultimate goal of delivering insights at the speed of thought. Today I continue to speak internationally on the topic and work with companies across the globe in designing, building, and solving streaming problems.
Even today I still see a widespread lack of understanding of all the pieces that go into building and delivering a streaming system. You can usually find references to pieces of the stack, but rarely do you find out how to think through the entire stack and understand each of the tiers.
It is therefore with great pleasure that I have tried in this book to share and distill this real-world experience and knowledge. My goal has been to provide a solid foundation from which you can build and explore a complete streaming system.
Acknowledgments
First, I want to thank my family for their support during the writing of this book. There were many weekends and nights of “Sorry, I can’t help with the garden (or play lacrosse or go to the get-together)—I need to write.” I’m sure that wasn’t easy for my children to hear; nor was it always easy for my wife to buffer and pick up my slack. Through all the highs and lows that go into this process their support never wavered and they remained a constant source of encouragement and inspiration. For this I owe a tremendous debt of gratitude to my wife and children; a simple thank you cannot express it enough.
Thanks to Karen, my development editor, for her endless patience, understanding, and willingness to always talk things through with me throughout this entire journey. To Robin, my acquisition editor, for believing in me, nurturing the idea of this book, and being a sounding board to make sure the train was staying on the tracks during some rough patches in the early days. To Bert, for his teachings on how to tell a story, how to find the right level of depth with a narrative, and pedagogical insight into the construction of a technical book. To my technical development editor Gregor, whose very thoughtful and insightful feedback helped craft this book into what it is today. Lastly, but certainly not least, thanks to the entire Manning team for the fantastic effort to finally get us to this point.
Thanks also to all the people who bought and read early versions of the manuscript through the MEAP early access program, to those who contributed to the Author Online forum, and to the countless reviewers for their invaluable feedback, including Andrew Gibson, Dr. Tobias Bürger, Jake McCrary, Rodrigo Abreu, Andy Keffalas, John Guthrie, Kosmas Chatzimichalis, Giuliano Bertoti, Carlos Curotto, Andy Kirsch, Douglas Duncan, Jeff Smith, Sergio Fernández González, Jaromir D.B. Nemec, Jose Samonte, Jan Nonnen, Romit Singhai, Chris Allan, Jonathan Thoms, Steven Jenkins, Lee Gilbert, Amandeep Khurana, and Charlie Gaines. Without all of you, this book wouldn’t be what it is today.
Many others contributed in various different ways. I can’t mention everyone by name because the acknowledgments would just roll on and on, but a big thank you goes out to everyone else who had a hand in helping make this possible!
About this Book
The world of real-time systems has been around for a long time; for many years real-time and/or streaming was solely the domain of hardware real-time systems. Those are systems where if an SLA isn’t met, there is potential loss of life. Over the last decade near-real-time systems have emerged and grown at an amazing rate. Everywhere you look you can find examples of data streaming: social media, games, smart cities, smart meters, your new washing machine, and the list goes on. Consider the following: Today if a byte of data were a gallon of water, an average home would be filled within 10 seconds; by the year 2020, it will only take 2 seconds. Making sense of and using such a deluge of data means building streaming systems.
Focusing on the big ideas of streaming and real-time data, the goals of this book are twofold. The first is to teach you how to think about the entire pipeline so you’re equipped with the skills to not only build a streaming system but also understand the tradeoffs at every tier. The second is to provide a solid launching point for you to delve deeper into each tier, as your business needs require or as your interest pulls you.
How to use this book
Although this book was designed to be read from start to finish, each chapter provides enough information that you can read and understand it on its own. Therefore, if you want to understand a particular tier, you should feel comfortable jumping straight to that chapter and then using what you learned there as your base for deeper exploration of the other chapters.
Who should read this book
This book is perfect for developers or architects and has been written to be easily accessible to technical managers and business decision makers—no prior experience with streaming or real-time data systems required. The only technical requirement this book makes is that you should feel comfortable reading Java. The source code is written in Java, as is the example code that accompanies chapter 9.
Roadmap
The roadmap of this book is represented in figure 1. A synopsis of each chapter follows.
Figure 1. Architectural blueprint with chapter mappings
Chapter 1 introduces the architectural blueprint of the book, which tells you where we are in the pipeline and serves as a great map if you need to jump from tier to tier. After laying out this blueprint, chapter 1 defines a real-time system, explores the differences between real-time and in-the-moment systems, and briefly touches on the importance of security (which could be its own book).
Chapter 2 explores all aspects of collecting data for a streaming system, from the interaction patterns through scaling and fault-tolerance techniques. This chapter covers all the relevant aspects of the collection tier and prepares you to build a scalable and reliable tier.
Chapter 3 is all about how to decouple the data being collected from the data being analyzed by using a message queuing tier in the middle. You will learn why you need a message queuing tier, how to understand message durability and different message delivery semantics, and how to choose the right technology for your business problem.
Chapter 4 dives into the common architectural patterns of distributed stream-processing frameworks, covering topics such as what message delivery semantics mean for this tier, how state is commonly handled, and what fault tolerance is and why we need it.
Chapter 5 jumps from discussing architecture to querying a stream, the problems with time, and the four popular summarization techniques. If chapter 4 is the what for distributed stream-processing engines, chapter 5 is the how.
Chapter 6 discusses options for storing data in-memory during and post analysis. It doesn’t spend much time discussing disk-based long-term storage solutions because they’re often used out of band of a streaming analysis and don’t offer the performance of the in-memory stores.
Chapter 7 is where we start to discuss what to do with the data we have collected and analyzed. It talks about communications patterns and protocols used for sending data to a streaming client. Along the way we’ll find out how to match up our business requirements to the various protocols and how to choose the right one.
Chapter 8 explores concepts to keep in mind when building a streaming client. This is not a chapter on just building an HTML web app; it goes much deeper into lower-level things to consider when designing the client side of a streaming system.
Chapter 9 . . . at this point, if you have read all the way through, congrats! A lot of material is covered in the first eight chapters. Chapter 9 is where we make it all come to life. Here we build a complete streaming data pipeline and discuss taking our sample to production.
About the code
All the code shown in the final chapter of this book can be found in the sample source code that accompanies this book. You can download the sample code free of charge from the Manning website at www.manning.com/books/streaming-data. You may also find the code on GitHub at https://github.com/apsaltis/StreamingData-Book-Examples.
The sample code is structured as separate Maven projects, one for each of the tiers we walk through in chapter 9. Instructions for building and running the software are provided during the walkthrough in chapter 9.
All source code in listings or in the text is in a fixed-width font like this to separate it from ordinary text. In some listings, the code is annotated to point out the key concepts.
About the author
Andrew Psaltis is deeply entrenched in streaming systems and obsessed with delivering insight at the speed of thought. He spends most of his waking hours thinking about, writing about, and building streaming systems. He helps customers of all sizes build and/or fix complex streaming systems, speaks around the globe about streaming, and teaches others how to build streaming systems. When he’s not busy being busy, he’s spending time with his lovely wife and two kids and watching as much lacrosse as possible.
Author Online
The purchase of Streaming Data includes free access to a private forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and other users. To access and subscribe to the forum, point your browser to www.manning.com/books/streaming-data. This page provides information on how to get on the forum once you’re registered, what kind of help is available, and the rules of conduct in the forum.
Manning’s commitment to our readers is to provide a venue where meaningful dialogue between individual readers and between readers and the author can take place. It’s not a commitment to any specific amount of participation on the part of the author, whose contribution to the book’s forum remains voluntary (and unpaid). We suggest you try asking him challenging questions, lest his interest stray!
The Author Online forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
About the cover illustration
The figure on the cover of Streaming Data is captioned “Habit of a Moor of Morrocco in winter in 1695.”
The illustration is taken from Thomas Jefferys’ A Collection of the Dresses of Different Nations, Ancient and Modern (four volumes), London, published between 1757 and 1772. The title page states that these are hand-colored copperplate engravings, heightened with gum arabic. Thomas Jefferys (1719–1771) was called “Geographer to King George III.” He was an English cartographer who was the leading map supplier of his day. He engraved and printed maps for government and other official bodies and produced a wide range of commercial maps and atlases, especially of North America. His work as a mapmaker sparked an interest in local dress customs of the lands he surveyed and mapped, which are brilliantly displayed in this collection.
Fascination with faraway lands and travel for pleasure were relatively new phenomena in the late 18th century and collections such as this one were popular, introducing both the tourist as well as the armchair traveler to the inhabitants of other countries. The diversity of the drawings in Jefferys’ volumes speaks vividly of the uniqueness and individuality of the world’s nations some 200 years ago. Dress codes have changed since then and the diversity by region and country, so rich at the time, has faded away. It is now often hard to tell the inhabitant of one continent from another. Perhaps, trying to view it optimistically, we have traded a cultural and visual diversity for a more varied personal life. Or a more varied and interesting intellectual and technical life.
At a time when it is hard to tell one computer book from