M2 - Serverless Messaging with Cloud Pub/Sub
Agenda
Processing Streaming Data
Cloud Pub/Sub
Advanced BigQuery Functionality
Now that we have a good understanding of the process of streaming data, let's dive
into Cloud Pub/Sub to see how it works. As we begin this topic, I would ask you to
keep your mind open to new ways of doing things. Cloud Pub/Sub does streaming
differently than probably anything you have used in the past; it may be a different
model than what you have seen before.
Cloud Pub/Sub
Cloud Pub/Sub provides a fully managed data distribution and delivery system. It can
be used for many purposes, but it is most commonly used to loosely couple the parts of a
system. You can use Cloud Pub/Sub to connect applications within Google Cloud,
and to connect them with applications on-premises or in other clouds to create hybrid data engineering
solutions. With Cloud Pub/Sub, the applications do not need to be online and available
all the time, and the parts do not need to know how to communicate with each other, but
only with Cloud Pub/Sub, which can simplify system design.
First, Pub/Sub is not software; it is a service. Like all of the other serverless
services we have looked at, you don't install anything to use Pub/Sub.
Cloud Pub/Sub client libraries are available in C#, Go, Java, Node.js, Python, and
Ruby. These wrap the REST API calls, which can be made from any language.
It is highly available.
Cloud Pub/Sub
It offers message durability: by default, it will save your messages for seven days
in case your systems are down and not able to process them.
https://pixabay.com/illustrations/chess-black-and-white-pieces-3413429/
Cloud Pub/Sub
Finally, Cloud Pub/Sub is highly scalable as well. We at Google actually process,
internally, about 100 million messages per second across the entire infrastructure.
This was actually one of the early use cases for Pub/Sub at Google: distributing the
search index around the world. We keep local copies of the search index around the
world, as you might imagine, in order to be able to serve up results with minimal
latency.
If you think about how you would architect that, we are crawling the entire world wide
web, so we need to send the entire world wide web around the world. Either that or
we would need to have multiple crawlers all over the world, but then you have data
consistency problems; they would all be getting different indexes.
Therefore, what we do is use Pub/Sub to distribute it. As the crawler goes out, it grabs
every page from the world wide web, and we send every single page as a message
on Pub/Sub, where it gets picked up by all local copies of the search index so it can be
indexed.
Currently, Google indexes the web anywhere from every two weeks, at the slowest,
to more than once an hour, for example on really popular news sites. So, on
average, Google is indexing the web about three times a day. Thus, what we
are doing is sending the entire world wide web over Pub/Sub three times a day.
You control the qualities of your Cloud Pub/Sub solution by the number of publishers,
number of subscribers, and the size and number of messages. These factors provide
tradeoffs between scale, low latency, and high throughput.
https://pixabay.com/illustrations/chess-black-and-white-pieces-3413429/
Example of a Cloud Pub/Sub application
[Diagram: a Cloud Pub/Sub Topic and a Subscription]
How does Pub/Sub work? The model is very simple. The story of Cloud Pub/Sub is
the story of two data structures, the Topic and the Subscription. Both the Topic and
the Subscription are abstractions which exist in the Pub/Sub framework independently
of any workers, subscribers, etc. The Cloud Pub/Sub client that creates the Topic is
called the Publisher, and the Cloud Pub/Sub client that creates the Subscription is
called the Subscriber.
<INSTRUCTOR>
To receive messages published to a topic, you must create a subscription to that
topic. Only messages published to the topic after the subscription is created are
available to subscriber applications. The subscription connects the topic to a
subscriber application that receives and processes messages published to the topic.
A topic can have multiple subscriptions, but a given subscription belongs to a single
topic.
</INSTRUCTOR>
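As a concrete illustration (the names here are just examples), a topic and a subscription can be created from the command line; only messages published after the subscription exists will be delivered to it:

gcloud pubsub topics create hr-topic
gcloud pubsub subscriptions create directory-sub --topic=hr-topic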
A new employee arrives, causing a new hire event.
[Diagram: the HR System publishes a New Hire Event to Cloud Pub/Sub, which delivers it through a Subscription]
Here you see there is an HR Topic that relates to New Hire Events. For example, a
new person joins your company, and this notification should allow other applications
that need to be notified about a new user joining to subscribe and get that message.
What applications would need to know that a new person has joined?
https://pixabay.com/illustrations/avatar-clients-customers-icons-2155431/
The message is sent from Topic to Subscription
[Diagram: the HR System publishes a New Hire Event to the HR Topic in Cloud Pub/Sub; a Subscription delivers it to the Directory Service]
One example is the Company Directory. This is a client of the Subscription, also called
a Subscriber. However, Cloud Pub/Sub is not limited to one Subscriber or one
Subscription.
There can be multiple Subscriptions for each Topic
[Diagram: the HR System publishes a New Hire Event to the HR Topic, which has multiple Subscriptions feeding decoupled subscriber clients: the Directory Service, the Facilities System, and the Account Provisioning System]
Here there are multiple Subscriptions and multiple Subscribers. Maybe the Facilities
System needs to know about the new employee for badging, and the Account
Provisioning System needs to know for payroll.
These subscriber clients are decoupled from one another and isolated from the
publisher. In fact, we will see later that the HR System could go offline after it has sent
its message to the HR Topic, and the message will still be delivered to the
subscribers.
These examples show one Subscription and one Subscriber, but you can actually
have multiple Subscribers for a single Subscription.
And there can be multiple subscribers per Subscription
[Diagram: the HR Topic delivers the New Hire Event over Push subscriptions to the earlier systems and over a Pull subscription to the Badge Activation System]
In this example, the Badge Activation System requires a human being to activate the
badge. There are multiple workers, but not all of them are available all the time.
Cloud Pub/Sub makes the message available to all of them. But only one person
needs to fetch the message and handle it. This is called a Pull Subscription. The other
examples are Push Subscriptions.
https://pixabay.com/illustrations/avatar-clients-customers-icons-2191918/
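For reference, whether a subscription is pull or push is chosen when the subscription is created. A minimal sketch with gcloud, using illustrative names and an assumed HTTPS endpoint:

# Pull subscription (the default): workers call Pub/Sub to fetch messages.
gcloud pubsub subscriptions create badge-activation-sub --topic=hr-topic

# Push subscription: Pub/Sub calls the registered endpoint with each message.
gcloud pubsub subscriptions create directory-sub --topic=hr-topic \
  --push-endpoint=https://example.com/new-hire-handler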
And there can be multiple publishers to the Topic
[Diagram: a second, decoupled publisher client, the Vendor Office, publishes a New Contractor Event to the same HR Topic, which delivers it to the existing Push and Pull subscribers, including the Badge Activation System]
Now, a new contractor arrives. Instead of entering through the HR System, they go
through the Vendor Office. The same kinds of actions need to occur for this worker:
they need to be listed in the company directory, Facilities needs to assign them a
desk, Account Provisioning needs to set up their corporate identity and accounts, and
the Badge Activation System needs to print and activate their contractor badge. A
message can be published by the Vendor Office to the HR Topic. The Vendor Office
and the HR System are entirely decoupled from one another but can make use of the
same company services.
You can see from this illustration how important Pub/Sub is. Therefore, it gets the
highest priority.
You have learned generally what Cloud Pub/Sub does. Next, you will learn how it
works and many of the advanced features it provides.
https://pixabay.com/illustrations/avatar-clients-customers-icons-2191918/
Publish/Subscribe patterns
[Diagram: multiple Publishers, each publishing multiple Messages]
You can use this distribution of messages, these publish and subscribe patterns, to do
fan-in or fan-out. In what you see here, the different colors represent different
messages.
Publish/Subscribe patterns
(Fan out) Multiple subscribers, where you have multiple use cases for the same piece of
data, and all data is sent to multiple different subscribers.
Publish/Subscribe patterns
(Fan in) What you see in the second example is three different messages from two different
publishers being sent, all on the same topic, which means the subscription will
get all three messages.
Publish/Subscribe patterns
Over on the right, we have two subscriptions, so both are going to get the messages,
both the red message and the blue message.
Cloud Pub/Sub provides both Push and Pull delivery
[Diagram: a Publisher publishes message m1 to a Topic; Pull subscribers (SUB C, SUB D) pull the message from the Subscription and then send a separate ACK]
Cloud Pub/Sub allows for both Push and Pull delivery. In the Pull model, your clients are
subscribers and will be periodically calling for messages, and Pub/Sub will deliver
the messages that have arrived since the last call.
In the Pull model, you are going to have to acknowledge the message as a separate
step. So, what you see here is that the subscriber initially makes the call to pull
messages, it gets a message back, and then, separately, it acknowledges that
message.
The reason for this is that pull queues are often used to implement some kind
of queueing system for work to be done, so you don't want to acknowledge the
message until you firmly have the message and have done the processing on it;
otherwise you might lose the message if the system goes down.
Therefore, we generally recommend you wait to acknowledge until after you have
processed it.
In the Pull model, the messages are stored for up to seven days.
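A minimal Python sketch of that two-step pattern, written in the same older client-library style as the slide code; the subscription path and the process() helper are placeholders:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = 'projects/MY_PROJECT/subscriptions/MY_SUBSCRIPTION'  # placeholder

# Step 1: pull a batch of messages.
response = subscriber.pull(subscription_path, max_messages=10)

ack_ids = []
for received in response.received_messages:
    process(received.message.data)  # hypothetical processing function
    ack_ids.append(received.ack_id)

# Step 2: acknowledge only after the processing has succeeded.
if ack_ids:
    subscriber.acknowledge(subscription_path, ack_ids)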
Cloud Pub/Sub provides both Push and Pull delivery
[Diagram: Push delivery to subscriber SUB A alongside Pull delivery to subscriber SUB C]
In the Push model, Pub/Sub actually uses an HTTP endpoint. You register a webhook as
your subscription, and the Pub/Sub infrastructure itself will call you with the latest
messages. In the case of Push, you just respond with an HTTP 200 OK status for the
call, and that tells Pub/Sub the message delivery was successful.
It will actually use the rate of your success responses to self-limit so that it doesn't
overload your worker.
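As a rough sketch of the receiving side, assuming a Flask app and an illustrative route: Pub/Sub sends an HTTP POST with a JSON envelope whose data field is base64-encoded, and any 2xx response acknowledges the message:

import base64
from flask import Flask, request

app = Flask(__name__)

@app.route('/pubsub/push', methods=['POST'])  # the webhook registered as the push endpoint
def pubsub_push():
    envelope = request.get_json()
    data = base64.b64decode(envelope['message'].get('data', ''))
    print(data)        # handle the message here
    return ('', 204)   # any 2xx status tells Pub/Sub the delivery succeeded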
At least once delivery guarantee
● A message is resent if the subscriber takes more than the ackDeadline to respond
The way the acknowledgements work is to ensure every message gets delivered at
least once. You acknowledge messages on a per-subscription basis. So, if you have
two subscriptions and one acknowledges a message but the other doesn't, the
subscription that has not acknowledged will continue to get the message. Pub/Sub
will continue to try to deliver the message for up to seven days until it is
acknowledged.
Now, there is a replay mechanism as well: you can rewind, go back in time,
and have it replay messages, but in any case, you will always be able to go back
seven days.
You can also set the acknowledgement deadline, and you do that on a per-subscription basis.
So if you know that on average it takes you 15 seconds to process a message in your
work queue, then you might set your ackDeadline to 20 seconds. This ensures Pub/Sub
doesn't try to redeliver messages that are still being processed.
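For example, the deadline can be set when the subscription is created; the subscription name and the 20-second value here are just illustrative:

gcloud pubsub subscriptions create my-work-queue-sub \
    --topic=sandiego \
    --ack-deadline=20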
Subscribers can work as a team or separately
● Only one client needs to acknowledge receipt
● Guaranteed delivery
[Diagram: messages m1, m2, and m3 are delivered from the Topic through Subscription 1, where they are split across two Subscribers, and through Subscription 2, which delivers all three to a single Subscriber]
In this case, it is going to distribute the messages across the subscribers, so, for
example, messages one and three go to one subscriber and message two goes to
another. The distribution is effectively random, based on when each subscriber pulls
messages throughout the day.
In the case of a Push subscription, you only have one web endpoint, so you will
typically only have one subscriber. But that one subscriber could be an App Engine
standard app or a Cloud Run container image, which autoscales. So it is one web
endpoint, but it can have autoscaling workers behind the scenes. And that is actually a
very good pattern.
Publishing with Cloud Pub/Sub
gcloud pubsub topics create sandiego   # Create topic
Let’s look at a little bit of code now. This example is using the Client Library for
Pub/Sub. If we want to publish the message, we first create the topic.
Publishing with Cloud Pub/Sub
Python

import os
from google.cloud import pubsub_v1

# Create a client
publisher = pubsub_v1.PublisherClient()

# Set the topic name
topic_name = 'projects/{project_id}/topics/{topic}'.format(
    project_id=os.getenv('GOOGLE_CLOUD_PROJECT'),
    topic='MY_TOPIC_NAME',
)

# Create the topic and publish a message
publisher.create_topic(topic_name)
publisher.publish(topic_name, b'My first message!', author='dylan')
More commonly, it will be done in code. Here we get a PublisherClient, create the topic,
and publish the message.
Publishing with Cloud Pub/Sub
Notice the b right in front of 'My first message!'. This is because Pub/Sub just
sends raw bytes, which means that you aren't restricted to just text. You can send
other data, like images, if you wanted to. The limit is 10 MB.
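For instance, reusing the publisher and topic_name from the example above, you could publish the raw bytes of a file (the filename is just an example, and the file must stay under the 10 MB limit):

# Publish the raw bytes of an image file as the message body.
with open('badge_photo.png', 'rb') as f:
    publisher.publish(topic_name, f.read())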
Publishing with Cloud Pub/Sub
There are also extra attributes that you can include in messages. In this example, you
see author equals dylan. Pub/Sub will keep track of those attributes to allow your
downstream systems to get metadata about your messages without having to put it all
in the message and parse it out. So instead of serializing and de-serializing, it will just
keep track of those key value pairs.
Some of them have special meaning. We will see some of those shortly.
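A small sketch of the same idea, again reusing the publisher and topic_name from above; the attribute names are illustrative, and on the receiving side the key-value pairs appear on the message's attributes field rather than inside the payload:

# Publisher side: attach metadata as attributes instead of packing it into the payload.
publisher.publish(topic_name, b'New hire record', author='dylan', department='engineering')

# Subscriber side (inside a callback): read the metadata without parsing the payload.
def callback(message):
    print(message.data)
    print(message.attributes['author'], message.attributes['department'])
    message.ack()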
Subscribing with Cloud Pub/Sub using async pull
Python

import os
from google.cloud import pubsub_v1

# Create a subscriber client
subscriber = pubsub_v1.SubscriberClient()

# Callback invoked when a message is received
def callback(message):
    print(message.data)
    message.ack()

future = subscriber.subscribe(subscription_name, callback)
To subscribe with Pub/Sub using asynchronous pull, the code is similar: you create a
subscriber client, select the subscription, and register a callback that is invoked for
each message received.
import time

# Option 1: keep the main thread from exiting to allow it to process messages.
while True:
    time.sleep(60)

# Option 2: the subscriber pulls a specific number of messages synchronously.
response = subscriber.pull(subscription_path, max_messages=NUM_MESSAGES)
When you are doing a pull subscription, it looks like this. You can also pull messages from
the command line; you will see this in the lab. By default it will just return one
message, the latest message, but there is a --limit flag you can set. Maybe you
want 10 messages at a time; you can try that in the lab.
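A sketch of that command-line pull, with an illustrative subscription name; --auto-ack acknowledges messages as they are returned, and --limit raises the count from the default of one:

gcloud pubsub subscriptions pull mySub --auto-ack --limit=10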
By default, the Publisher batches messages; turn this off if you desire lower latency
[Diagram: the Publisher sends messages m1, m2, and m3 to the Topic as a single batch; they are delivered to the Subscription one at a time]
You can also batch publish messages. This just prevents the overhead of the call
for individual messages on the publisher side. It allows the publisher to wait and
send 10, or 50, at a time, which increases efficiency. However, if you are waiting for 50
messages, the first one now has latency associated with it. So, it is a trade-off in
your system: what do you want to optimize? But, in any case, even if you batch
publish, the messages still get delivered one at a time to your subscribers.
Python

from google.cloud import pubsub
from google.cloud.pubsub import types

client = pubsub.PublisherClient(
    batch_settings=types.BatchSettings(max_messages=500),
)
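BatchSettings also exposes max_bytes and max_latency. Reusing the imports above, a publisher that cares more about latency than call overhead could, for example, force messages out almost immediately; the values here are just illustrative:

client = pubsub.PublisherClient(
    batch_settings=types.BatchSettings(
        max_messages=10,    # send after at most 10 buffered messages...
        max_bytes=1024,     # ...or 1 KB of buffered data...
        max_latency=0.01,   # ...or 10 ms, whichever comes first
    ),
)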
● Latency: no guarantees
Just because data can be delivered out of order, and sometimes more than once, does not
mean you have to process the data that way.
You could write an application that handles out-of-order and duplicated messages.
This is different from a true queueing system. In general, messages will be delivered in
order, but you can't rely on that with Pub/Sub. This is, again, one of the
compromises made for scalability, especially since it is a global service. We have a
mesh network, so a message might take another route, and if it happens to be a
slower route, you could have an earlier message arriving later.
So, for example, you wouldn't use this to implement a chat app, because it would be
awkward when messages arrive out of order. Therefore, we will handle ordering using
other techniques.
We are going to use Dataflow in conjunction with Pub/Sub to solve these problems.
Dataflow will deduplicate messages based on the message ID because in Pub/Sub, if
a message is delivered twice, it will have the same ID in both cases. BigQuery also
has limited capabilities to deduplicate.
Dataflow will not be able to order in the sense of providing exact sequential order of
when messages were published. However, it will help us deal with late data.
Using Pub/Sub and Dataflow together allows us to operate at a scale that wouldn't be
possible otherwise.
Use Cloud Pub/Sub for streaming resilience
[Diagram: on the left, an application receives more data than it can process; on the right, Cloud Pub/Sub holds the data until the application is ready]
Let's recap this, then. In the example on the left, an overload of arriving data causes a
traffic spike that overdrives the resources of the application, as illustrated by the
smoke. One solution to this problem is to size the application to handle the highest
traffic spike plus some additional capacity as a safety buffer. This is not only wasteful
of resources, which must be retained at top capacity even when not being used, but it
also provides a recipe for a distributed denial of service attack by creating an upper limit
at which the application will cease to behave normally and will exhibit
non-deterministic behavior.
The solution on the right uses Cloud Pub/Sub as an intermediary, receiving and
holding data until the application has resources to handle it, either through processing
the backlog of work, or by autoscaling to meet the demand.
https://pixabay.com/photos/smoke-background-artwork-swirl-69124/
Security, monitoring, and logging for Pub/Sub
Cloud Audit Logs maintains three audit logs for each Google Cloud project, folder, and
organization: Admin Activity, Data Access, and System Event.
Admin Activity audit logs contain log entries for API calls or other administrative
actions that modify the configuration or metadata of resources.
Data Access audit logs contain API calls that read the configuration or metadata of
resources, as well as user-driven API calls that create, modify, or read user-provided
resource data.
Admin Activity audit logs are always written; you can't configure or disable them.
Data Access audit logs are disabled by default because they can be quite large; they
must be explicitly enabled to be written.
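For instance, Pub/Sub audit log entries can be inspected from the command line with a Cloud Logging filter on the Pub/Sub service name; this is just a sketch, and the filter may need adjusting for your project:

gcloud logging read 'protoPayload.serviceName="pubsub.googleapis.com"' --limit=10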
What we are trying to do in this lab is to create a Pub/Sub topic and subscription and
stream simulated San Diego traffic data into Pub/Sub. This is the first of four parts we will
do.