M2 - Serverless Messaging with Cloud Pub/Sub
Agenda
Processing Streaming Data
Cloud Pub/Sub
Advanced BigQuery Functionality
Now that we have a good understanding of the process of streaming data, let's dive
into Cloud Pub/Sub to see how it works. As we begin this topic, I would ask you to
keep your mind open to new ways of doing things. Cloud Pub/Sub does streaming
differently than probably anything you have used in the past; it may be a different
model than what you have seen before.
Cloud Pub/Sub
Cloud Pub/Sub provides a fully managed data distribution and delivery system. It can
be used for many purposes, but it is most commonly used to loosely couple the parts of a
system. You can use Cloud Pub/Sub to connect applications within Google Cloud,
and to connect them with applications on-premises or in other clouds to create hybrid data engineering
solutions. With Cloud Pub/Sub, the applications do not need to be online and available
all the time, and the parts do not need to know how to communicate with each other, but
only with Cloud Pub/Sub, which can simplify system design.
First, Pub/Sub is not software; it is a service. Like all of the other serverless
services we have looked at, you don't install anything to use Pub/Sub.
Cloud Pub/Sub client libraries are available in C#, Go, Java, Node.js, Python, and
Ruby. These wrap the REST API calls, which can be made from any language.
It is highly available.
Cloud Pub/Sub
It offers message durability: by default, it will save your messages for seven days
in case your systems are down and not able to process them.
https://pixabay.com/illustrations/chess-black-and-white-pieces-3413429/
Cloud Pub/Sub
Finally, Cloud Pub/Sub is highly scalable as well. We at Google actually process,
internally, about 100 million messages per second across the entire infrastructure.
This was actually one of the early use cases for Pub/Sub at Google: distributing the
search index around the world. We keep local copies of the search index around the
world, as you might imagine, in order to be able to serve up results with minimal
latency.
If you think about how you would architect that, we are crawling the entire world wide
web, so we need to send the entire world wide web around the world. Either that or
we would need to have multiple crawlers all over the world, but then you have data
consistency problems; they would all be getting different indexes.
Therefore, what we do is use Pub/Sub to distribute it. As the crawler goes out, it grabs
every page from the world wide web, and we send every single page as a message
on Pub/Sub, where it gets picked up by all local copies of the search index so it can be
indexed.
Currently, Google indexes the web anywhere from every two weeks, at the slowest,
to more than once an hour, for example on really popular news sites. So, on
average, Google is indexing the web about three times a day. Thus, what we
are doing is sending the entire world wide web over Pub/Sub three times a day.
You control the qualities of your Cloud Pub/Sub solution by the number of publishers,
number of subscribers, and the size and number of messages. These factors provide
tradeoffs between scale, low latency, and high throughput.
https://pixabay.com/illustrations/chess-black-and-white-pieces-3413429/
Example of a Cloud Pub/Sub application
[Diagram: a Cloud Pub/Sub Topic and a Subscription]
How does Pub/Sub work? The model is very simple. The story of Cloud Pub/Sub is
the story of two data structures, the Topic and the Subscription. Both the Topic and
the Subscription are abstractions which exist in the Pub/Sub framework independently
of any workers, subscribers, etc. The Cloud Pub/Sub client that creates the Topic is
called the Publisher, and the Cloud Pub/Sub client that creates the Subscription is
called the Subscriber.
<INSTRUCTOR>
To receive messages published to a topic, you must create a subscription to that
topic. Only messages published to the topic after the subscription is created are
available to subscriber applications. The subscription connects the topic to a
subscriber application that receives and processes messages published to the topic.
A topic can have multiple subscriptions, but a given subscription belongs to a single
topic.
</INSTRUCTOR>
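As a concrete illustration (the names here are just examples), a topic and a subscription can be created from the command line; only messages published after the subscription exists will be delivered to it:

gcloud pubsub topics create hr-topic
gcloud pubsub subscriptions create directory-sub --topic=hr-topic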
A new employee arrives, causing a new hire event.
[Diagram: the HR System publishes a New Hire Event to Cloud Pub/Sub, which delivers it through a Subscription]
Here you see there is an HR Topic that relates to New Hire Events. For example, a
new person joins your company, and this notification should allow other applications
that need to be notified about a new user joining to subscribe and get that message.
What applications would need to know that a new person has joined?
https://pixabay.com/illustrations/avatar-clients-customers-icons-2155431/
The message is sent from Topic to Subscription
[Diagram: the HR System publishes a New Hire Event to the HR Topic in Cloud Pub/Sub; a Subscription delivers it to the Directory Service]
One example is the Company Directory. This is a client of the Subscription, also called
a Subscriber. However, Cloud Pub/Sub is not limited to one Subscriber or one
Subscription.
There can be multiple Subscriptions for each Topic
[Diagram: the HR System publishes a New Hire Event to the HR Topic, which has multiple Subscriptions feeding decoupled subscriber clients: the Directory Service, the Facilities System, and the Account Provisioning System]
Here there are multiple Subscriptions and multiple Subscribers. Maybe the Facilities
System needs to know about the new employee for badging, and the Account
Provisioning System needs to know for payroll.
These subscriber clients are decoupled from one another and isolated from the
publisher. In fact, we will see later that the HR System could go offline after it has sent
its message to the HR Topic, and the message will still be delivered to the
subscribers.
These examples show one Subscription and one Subscriber, but you can actually
have multiple Subscribers for a single Subscription.
And there can be multiple subscribers per Subscription
[Diagram: the HR Topic delivers the New Hire Event over Push subscriptions to the earlier systems and over a Pull subscription to the Badge Activation System]
In this example, the Badge Activation System requires a human being to activate the
badge. There are multiple workers, but not all of them are available all the time.
Cloud Pub/Sub makes the message available to all of them. But only one person
needs to fetch the message and handle it. This is called a Pull Subscription. The other
examples are Push Subscriptions.
https://pixabay.com/illustrations/avatar-clients-customers-icons-2191918/
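For reference, whether a subscription is pull or push is chosen when the subscription is created. A minimal sketch with gcloud, using illustrative names and an assumed HTTPS endpoint:

# Pull subscription (the default): workers call Pub/Sub to fetch messages.
gcloud pubsub subscriptions create badge-activation-sub --topic=hr-topic

# Push subscription: Pub/Sub calls the registered endpoint with each message.
gcloud pubsub subscriptions create directory-sub --topic=hr-topic \
  --push-endpoint=https://example.com/new-hire-handler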
And there can be multiple publishers to the Topic
[Diagram: a second, decoupled publisher client, the Vendor Office, publishes a New Contractor Event to the same HR Topic, which delivers it to the existing Push and Pull subscribers, including the Badge Activation System]
Now, a new contractor arrives. Instead of entering through the HR System, they go
through the Vendor Office. The same kinds of actions need to occur for this worker:
they need to be listed in the company directory, Facilities needs to assign them a
desk, Account Provisioning needs to set up their corporate identity and accounts, and
the Badge Activation System needs to print and activate their contractor badge. A
message can be published by the Vendor Office to the HR Topic. The Vendor Office
and the HR System are entirely decoupled from one another but can make use of the
same company services.
You can see from this illustration how important Pub/Sub is. Therefore, it gets the
highest priority.
You have learned generally what Cloud Pub/Sub does. Next, you will learn how it
works and many of the advanced features it provides.
https://pixabay.com/illustrations/avatar-clients-customers-icons-2191918/
Publish/Subscribe patterns
[Diagram: multiple Publishers, each publishing multiple Messages]
You can use this distribution of messages, these publish and subscribe patterns, to do
fan-in or fan-out. In what you see here, the different colors represent different
messages.
Publish/Subscribe patterns
(Fan out) Multiple subscribers, where you have multiple use cases for the same piece of
data, and all data is sent to multiple different subscribers.
Publish/Subscribe patterns
(Fan in) What you see in the second example is three different messages from two different
publishers being sent, all on the same topic, which means the subscription will
get all three messages.
Publish/Subscribe patterns
Over on the right, we have two subscriptions, so both are going to get the messages,
both the red message and the blue message.
Cloud Pub/Sub provides both Push and Pull delivery
[Diagram: a Publisher publishes message m1 to a Topic; Pull subscribers (SUB C, SUB D) pull the message from the Subscription and then send a separate ACK]
Cloud Pub/Sub allows for both Push and Pull delivery. In the Pull model, your clients are
subscribers and will be periodically calling for messages, and Pub/Sub will deliver
the messages that have arrived since the last call.
In the Pull model, you are going to have to acknowledge the message as a separate
step. So, what you see here is that the subscriber initially makes the call to pull
messages, it gets a message back, and then, separately, it acknowledges that
message.
The reason for this is that pull queues are often used to implement some kind
of queueing system for work to be done, so you don't want to acknowledge the
message until you firmly have the message and have done the processing on it;
otherwise you might lose the message if the system goes down.
Therefore, we generally recommend you wait to acknowledge until after you have
processed it.
In the Pull model, the messages are stored for up to seven days.
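A minimal Python sketch of that two-step pattern, written in the same older client-library style as the slide code; the subscription path and the process() helper are placeholders:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = 'projects/MY_PROJECT/subscriptions/MY_SUBSCRIPTION'  # placeholder

# Step 1: pull a batch of messages.
response = subscriber.pull(subscription_path, max_messages=10)

ack_ids = []
for received in response.received_messages:
    process(received.message.data)  # hypothetical processing function
    ack_ids.append(received.ack_id)

# Step 2: acknowledge only after the processing has succeeded.
if ack_ids:
    subscriber.acknowledge(subscription_path, ack_ids)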
Cloud Pub/Sub provides both Push and Pull delivery
[Diagram: Push delivery to subscriber SUB A alongside Pull delivery to subscriber SUB C]
In the Push model, Pub/Sub actually uses an HTTP endpoint. You register a webhook as
your subscription, and the Pub/Sub infrastructure itself will call you with the latest
messages. In the case of Push, you just respond with an HTTP 200 OK status for the
call, and that tells Pub/Sub the message delivery was successful.
It will actually use the rate of your success responses to self-limit so that it doesn't
overload your worker.
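As a rough sketch of the receiving side, assuming a Flask app and an illustrative route: Pub/Sub sends an HTTP POST with a JSON envelope whose data field is base64-encoded, and any 2xx response acknowledges the message:

import base64
from flask import Flask, request

app = Flask(__name__)

@app.route('/pubsub/push', methods=['POST'])  # the webhook registered as the push endpoint
def pubsub_push():
    envelope = request.get_json()
    data = base64.b64decode(envelope['message'].get('data', ''))
    print(data)        # handle the message here
    return ('', 204)   # any 2xx status tells Pub/Sub the delivery succeeded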
At least once delivery guarantee
● A message is resent if the subscriber takes more than the ackDeadline to respond
The way the acknowledgements work is to ensure every message gets delivered at
least once. You acknowledge messages on a per-subscription basis. So, if you have
two subscriptions and one acknowledges a message but the other doesn't, the
subscription that has not acknowledged will continue to get the message. Pub/Sub
will continue to try to deliver the message for up to seven days until it is
acknowledged.
Now, there is a replay mechanism as well: you can rewind, go back in time,
and have it replay messages, but in any case, you will always be able to go back
seven days.
You can also set the acknowledgement deadline, and you do that on a per-subscription basis.
So if you know that on average it takes you 15 seconds to process a message in your
work queue, then you might set your ackDeadline to 20 seconds. This ensures Pub/Sub
doesn't try to redeliver messages that are still being processed.
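For example, the deadline can be set when the subscription is created; the subscription name and the 20-second value here are just illustrative:

gcloud pubsub subscriptions create my-work-queue-sub \
    --topic=sandiego \
    --ack-deadline=20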
Subscribers can work as a team or separately
● Only one client needs to acknowledge receipt
● Guaranteed delivery
[Diagram: messages m1, m2, and m3 are delivered from the Topic through Subscription 1, where they are split across two Subscribers, and through Subscription 2, which delivers all three to a single Subscriber]
In this case, it is going to distribute the messages across the subscribers, so, for
example, messages one and three go to one subscriber and message two goes to
another. The distribution is effectively random, based on when each subscriber pulls
messages throughout the day.
In the case of a Push subscription, you only have one web endpoint, so you will
typically only have one subscriber. But that one subscriber could be an App Engine
standard app or a Cloud Run container image, which autoscales. So it is one web
endpoint, but it can have autoscaling workers behind the scenes. And that is actually a
very good pattern.
Publishing with Cloud Pub/Sub
gcloud pubsub topics create sandiego   # Create topic
Let’s look at a little bit of code now. This example is using the Client Library for
Pub/Sub. If we want to publish the message, we first create the topic.
Publishing with Cloud Pub/Sub
Python

import os
from google.cloud import pubsub_v1

# Create a client
publisher = pubsub_v1.PublisherClient()

# Set the topic name
topic_name = 'projects/{project_id}/topics/{topic}'.format(
    project_id=os.getenv('GOOGLE_CLOUD_PROJECT'),
    topic='MY_TOPIC_NAME',
)

# Create the topic and publish a message
publisher.create_topic(topic_name)
publisher.publish(topic_name, b'My first message!', author='dylan')
More commonly, it will be done in code. Here we get a PublisherClient, create the topic,
and publish the message.
Publishing with Cloud Pub/Sub
Notice the b right in front of 'My first message!'. This is because Pub/Sub just
sends raw bytes, which means that you aren't restricted to just text. You can send
other data, like images, if you wanted to. The limit is 10 MB.
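For instance, reusing the publisher and topic_name from the example above, you could publish the raw bytes of a file (the filename is just an example, and the file must stay under the 10 MB limit):

# Publish the raw bytes of an image file as the message body.
with open('badge_photo.png', 'rb') as f:
    publisher.publish(topic_name, f.read())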
Publishing with Cloud Pub/Sub
There are also extra attributes that you can include in messages. In this example, you
see author equals dylan. Pub/Sub will keep track of those attributes to allow your
downstream systems to get metadata about your messages without having to put it all
in the message and parse it out. So instead of serializing and de-serializing, it will just
keep track of those key value pairs.
Some of them have special meaning. We will see some of those shortly.
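A small sketch of the same idea, again reusing the publisher and topic_name from above; the attribute names are illustrative, and on the receiving side the key-value pairs appear on the message's attributes field rather than inside the payload:

# Publisher side: attach metadata as attributes instead of packing it into the payload.
publisher.publish(topic_name, b'New hire record', author='dylan', department='engineering')

# Subscriber side (inside a callback): read the metadata without parsing the payload.
def callback(message):
    print(message.data)
    print(message.attributes['author'], message.attributes['department'])
    message.ack()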
Subscribing with Cloud Pub/Sub using async pull
Python

import os
from google.cloud import pubsub_v1

# Create a subscriber client
subscriber = pubsub_v1.SubscriberClient()

# Callback invoked when a message is received
def callback(message):
    print(message.data)
    message.ack()

future = subscriber.subscribe(subscription_name, callback)
To subscribe with Pub/Sub using asynchronous pull, the code is similar: you create a
subscriber client, select the subscription, and register a callback that is invoked for
each message received.
import time

# Option 1: keep the main thread from exiting to allow it to process messages.
while True:
    time.sleep(60)

# Option 2: the subscriber pulls a specific number of messages synchronously.
response = subscriber.pull(subscription_path, max_messages=NUM_MESSAGES)
When you are doing a pull subscription, it looks like this. You can also pull messages from
the command line; you will see this in the lab. By default it will just return one
message, the latest message, but there is a --limit flag you can set. Maybe you
want 10 messages at a time; you can try that in the lab.
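A sketch of that command-line pull, with an illustrative subscription name; --auto-ack acknowledges messages as they are returned, and --limit raises the count from the default of one:

gcloud pubsub subscriptions pull mySub --auto-ack --limit=10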
By default, the Publisher batches messages; turn this off if you desire lower latency
[Diagram: the Publisher sends messages m1, m2, and m3 to the Topic as a single batch; they are delivered to the Subscription one at a time]
You can also batch publish messages. This just prevents the overhead of the call
for individual messages on the publisher side. It allows the publisher to wait and
send 10, or 50, at a time, which increases efficiency. However, if you are waiting for 50
messages, the first one now has latency associated with it. So, it is a trade-off in
your system: what do you want to optimize? But, in any case, even if you batch
publish, the messages still get delivered one at a time to your subscribers.
Python

from google.cloud import pubsub
from google.cloud.pubsub import types

client = pubsub.PublisherClient(
    batch_settings=types.BatchSettings(max_messages=500),
)
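BatchSettings also exposes max_bytes and max_latency. Reusing the imports above, a publisher that cares more about latency than call overhead could, for example, force messages out almost immediately; the values here are just illustrative:

client = pubsub.PublisherClient(
    batch_settings=types.BatchSettings(
        max_messages=10,    # send after at most 10 buffered messages...
        max_bytes=1024,     # ...or 1 KB of buffered data...
        max_latency=0.01,   # ...or 10 ms, whichever comes first
    ),
)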
● Latency: no guarantees
Just because data can be delivered out of order, and sometimes more than once, does not
mean you have to process the data that way.
You could write an application that handles out-of-order and duplicated messages.
This is different from a true queueing system. In general, messages will be delivered in
order, but you can't rely on that with Pub/Sub. This is, again, one of the
compromises made for scalability, especially since it is a global service. We have a
mesh network, so a message might take another route, and if it happens to be a
slower route, you could have an earlier message arriving later.
So, for example, you wouldn't use this to implement a chat app, because it would be
awkward when messages arrive out of order. Therefore, we will handle ordering using
other techniques.
We are going to use Dataflow in conjunction with Pub/Sub to solve these problems.
Dataflow will deduplicate messages based on the message ID because in Pub/Sub, if
a message is delivered twice, it will have the same ID in both cases. BigQuery also
has limited capabilities to deduplicate.
Dataflow will not be able to order in the sense of providing exact sequential order of
when messages were published. However, it will help us deal with late data.
Using Pub/Sub and Dataflow together allows us to operate at a scale that wouldn't be
possible otherwise.
Use Cloud Pub/Sub for streaming resilience
[Diagram: on the left, an application receives more data than it can process; on the right, Cloud Pub/Sub holds the data until the application is ready]
Let's recap this, then. In the example on the left, an overload of arriving data causes a
traffic spike that overdrives the resources of the application, as illustrated by the
smoke. One solution to this problem is to size the application to handle the highest
traffic spike plus some additional capacity as a safety buffer. This is not only wasteful
of resources, which must be retained at top capacity even when not being used, but it
also provides a recipe for a distributed denial of service attack by creating an upper limit
at which the application will cease to behave normally and will exhibit
non-deterministic behavior.
The solution on the right uses Cloud Pub/Sub as an intermediary, receiving and
holding data until the application has resources to handle it, either through processing
the backlog of work, or by autoscaling to meet the demand.
https://pixabay.com/photos/smoke-background-artwork-swirl-69124/
Security, monitoring, and logging for Pub/Sub
Cloud Audit Logs maintains three audit logs for each Google Cloud project, folder, and
organization: Admin Activity, Data Access, and System Event.
Admin Activity audit logs contain log entries for API calls or other administrative
actions that modify the configuration or metadata of resources.
Data Access audit logs contain API calls that read the configuration or metadata of
resources, as well as user-driven API calls that create, modify, or read user-provided
resource data.
Admin Activity audit logs are always written; you can't configure or disable them.
Data Access audit logs are disabled by default because they can be quite large; they
must be explicitly enabled to be written.
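For instance, Pub/Sub audit log entries can be inspected from the command line with a Cloud Logging filter on the Pub/Sub service name; this is just a sketch, and the filter may need adjusting for your project:

gcloud logging read 'protoPayload.serviceName="pubsub.googleapis.com"' --limit=10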
What we are trying to do in this lab is to create a Pub/Sub topic and subscription and
stream simulated San Diego traffic data into Pub/Sub. This is the first of four parts we will
do.