JUNIPER APSTRA ARCHITECTURE
Solve challenging data center issues with purpose-built automation for the full network life cycle
TABLE OF CONTENTS
Introduction
The Main Challenge: Composition
Knowledge Is Power: Dealing with Change Reliably
Stateful Orchestration
Extensibility: Future-Proofing the Data Center Network
Scalability: Growth Without Pain
Apstra Architecture Overview
Benefits of the Apstra Architecture
Conclusion
About Juniper Networks
EXECUTIVE SUMMARY
Building, deploying, operating, managing, and troubleshooting a data center network can be
difficult, expensive, and resource intensive. According to a 2019 Gartner report, simplifying IT
infrastructure is the Number 1 strategic priority for businesses today. Juniper® Apstra has been
purpose-built to automate the full life cycle of the data center network.
This paper describes some of the most difficult issues confronting data center networking
teams today and explains how Apstra’s architecture addresses each of them.
Introduction
Today’s data center networking architecture and operations teams face four main challenges:
• Composition. Putting together a coherent and smoothly working system with pieces of infrastructure from
different vendors, each with different capabilities, can be incredibly difficult. Juniper® Apstra creates a coherent
whole—a blueprint—consisting of all the information needed to deploy the system based on expressed intent.
The blueprint gets pushed to the physical infrastructure and allows an operator to deliver reliable and easy-to-
consume services.
• Reliable Change. Change is a constant in the data center. It can come from an operator trying to add new
services or expand capacity, or from the infrastructure in the form of failure conditions. Either way, operators
have to find better ways to deal with change. According to a variety of studies in recent years1, 65% to 70% of
network outages are caused by human configuration errors while executing a planned change.
• Extensibility. To encourage innovation and accommodate the inevitable rapid change in technology, data centers
need to be able to add new capabilities and functionality easily and smoothly. As technology evolves, vendors
innovate and come up with new features that operators leverage to stay competitive. This evolution causes a
design-time change, creating a need for new behavioral contracts—or reference designs—to govern the services
you want to deploy. In other words, your composition has to be extensible to allow further innovation or address
new requirements.
• Scalability. To address all of this at scale, data center networks must be able to grow and accommodate
innovation and increasing complexity.
The primary goal of the Apstra architecture is to address these issues directly.
Modern data centers are scale-out computers. They need an operating system that provides functionality analogous
to what a host operating system provides on a single machine today: resource management and process isolation.
Compute virtualization already does this for a single server. But for the data center as a scale-out computer, a data
center operator first needs to compose it; only then can resource partitioning be provided.
1 Examples include: https://www.computerweekly.com/news/2240179651/Human-error-most-likely-cause-of-datacentre-downtime-finds-study; https://www.networkworld.com/article/3142838/top-reasons-for-network-downtime.html; https://www.ponemon.org/library/national-survey-on-data-center-outages.
Second, you may need to compose your infrastructure to deliver a variety of services, each with multiple functional
aspects such as reachability, security, quality of experience, and availability.
As you instantiate multiple service instances, interactions between these components can cause service outages—
unless you have a firm understanding of how to map these services into enforcement mechanisms. As the
capabilities of your infrastructure evolve over time, you need to be able to leverage and introduce new innovation
without breaking the existing systems.
In a reference design, the roles and responsibilities of the physical and logical elements are well defined, which
in turn binds the scope of modeling and enables the specification of a minimal, yet complete, model. A reference
design also governs how the intent is mapped into enforcement mechanisms. Understanding this mapping
enables automation of troubleshooting and provides powerful analytics rooted in knowledge of how the system is
composed, as opposed to reverse engineering what is happening in your network.
Figure 1: A reference design defines the roles and responsibilities of physical and logical components, how services are mapped into enforcement mechanisms, and the expectations (situations to watch) that need to be met. It binds the scope and enables specification of minimal yet complete models; different reference designs can exist for different domains (data center, campus, edge), and they can be standard or snowflake designs.
In different reference designs, the same intent may be enforced with different, possibly newer and more innovative,
mechanisms. Understanding this mapping enables root-cause identification, which points operators directly to the
cause of service-affecting problems.
The reference design also specifies the expectations to be validated. For example, a reference design may stipulate
that each leaf should have a BGP session with each spine. This way, missing and unexpected BGP sessions are easily
identifiable.
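As a minimal illustration of how such an expectation could be generated and checked, the sketch below derives the expected leaf-spine BGP sessions from an intended topology and compares them with observed sessions; the data structures and names are assumptions for illustration, not Apstra's actual model or API.

    # Sketch: derive expected leaf-spine BGP sessions from intent and compare
    # against observed telemetry. Names and data shapes are illustrative only.

    intended_leafs = ["leaf1", "leaf2"]
    intended_spines = ["spine1", "spine2"]

    # Expectation from the reference design: every leaf peers with every spine.
    expected_sessions = {(leaf, spine) for leaf in intended_leafs for spine in intended_spines}

    # Observed sessions as reported by telemetry (hypothetical collector output).
    observed_sessions = {("leaf1", "spine1"), ("leaf1", "spine2"), ("leaf2", "spine1"),
                         ("leaf2", "unexpected-spine9")}

    missing = expected_sessions - observed_sessions      # should exist but do not
    unexpected = observed_sessions - expected_sessions   # exist but were never intended

    for leaf, spine in sorted(missing):
        print(f"anomaly: missing BGP session {leaf} <-> {spine}")
    for leaf, spine in sorted(unexpected):
        print(f"anomaly: unexpected BGP session {leaf} <-> {spine}")

Because both the expected and the observed sets are rebuilt from current intent and current telemetry, missing and unexpected sessions fall out of a simple comparison rather than manual inspection.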
Defining a reference design is a value-added activity that must be completed by a networking expert. With Apstra,
expertise is an explicit part of the system rather than something that lives only in an expert’s head. This allows
expertise to be modeled and inserted into the system, as opposed to being hardcoded.
Knowledge Is Power: Dealing with Change Reliably
To deal with change reliably, operators need knowledge rather than raw data, and that knowledge must satisfy two requirements:
1. Knowledge has to be interpreted in context, meaning that the relationship between conditions and expectations is understood, as described in the Context Model section.
2. Knowledge has to be timely, meaning that it reflects current conditions, as described in the Real-Time Monitoring and Notification section.
Stateful Orchestration
Stateful orchestration makes changes coming from the operator reliable using intent. A stateful orchestration flow of
actions can be seen in Figure 2.
Figure 2: Stateful orchestration flow. User-supplied intent is validated against real-time preconditions; enforcement parameters (resource pools, topological relationships) are derived in real time; configurations and expectations are rendered by the build, rendering, and expectations agents; and validations against the managed infrastructure (devices, controllers, third-party systems) surface anomalies.
If you want to make a change on top of your existing fully functional system, stateful orchestration allows the
user to supply only the intent for change. This is done in an implementation-agnostic way, which makes intent
specification simpler and less prone to errors. A complete set of steps executed during stateful orchestration
includes:
Step 1: Real-time precondition validation. Is this new request going to violate some policies? For example, are you
allowed to create this virtual network, or is it going to create some security holes? Or can you put a device into
maintenance mode knowing that some other devices are already offline and taking this device offline is going to
make your system vulnerable? With stateful orchestration, all these questions are asked and validated in real time
against the context graph.
Step 2: Real-time enforcement parameters derivation. In order to implement intent, specific parameters for
appropriate enforcement mechanisms will have to be supplied and activated on specific managed elements. Apstra’s
use of a reference design to spell out how intent is to be implemented, along with its understanding of the current
context, enables the system to automatically perform real-time enforcement parameters derivation. There is no
possibility of an operator accidentally inserting a wrong command, interface, IP address, or VLAN.
Step 3: Real-time configuration rendering and deployment. Using the derived enforcement parameters and the reference design's configuration templates, Apstra renders device-specific configurations and pushes them to the managed infrastructure.
Step 4: Real-time expectations generation. A reference design acting as a behavioral contract allows Apstra to perform real-time expectations generation, which describes conditions that need to be met in order to declare that the outcome has satisfied the intent.
Step 5: Real-time expectations validation. Apstra then triggers a collection of telemetry and context-enabled
operational analytics to perform real-time expectations validation.
Step 6: Validated service outcome. At the end of this process, the user can observe a validated service outcome in
unambiguous terms. Every change in intent or expected operational status is reflected in the context model in real
time, and any component that needs to be aware of the change is also notified about it in real time. Apstra extracts
relevant knowledge from raw operational data using intent-based analytics.
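For instance, the maintenance-mode check described in Step 1 can be pictured as the sketch below, which consults a context model for devices that are already offline before approving the change; the context structure, redundancy rule, and function names are assumptions for illustration, not Apstra code.

    # Sketch: real-time precondition validation for a maintenance-mode request.
    # The context model and redundancy rule below are illustrative assumptions.

    context = {
        "spine1": {"role": "spine", "status": "up"},
        "spine2": {"role": "spine", "status": "offline"},   # already down
        "leaf1":  {"role": "leaf",  "status": "up"},
    }

    def can_enter_maintenance(device: str, context: dict, min_spines_up: int = 2) -> bool:
        """Reject the change if taking this device offline would leave fewer
        healthy spines than the (assumed) reference design requires."""
        spines_up = sum(1 for name, node in context.items()
                        if node["role"] == "spine" and node["status"] == "up" and name != device)
        if context[device]["role"] == "spine" and spines_up < min_spines_up:
            return False
        return True

    print(can_enter_maintenance("spine1", context))   # False: spine2 is already offline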
With stateless orchestration, many of the steps are missing (as shown in Figure 3). Configuration deployment
outcome is what stateless orchestration is typically all about—pushing configuration to possibly multiple systems
and validating that the configurations have been accepted by managed elements. And this is where stateless
orchestration typically stops; there is no notion of service expectations or validation that these expectations
are met. Automated service outcome validation is absent, and instead, separate systems present a user with a
“single pane of glass” regarding the raw operational state. It is up to the user to perform visual discovery to find
out if the outcome has been achieved. This single pane of glass usually includes too much information and lacks
implementation-specific context, which makes visual discovery extremely difficult and subjective.
Figure 3: Stateless orchestration. Rendered configurations are pushed to the managed infrastructure (devices, controllers, third-party systems); there is no expectations generation or validation.
Intent-Based Analytics
Intent-based analytics (IBA) helps operators deal with operational status changes in their infrastructure by extracting
knowledge out of raw telemetry data. As mentioned before, you must have real-time query-able context before you
can use IBA.
IBA involves two main activities:
1. Detection of conditions of interest (situations to watch). Conditions detection is done by IBA probes.
2. Automation of classification of conditions of interest and deriving relationships between them. Conditions
vary in their semantic content. In other words, some of them are more important than others. It’s vital to
understand the relationships between conditions so that you can pinpoint important actionable conditions
(root causes) and understand which conditions are merely consequences that will disappear when important
root causes are taken care of. Condition classification and causality derivation are done by the root cause
identification component of IBA.
Figure 4: Classification of conditions. Anomalies are interpretations of measurement against expectation and have tangible value; symptoms are anomalies caused by an understood root cause and cannot be fixed, only treated; impacts are the effects of an anomaly or symptom on some element and may not be observable but are useful to know; root causes may not be observable, but fixing them removes symptoms and some or all of their impacts.
Anomalies
Anomalies essentially represent an interpretation of measurement, or some aggregate of it, against some
expectation, and as such have more tangible value than simple measurement.
Symptoms
Symptoms are anomalies caused by a well-understood root cause. They are typically easily observable, but they
cannot be fixed, only treated, like giving aspirin to a patient treats a fever but not the illness itself. If you don’t treat
the illness, the fever will return. Importantly, a symptom will disappear when the root cause is fixed, so symptoms
are not actionable; they are simply useful for diagnosing the root cause.
Impacts
Impacts indicate that something has happened as a result of an anomaly. Impacts may not always be observable,
but they are useful to know about. For example, you may want to know that an important customer is going to be
impacted by a failure of a device or excessive packet loss before they call you to complain.
When impacts are not observable, they can be calculated based on the knowledge of the intent and the
enforcement mechanisms used to implement the intent. Understanding impacts helps operators prioritize which
root causes and anomalies have the greatest effect on the operation.
Root Causes
The most important of all conditions are root causes. They may not always be observable, making them difficult
to diagnose, but they cause many symptoms and impacts. They are actionable, as fixing the root cause eliminates
associated symptoms and impacts.
“It ain’t what you don’t know that gets you into trouble. It’s what you know for
sure that just ain’t so.” - Mark Twain
IBA Probes
IBA probes are responsible for detecting conditions of interest and are driven by a behavioral contract that is
part of the reference design. IBA probes fetch data, apply some processing, and then compare the result against
expectations. An IBA probe is essentially a configurable data-processing pipeline that allows users to set up
conditions of interest (i.e., situations to watch).
For example, say an operator is interested in analyzing an equal-cost multipath routing (ECMP) imbalance that
applies only to fabric interfaces, and there were reports that a specific operating system version introduced a bug
in the ECMP hashing algorithm. A query can express the need to collect interface counters on every top-of-rack
switch but only for fabric interfaces and for switches that are running a particular version x.y.z of the switch OS.
Now that this situation to watch is set up, the operator does not need to worry about the changes. If a new fabric
link is added to a switch that matches the criteria, it will be automatically included in the analysis. If a new switch is
added, it will be automatically included. If someone upgrades another switch to version x.y.z, it will be automatically
included. The IBA probe requires no maintenance.
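The scoping behind such a probe can be pictured with the sketch below, which selects only fabric-facing interfaces on switches running the suspect OS version from a small in-memory inventory; the data structures and matching helper are hypothetical, since Apstra expresses this scoping with its own graph query language.

    # Sketch: select telemetry targets for an ECMP-imbalance probe.
    # Only fabric interfaces on switches running the suspect OS version match.
    # The structures below are illustrative, not Apstra's actual schema.

    switches = [
        {"name": "leaf1", "os_version": "x.y.z",
         "interfaces": [{"name": "et-0/0/48", "role": "fabric"},
                        {"name": "et-0/0/1",  "role": "server"}]},
        {"name": "leaf2", "os_version": "a.b.c",
         "interfaces": [{"name": "et-0/0/48", "role": "fabric"}]},
    ]

    def probe_targets(switches, suspect_version="x.y.z"):
        """Yield (switch, interface) pairs whose counters the probe should collect."""
        for sw in switches:
            if sw["os_version"] != suspect_version:
                continue
            for intf in sw["interfaces"]:
                if intf["role"] == "fabric":
                    yield sw["name"], intf["name"]

    # Re-evaluating the query picks up new links, new switches, or upgraded
    # switches automatically, with no probe maintenance.
    print(list(probe_targets(switches)))   # [('leaf1', 'et-0/0/48')]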
Stage Processors
In many situations, the operator is not interested in instantaneous values of raw telemetry data, but rather in aggregations or trends. IBA provides stage processors that aggregate information by calculating averages, min/max values, standard deviations, and so on. An operator can then compare these aggregates against expectations to identify whether the aggregate metric is inside or outside a specified range, in which case an anomaly is flagged.
The operator may then want to check whether this anomaly is sustained for a period of time exceeding a specific
threshold and flag the anomaly only when the threshold is exceeded in order to avoid flagging anomalies for
transient or temporary conditions. The operator can achieve this by simply configuring a subsequent stage to
contain what’s called a time-in-state processor.
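A rough sketch of such a pipeline, with a running-average stage, a range check, and a time-in-state stage, is shown below; the stage names, thresholds, and sample data are illustrative assumptions rather than Apstra's processor API.

    from statistics import mean

    # Sketch of an IBA-style pipeline: aggregate raw samples, compare the
    # aggregate against a range, and flag an anomaly only when the condition
    # is sustained. Thresholds and stage names are illustrative assumptions.

    samples = [70, 72, 95, 96, 97, 98, 94, 93]   # e.g., interface utilization (%)

    def rolling_average(values, window=4):
        return [mean(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]

    def in_range(values, upper=90):
        return [v > upper for v in values]        # True means "out of range"

    def time_in_state(flags, min_consecutive=3):
        """Raise an anomaly only when the out-of-range state persists."""
        streak = 0
        anomalies = []
        for flag in flags:
            streak = streak + 1 if flag else 0
            anomalies.append(streak >= min_consecutive)
        return anomalies

    print(time_in_state(in_range(rolling_average(samples))))

Only the final sample is flagged, because the out-of-range condition must persist for three consecutive evaluations before the anomaly is raised.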
In an Ethernet VPN (EVPN) environment, the intent will contain information such as the virtual networks that
exist, where their endpoints are, and information about enforcement mechanisms. The expected state derived from this intent can then be compared against the corresponding table coming from the telemetry, and anomalies flagged when a mismatch is found.
A probe can also be configured to track sudden changes in the sizes of forwarding and routing tables and alert when
trends are not as expected. The “what is expected” threshold can also be calculated dynamically from the intent,
as it can be a function of the number of virtual networks, the number of endpoints, the number of virtual tunnel
endpoints (VTEPs), the number of VTEPs with some issues, and so on. The possibilities are endless.
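As a simple illustration of a dynamically derived threshold, the expected route-table size could be computed from the intent and then compared with the measured value; the formula, tolerance, and numbers below are invented for the sketch and are not taken from Apstra.

    # Sketch: derive an expected route-table size from intent and flag a mismatch.
    # The formula and tolerance below are illustrative assumptions.

    intent = {"virtual_networks": 40, "endpoints": 800, "vteps": 32}

    def expected_routes(intent):
        # Hypothetical model: one route per endpoint plus per-VN and per-VTEP overhead.
        return intent["endpoints"] + intent["virtual_networks"] * 2 + intent["vteps"]

    measured_routes = 650
    expected = expected_routes(intent)
    tolerance = 0.1 * expected

    if abs(measured_routes - expected) > tolerance:
        print(f"anomaly: route table has {measured_routes} entries, expected about {expected}")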
Probes can be declaratively created, activated, and deactivated with a simple REST call or through the use of a
GUI. Probe activation also serves as a trigger for telemetry. In other words, specific telemetry data is collected only
if there is a probe interested in it, so probes essentially serve as a telemetry collection configuration mechanism.
Queries in data source processors make configuration as granular and precise as needed, eliminating “data hoarding
disorder,” which happens when tons of data is collected without a clear idea about what to do with it. This, in turn,
drives the cost of storing and processing data through the roof.
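A declarative probe creation call might look roughly like the following sketch using Python's requests library; the URL path, payload fields, and authentication header shown here are assumptions for illustration, not a verbatim reference to Apstra's API schema.

    import requests

    # Sketch: create an IBA probe with a single REST call.
    # The URL, payload shape, and auth header are illustrative assumptions.

    APSTRA = "https://apstra.example.com"
    BLUEPRINT_ID = "my-blueprint-id"
    TOKEN = "replace-with-session-token"

    probe = {
        "label": "fabric-ecmp-imbalance",
        "description": "Watch ECMP imbalance on fabric interfaces",
        # A real probe definition would declare its source and processing stages here.
    }

    resp = requests.post(
        f"{APSTRA}/api/blueprints/{BLUEPRINT_ID}/probes",
        json=probe,
        headers={"AuthToken": TOKEN},
        verify=False,          # lab-only: skip TLS verification for a self-signed cert
        timeout=10,
    )
    resp.raise_for_status()
    print("created probe:", resp.json())

Deactivating or deleting the probe through the same interface also stops the associated telemetry collection, since collection is driven by probe interest.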
Figure 5: An IBA probe is a configurable data-processing pipeline: intent-driven queries select what telemetry to collect (including custom collectors), processing stages transform the data, and the final stages raise anomalies.
Once a probe is configured, the output of every stage is available via API, GUI, and streaming endpoints. Apstra
comes with a set of built-in probes, and there is a repository of open-source probes on GitHub. The user can also
create new probes from scratch.
Root-Cause Identification
Root-cause identification (RCI) is a mechanism to automate the classification and derivation of causality relationships
between conditions that are identified in the infrastructure. The main benefit of RCI is surfacing root cause
conditions that require operator action from the sea of non-actionable conditions that are merely consequences of
root cause faults. For example, suppose an operator has instrumented the infrastructure well and is observing a log
of anomalous conditions on a “single pane of glass” console (see Figure 6). The red dots indicate different types of
conditions scattered across the infrastructure occurring in many different elements.
The problem with this picture is that it is not actionable—it is simply a lot of noise. In the absence of classification,
each condition is vying for the operator’s attention, but which one needs to be acted on? This is where RCI comes
into play. Armed with the knowledge of context, which says what the service is, how it is implemented and mapped
into enforcement mechanisms, and how the elements in the managed infrastructure are related, RCI identifies and
classifies the root cause(s) and related symptoms and impacts. As shown in Figure 7, the results of the analysis using
RCI are:
• The root cause was a memory leak on a switch called “spine_1” (root cause: large red dot).
• This memory leak caused the out-of-memory process killer to act on and kill a few processes, one of which was a
routing process (symptoms: dark blue dots).
• This, in turn, caused BGP sessions to fail on this and the peering devices, along with missing expected routing
table entries (symptoms: dark blue dots).
• As a result, endpoints belonging to customers X and Y experienced connectivity issues (impacts: light blue dots).
Figure 7: RCI surfaces the root cause as the only actionable event among the observed conditions.
RCI automates the complex mental process that a “single pane of glass” leaves to operators, overloading them with mountains of information and leaving it up to them to perform visual discovery and correlation. Instead, RCI presents the operator with a simple pane of glass that identifies only the actions needed.
Context Model
Having all the data doesn’t mean you have all the answers. And collecting huge amounts of data without automating
the extraction of knowledge is an invitation to an OpEx explosion, as your experts will be spending their valuable
time working on making sense out of data. Data in itself has limited value unless it gives you knowledge or answers
the right questions.
Raw telemetry data is typically stored in key-value stores, which are well suited for horizontal scaling and sharding.
But they don’t support queries other than simple key lookups. To extract knowledge, you would need to build
support for queries and have a context to construct these queries.
On the other hand, SQL data stores do support complex queries. The problem is that SQL tables are designed around queries anticipated at design time. If you want to ask a different question at run time, you may need to denormalize the database, or the queries may end up with so many joins that performance grinds to a halt. But extraction of operational knowledge is all about queries that come up at run time. This is where a graph-based implementation of contextual data shines, supporting arbitrarily complex queries with large numbers of “joins” at run time. The graph model can also be easily extended by simply adding more instances of its basic building blocks: nodes and relationships.
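To make the idea concrete, the sketch below builds a tiny context graph with the networkx library and answers an ad hoc, multi-hop question at run time; the node and relationship names are invented for illustration and do not mirror Apstra's graph schema or query language.

    import networkx as nx

    # Sketch: a tiny context graph and an ad hoc multi-hop query at run time.
    # Node and edge labels are illustrative, not Apstra's schema.

    g = nx.DiGraph()
    g.add_edge("tenant-X", "vn-10", relation="owns")
    g.add_edge("vn-10", "leaf1", relation="enforced_on")
    g.add_edge("vn-10", "leaf2", relation="enforced_on")
    g.add_edge("leaf1", "spine1", relation="uplinks_to")

    # Ad hoc question: which devices enforce services owned by tenant-X?
    def enforcement_points(graph, tenant):
        for _, vn, d in graph.out_edges(tenant, data=True):
            if d["relation"] != "owns":
                continue
            for _, device, d2 in graph.out_edges(vn, data=True):
                if d2["relation"] == "enforced_on":
                    yield device

    print(sorted(enforcement_points(g, "tenant-X")))   # ['leaf1', 'leaf2']

The same graph can answer a different multi-hop question a minute later without any schema change, which is the property the run-time queries described above depend on.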
Let’s take a closer look at what goes into this context, since it will help explain the power unleashed by the queries.
• Design artifacts: First, the intent-graph context contains design artifacts. For example, if you are designing a data
center network, it may contain the desired oversubscription factor, desired number of servers, and desired high
availability (HA) configuration (single-attached, dual-attached).
• Resource allocation: This contains decisions related to allocation of resources. It governs whether you are
managing your infrastructure as “cattle” or “pets.”
• Isolation policies: These specify whether certain resources (IPs, ASNs, VLANs) are allowed to be reused.
• Segmentation policies: This contains segmentation policies specifying which endpoints and workloads can talk to
each other and under what conditions.
• Enforcement information: This contains enforcement information about where (on which physical or virtual
appliances) and how (using which mechanisms) the segmentation and reachability policies were actually
implemented.
• External reachability policies: In the case of the data center, this may also contain external reachability policies,
specifying which inside endpoints can be seen by the outside world, and vice versa. It also contains which tenants
own which segments, the service levels promised to each of them, and the service-level objectives.
• Business perspective: This covers the business perspective, such as the service-level agreements and how much
it will cost you if you don’t meet them.
8-D View
Even though all these artifacts are represented in a single graph, that single source of truth is logically a single
multidimensional space, with at least the following eight dimensions (derived specifically from this example; there
could potentially be even more):
• Design
• Resources
• Isolation
• Segmentation
• Enforcement
• External reachability
• Service levels
• Business perspective
As a result, when things go wrong, you can have a single query operate across this 8-D view and answer complex
questions such as: “Given the recent failure condition on an element, are my design goals still valid?” “Are the
correct resources used on correct enforcement points, with desired isolation policies?” “Does it impact segmentation
policy in any way?” “Does it honor the service-level objective?” “What price am I paying for this failure?”
Existence of this 8-D view is not a “nice to have”; it is a prerequisite for reliable operation. With the Apstra Intent-
Based System, the 8-D view is built in. Without Apstra, you have to reconstruct this 8-D view with a very complex
layer, spanning multiple sources of truth and tribal knowledge that exist in the minds of your experts. Building this
layer, in an environment where individual sources of truth were not built with integration in mind and have different
semantics and behaviors, is undifferentiated heavy lifting in the best case and an unmanageable nightmare in the
worst case.
As you will see in the Architecture Overview section, the same pub/sub paradigm that is presented at the user level
is supported by an equivalent low-level pub/sub mechanism. This mechanism is implemented by the distributed
data store acting as a data-centric logical communication channel for the Apstra system processes that implement
application logic.
Self-Operation
Last but not least, there is a need to automate the reaction to some events in order to remediate problems, log
them for forensic analysis, or do a next-level drill-down to collect telemetry that helps with root cause analysis.
An automated reaction:
• Reduces risk, as remediation parameters are automatically derived from the up-to-date single source of truth and
are not subject to misconfiguration or stale data
• Improves customer experience, since the remediation happens in a timely manner
• Reduces operational cost, as it eliminates the need for manual troubleshooting and manual execution of the
remediation playbook
Ultimately, an automated reaction enables network self-operation and self-healing. The hurdles to a self-operating
network stem more from organizational or individual resistance than from technological limitations, as the required
functionality already exists.
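A self-operation hook can be pictured as the sketch below, where an anomaly notification triggers either a remediation action or a deeper telemetry drill-down and is logged for forensics; the anomaly fields, handler registry, and actions are hypothetical examples, not Apstra features.

    # Sketch: automated reaction to anomaly notifications.
    # Anomaly fields, handler names, and actions are illustrative assumptions.

    def drain_traffic(device):
        print(f"remediation: draining traffic away from {device}")

    def collect_drilldown_telemetry(device):
        print(f"forensics: collecting detailed telemetry from {device}")

    HANDLERS = {
        "link_flap": drain_traffic,
        "memory_leak": collect_drilldown_telemetry,
    }

    def on_anomaly(anomaly: dict):
        """React in a timely, parameter-safe way: the device name comes from
        the anomaly itself, not from a manually maintained runbook."""
        handler = HANDLERS.get(anomaly["type"])
        if handler:
            handler(anomaly["device"])
        print(f"logged for forensic analysis: {anomaly}")

    on_anomaly({"type": "memory_leak", "device": "spine_1"})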
Extensibility: Future-Proofing the Data Center Network
Apstra supports extensibility at three levels:
1. Reference design extensibility. This governs how the pieces of infrastructure work together to deliver on the intent. It includes plug-ins that allow modifications of the graph model, as well as modifications related to how the intent is mapped into enforcing mechanisms in the infrastructure.
2. Analytics extensibility. This is related to the definition of new conditions or situations to watch and allows a user
to define new validations, as well as how to classify and relate them.
3. Flexible service APIs. These provide the ability to extend the top-level service definitions.
Reference Design
As mentioned earlier, reference design is a behavioral contract that defines how intent is mapped to enforcement
mechanisms and what expectations must be satisfied in order to deem intent fulfilled. Any aspect of this contract
can be modified or extended. New node and relationship types can be defined, and configuration templates and
resource allocation mechanisms can be altered.
Analytics Extensibility
New analytics functionality can be introduced via two mechanisms:
1. New IBA probes can be defined that detect new conditions of interest. They may include new telemetry
collectors, as well as condition-specific data processing pipelines. Probes also can be published and subsequently
imported from a public repository.
2. New RCI models can be defined and loaded into a system. An RCI model is essentially a mapping of a new type
of root cause to a set of symptoms that it creates. Once this model is defined and loaded, the Apstra system
automates root cause identification based on observed symptoms.
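Conceptually, such a model can be pictured as a mapping from a root-cause type to the set of symptoms it explains, which can then be matched against observed anomalies; the sketch below is illustrative and does not reflect Apstra's RCI model format.

    # Sketch: a toy RCI model mapping a root-cause type to the symptoms it explains.
    # The model format and matching rule are illustrative assumptions.

    RCI_MODEL = {
        "memory_leak_on_switch": {"process_killed", "bgp_session_down", "route_missing"},
    }

    observed = {"bgp_session_down", "route_missing", "process_killed", "high_cpu"}

    def identify_root_causes(model, observed_symptoms, min_match=2):
        """Report a candidate root cause when enough of its symptoms are observed."""
        for cause, symptoms in model.items():
            matched = symptoms & observed_symptoms
            if len(matched) >= min_match:
                yield cause, matched

    for cause, matched in identify_root_causes(RCI_MODEL, observed):
        print(f"candidate root cause: {cause}; explains symptoms: {sorted(matched)}")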
Service APIs
Apstra exposes service-level APIs in the form of group-based policies, providing the flexibility to support a wide
range of services and policies in an implementation-agnostic way.
With group-based policies, intent is expressed as a graph, representing endpoints that are placed into groups
(member relationships) with the purpose of expressing intent for some common behavior. Policies are instantiated
and related to groups or individual endpoints to define that behavior.
Policies can relate to a group in a directional (from/to relationship) or nondirectional (applies-to) manner. Policies are
a collection of rules. Rules may have a next-rule relationship when ordering between the rules is important. Groups
can be composed of other groups (hierarchy). Groups can also have a relationship to other groups to express some
constraints (such as “these endpoints/groups are behind these groups of ports”). Endpoints, groups, policies, and
rules can be thought of as building blocks for expressing connectivity intent.
Figure 8: Group-based policy building blocks. Endpoints (e.g., server, external) are bound to groups (e.g., L2 and L3 domains); policies (e.g., HA, affinity, QoS, auto-scaling) relate to groups via to/from relationships and contain rules.
Endpoints are elements of your infrastructure that are subject to policies and, as such, are quite general in their
construction. They can represent an interface (physical or virtual), server, virtual machine (VM), container, or
application endpoint. The blue arrows in Figure 8 indicate “logical” inheritance. Endpoints contain parameters that define them more precisely (e.g., interface name, port number, serial number, hostname, VM UUID, container IP, application protocol, or UDP/TCP port number). Endpoints can also represent elements not managed by Apstra
(external endpoints) and as such are used to express constraints from the external systems with which Apstra needs
to interact (e.g., external router IP/ASN).
Endpoints are placed into groups. Any given endpoint can be a member of multiple groups, each expressing a different aspect of intended behavior.
Groups can also have overloaded semantics. For example, endpoints that are members of the L2 domain group
are members of an L2 broadcast domain (subnet). Similarly, endpoints that are members of the L3 domain have L3
reachability.
Groups can also be composed of other groups. Endpoint members of the group may be explicitly instantiated, or
there may be a dynamic membership specification as part of a group definition where endpoints are implied from
the specification present in the group.
For example, an L3 domain may have Classless Interdomain Routing (CIDR) or subnet as a property; therefore, all
endpoints with IPs in that range/subnet are implicitly members of the group. In that case, the L3-to-the-server
reference design specifies a number of external/internal subnets. Even though endpoints (containers) are not
explicitly specified and managed in Apstra, it is implied that containers do belong to the specified L3 domains. But in
order to model granular segmentation, we are also going to introduce explicit specifications of container endpoints
and the servers they are hosted on.
A policy defines common behavior and can contain parameters to define that behavior precisely. Policies can be
applied to the group in a nondirectional manner, or direction can be indicated when needed, such as in security
policies. Examples of specific policies include security, load balancing, HA, affinity (e.g., the desire to colocate
endpoints or to distribute them), and quality of service (QoS).
Policies can contain rules when needed. Rules typically follow the “condition followed by an action” pattern. For
example, in the security policy, “match” is a condition statement, and action is “allow/deny/log.”
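The building blocks described above can be sketched as simple data structures: endpoints join groups, a directional security policy relates two groups, and ordered rules follow the condition/action pattern. The classes, fields, and example values below are illustrative assumptions, not Apstra's API.

    from dataclasses import dataclass, field

    # Sketch of group-based policy building blocks: endpoints, groups, a
    # directional policy, and ordered condition/action rules. Illustrative only.

    @dataclass
    class Endpoint:
        name: str
        params: dict = field(default_factory=dict)   # e.g., interface, VM UUID, container IP

    @dataclass
    class Group:
        name: str
        members: list = field(default_factory=list)

    @dataclass
    class Rule:
        match: str        # condition, e.g., "tcp/443"
        action: str       # "allow", "deny", or "log"

    @dataclass
    class Policy:
        name: str
        src: Group        # "from" group (directional)
        dst: Group        # "to" group
        rules: list = field(default_factory=list)   # ordered: first match wins

    web = Group("web-tier", [Endpoint("vm-web-1", {"ip": "10.0.1.10"})])
    db = Group("db-tier", [Endpoint("vm-db-1", {"ip": "10.0.2.10"})])

    policy = Policy("web-to-db", src=web, dst=db,
                    rules=[Rule("tcp/5432", "allow"), Rule("any", "deny")])

    print(policy.name, [(r.match, r.action) for r in policy.rules])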
Scaling Processing
The second dimension of scaling is processing. Apstra can launch multiple copies of processing agents (per agent type) if and when required, and these copies share the processing load. More agents can be added by adding more servers to host them. Apstra also manages each agent’s life cycle.
The system’s state-based pub/sub architecture allows agents to react (provide application logic) to a well-defined subset of state. Coverage of the whole intent is achieved through separate agents, each delegated a different subset of state. This means that when there is a change in the intent or operational state, each agent reacts only to the incremental change, and the cost of that reaction is independent of the size of the whole state.
Apstra employs the traditional approach to deal with scale and associated complexity: decomposition. The “everyone
knows everything” approach doesn’t scale. You have to distribute the knowledge about the desired state and let each
agent determine how to reach that state, avoiding the need for centralized decision making. Apstra’s support for live graph queries means that clients such as the UI can ask for exactly what they want and get exactly what they need, and nothing more, allowing granular control of the amount of data fetched from the back end.
Fault tolerance is achieved by executing an Apstra application as multiple processes, possibly running on separate
hardware devices connected by a network and separating the state from processing with support for replicated
state and fast recovery of state.
Every Apstra reference design application is simply a collection of stateless agents as described above. Broadly
speaking, there are three classes of agents:
1. Interaction (web) agents are responsible for interacting with users, i.e., taking user input and feeding users with
relevant context from the data store.
2. Application agents are responsible for performing application domain-specific data transformations by
subscribing to input entities and producing output entities.
3. Device agents reside on (or are proxies for) a managed physical or virtual system, such as a switch, server, firewall,
load balancer, or even controller. They are used for writing configuration and gathering telemetry via native
(device-specific) interfaces.
Figure 9: Apstra agent architecture. Application agents and web agents publish/subscribe (p/s) to the central data store; device agents and proxy agents interface with managed devices, servers, and third-party APIs.
This interaction can be illustrated with an example describing a portion of Apstra’s data center networking reference
design application.
Figure 10: Apstra Intent-Based System data center networking reference design application. Build, configuration rendering, expectations, device telemetry, and anomaly detection agents are chained through the data store, turning intent into rendered configurations, expectations, device status, and anomalies.
The web agent takes user input—in this case, a design for an L3 Clos fabric that contains the number of spines,
leafs, and links between them, as well as the resource pools to use for fabric IPs and autonomous system numbers (ASNs). The web agent publishes this intent into the data store as a set of graph nodes and relationships and their
respective properties.
Then, assuming the validations pass, the build agent publishes that intent along with resource allocations into the
data store:
1. The configuration rendering agent subscribes to the output of the build agent.
2. For each node, the configuration agent fetches the relevant data, including resources, and merges it with
configuration templates.
3. The expectations agent also subscribes to the output of the build agent and generates expectations that need to
be met in order to validate the outcome.
4. The device telemetry agent subscribes to the output of the expectations agent and starts collecting relevant
telemetry.
5. IBA probes process the raw telemetry and compare it against expectations and publish anomalies.
6. The RCI agent analyzes the anomalies and classifies them into symptoms, impacts, and identified root causes.
Agents communicate via attribute-based interfaces (hence the term data-centric) by publishing entities and
subscribing to changes in entities. Being data-centric also implies that data definition is part of the framework and is implemented by defining the entities, in contrast to message-based systems.
The data-centric publish-subscribe system does not suffer from the problems of message-based systems. In a
message-based system, sooner or later the number of messages exceeds the capacity of the system to store or
consume them. Dealing with this is hard as one has to replay the history of messages to get to a consistent state.
The data-centric system is resilient to surges in state changes as it is fundamentally dependent only on the last
state. This state captures the important context and abstracts away all the possible (and irrelevant) event sequences
that led to it. Code written using the state-machine paradigm is easier to read, maintain, and debug.
Hard problems (e.g., elasticity and fault tolerance) are solved once and on behalf of all agents. The typical architecture then consists of a number of stateless agents that can be restarted in case of failure and that pick up where they left off by simply rereading the state they subscribe to from the system database (SysDB).
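A stateless agent in this style can be pictured as the sketch below: on startup (or restart after a failure) it rereads the slice of state it subscribes to and then reacts only to incremental changes. The in-memory data store and subscription interface are stand-ins for illustration, not Apstra's SysDB API.

    # Sketch: a stateless, restart-safe agent in a data-centric pub/sub system.
    # The in-memory "data store" and subscription interface are stand-ins.

    class DataStore:
        def __init__(self):
            self.entities = {}            # last-state only; no message history
            self.subscribers = []

        def publish(self, key, value):
            self.entities[key] = value
            for callback in self.subscribers:
                callback(key, value)      # notify about the incremental change

        def subscribe(self, callback):
            self.subscribers.append(callback)
            # A (re)starting agent simply rereads the current state it cares about.
            for key, value in self.entities.items():
                callback(key, value)

    def rendering_agent(key, value):
        if key.startswith("intent/"):
            print(f"rendering config for {key}: {value}")

    store = DataStore()
    store.publish("intent/leaf1", {"asn": 65001})   # published before the agent starts
    store.subscribe(rendering_agent)                # agent catches up from last state
    store.publish("intent/leaf2", {"asn": 65002})   # agent reacts to incremental change

Because the agent depends only on the last state rather than a message history, restarting it produces the same result as having run it continuously.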
Benefits of the Apstra Architecture
The Apstra architecture:
• Helps operators deal with change reliably. This is possible through real-time query-able intent and operational context.
• Simplifies all aspects of network service life cycle, including Day 0, 1, and 2 operations. And simplicity reduces
the likelihood of operator error.
• Reduces operational risk through stateful orchestration, which leverages precondition validations, post-condition
validations, automated configuration rendering, and automated expectations validation.
• Reduces mean time to repair and subject matter expert (SME) workload through simplified operational analytics, freeing your most skilled resources to spend their time improving and innovating rather than fighting fires.
• Extracts more knowledge while collecting and storing less data. Powered by the reference design behavioral contract, Apstra knows what it is looking for and collects only that information. This on-the-fly processing can result in a five to six order of magnitude reduction in storage needs and post-processing of non-interesting data. This allows you to run your infrastructure more efficiently and remain competitive while controlling costs.
• Surfaces issues quickly and easily. Apstra identifies significant, actionable events in a “simple pane of glass,” and it eliminates the noise of symptoms that are merely artifacts of an identified root cause.
• Automates complex workflows. The Apstra system allows you to automate context-rich troubleshooting workflows (which are otherwise cumbersome, ineffective, time consuming, and costly).
• Enables zero-touch maintenance. Since it is in constant sync with intent, Apstra enables zero-touch, zero-cost maintenance in the presence of change. As such, it is resilient, automatically responds to changes, and saves the huge costs associated with maintaining data-processing pipelines.
• Eliminates costly do-it-yourself (DIY) development. DIY development of data-processing pipeline integration efforts is costly and fragile, takes time and resources away from your core business, and requires a lot of SME heavy lifting.
• Delivers highly accurate results compared to machine learning/artificial intelligence approaches.
• Employs a vendor-agnostic approach, giving you freedom of choice among the best vendors and capabilities.
• Uses common APIs across your public/private/hybrid cloud infrastructure.
For example, in one customer deployment, Apstra continuously performed roughly 82,000 validations across the following categories:
• Physical infrastructure: 6000 interface status validations; 1000 cabling validations; 36,000 error counters; 10,000 power, temperature, and voltage metrics validations
• L2/L3 data plane: 12,000 queue drop counters; sanity of hundreds of MLAGs; 3000 Spanning Tree Protocol (STP) validations; notifications of status changes
• Control plane: 1500 BGP sessions health; ~500 expected next hops/default routes
• Capacity plan: ~500 trending analyses with configured thresholds for the use of routing tables; Address
Resolution Protocol (ARP) tables; multicast tables per virtual routing and forwarding (VRF); 6000 link usage
validations
• Compliance: Ensuring expected OS versions running on all 100 or so switches
• Multicast: 2500 validations for expected Physical Interface Module (PIM) neighbors; 300 rendezvous point
checks; 25 validations regarding detection of abnormal patterns in count of sources, groups, source-group pairs
on rendezvous points
Essentially, instead of a single pane of glass with 82,000 entries, the customer was presented with a simple pane of glass that showed only anomalies, categorized into a dashboard per the customer’s specification.
Conclusion
The inability to make reliable changes in your IT infrastructure is a major obstacle to growth and innovation. Apstra removes that obstacle, and the fear of change that comes with it, making it possible to eradicate the dreaded “legacy infrastructure” forever and empowering you to make changes reliably, which in turn allows you to innovate and stay competitive.
Copyright 2021 Juniper Networks, Inc. All rights reserved. Juniper Networks, the Juniper Networks logo, Juniper, Junos, and other trademarks are registered trademarks of Juniper Networks,
Inc. and/or its affiliates in the United States and other countries. Other names may be trademarks of their respective owners. Juniper Networks assumes no responsibility for any inaccuracies in
this document. Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this publication without notice.