Atsig Article Quantum Aug07


VSAT network troubleshooting – capturing knowledge and using it

Networking & Databuses
Peer reviewed
by Lars Moltsen (1), Raquel Barco (2), Pedro Lazaro (2) and Sasikanth Munagala (1)
1) Wirtek A/S, Nibevej 54, DK-9200 Aalborg SV, Denmark
2) University of Malaga, E.T.S.I.Telecomunicación, 29071 Málaga, Spain

Today the main part of the troubleshooting process is manual, but ATSIG, a research project
headed by Danish software provider Wirtek in cooperation with Siemens Austria, has now
been defined under the framework of a cooperation agreement with the European Space
Agency, to develop a concept for automation of troubleshooting processes.

Running a Satcom network is quite a challenge: a huge number of extremely remotely located pieces of equipment have to play together to ensure satisfaction of the customer, who is expecting a highly reliable end-to-end data connection. However, equipment can break down or be incorrectly set up, a thunderstorm can cause poor signal quality to/from a ground station, or external radio sources can interfere with the carriers. These examples illustrate the complexity that the Satcom operator faces.

The Troubleshooting Process
The Satcom troubleshooting process is similar to troubleshooting in many other domains. It can be decomposed into the following three parts:
1. Fault Detection: Identifying that some part of the network has a problem (yet not knowing what the problem is).
2. Diagnosis Creation: Investigation of the problem to identify the root cause (what is the problem?).
3. Solution Deployment: Once the root cause of problems has been identified, a solution has to be deployed onto the system.
The level of automation in the Satcom operator troubleshooting process is typically limited to the Fault Detection part, in which alarm systems and automated scripts normally help monitoring staff to detect when the network behavior is abnormal. The rest of the process relies on human experience and knowledge to couple alarms, measurements, and other pieces of information to identify possible root causes and to carry out solution deployment. The level of automation is shown in Figure 1.

Figure 1: Current Satcom operator troubleshooting process

The Diagnosis Creation task has not been successfully automated because of system complexity: it is not possible to use typical diagnostic solutions like rule-based systems or decision trees, since the level of uncertainty is simply too high for these “deterministic” techniques to work. Too often, an input/symptom is not available, or due to randomness it is “on” when it would normally be “off”. In such cases, a decision tree will either get stuck (missing value) or follow a wrong branch and come up with a wrong conclusion (random behaviour).
The ATSIG project aims to automate the Diagnosis Creation task by utilizing a technique from the artificial intelligence (AI) domain known as Bayesian networks. This technique is very robust towards missing input data, and has a built-in method for handling randomness. It has been shown to be quite useful for other complex diagnostic purposes, including medical diagnostics as well as troubleshooting of computers, printers, terrestrial radio networks, and huge electrical locomotives. The resulting solution in ATSIG will increase the level of automation in the troubleshooting process to also cover Diagnosis Creation, as shown in Figure 2.

Figure 2: Troubleshooting process using the ATSIG auto-diagnosis concept

Solution Requirements
In order to automate the Diagnosis Creation task, it is necessary to analyze current processes and define a number of requirements that the solution should meet. The following list has been identified:
1. The system shall be able to diagnose a selected subset of root causes with a similar (or better) quality compared to a human troubleshooter given the same input.
This is the basic requirement of an automated diagnosis system. Given some input in the form of a set of symptom readings, the system shall be able to provide an output which is at least as good as what the human troubleshooter could achieve with the same input. However, it will make sense to limit the amount of pos-
QUANTUM
AUGUST 2007 21
sible faults/root causes that the system can handle to a subset covering somewhere in the order of 90-99% of all troubleshooting cases. The reason is that the remaining root causes are rare, and thus the time spent to model these in the system may not be worth the effort as long as the system is able to output that the problem is something other than the covered root causes.
2. The system shall replicate human knowledge transfer.
Imagine a human troubleshooter who is working to solve a troubleshooting case and has already analyzed the problem based on the immediately available symptom readings. Then, for some reason, he needs to go home and hands over the case to a colleague. The handover will typically consist of a statement like: “My analysis shows that the most likely problem is X because of …, but it could also be Y due to …”. The automated diagnosis system should be able to replicate this type of communication.
3. Human experts shall be enabled to store knowledge about diagnosis creation.
Since the key knowledge about how to identify the root causes today is in the brains of the Satcom operator troubleshooting engineers, the system should provide an easy means for capturing such knowledge in a knowledge base/model.
4. New knowledge shall be stored and used to improve accuracy.
When the automated diagnosis system has been taken into use, a lot of new cases will become available where both the symptom readings and eventually the actual root cause will be known. Such data is new knowledge/experience, which should be used to improve the knowledge base.

The proposed solution
In April 2007, the first prototype of the ATSIG system was available, and some trials and demo sessions were conducted to validate functionality against requirements.
Figure 3 shows the main diagnosis output window of the ATSIG system, where a diagnosis has been computed for a carrier on the Aalborg site where a “Low EIRP” alarm has indicated that some problem is present. In order to diagnose the problem, the ATSIG system has retrieved three “symptom readings” shown in the “Evidence” panel. From these, a ranking of four specific root causes plus “Other Problem” and “No Fault” is made from the computation of their probability.

Figure 3: The ATSIG diagnosis window

The example clearly demonstrates how the ATSIG project has met Requirement 2 of “replicating human knowledge transfer”: Where the human language is rich and where you can stress that the problem “is VERY likely to be X because of …”, ATSIG replicates this by quantifying belief with probabilities and in addition monitoring the inputs (symptom readings). This also means that in some cases (like the presented example) more work is required by a human engineer to determine exactly which one, out of two or three root causes with high probability, is the true fault. This can be avoided by specifying many inputs with a high degree of information on specific root causes. A good set of inputs will automatically increase the average certainty of the diagnoses, such that the amount of manual interaction is kept at a minimum. The number of symptoms in this example (three) is much too low to provide value in real-life cases.
The knowledge base behind this example is defined using a special Model Maintenance Tool, where a troubleshooting expert has specified:
1. A set of root causes and a set of symptoms/inputs.
2. For each symptom:
a. SQL scripts for automatically retrieving the actual symptom readings/observations of the specific case
b. A mapping of symptom values to states (e.g. “low”, “normal”, “high”).
3. In order to compute the probabilities for the diagnosis:
a. The prior probability of the root causes (looking at all solved cases, how frequent each root cause is)
b. The conditional probability of each symptom given each related root cause
By enabling the user to specify this “domain model” we meet Requirement 3 in the previous section. In the ATSIG tool, the information is converted to a Bayesian network, which is a graphical representation of causal and probabilistic relations, and which allows for accurate computation of posterior probabilities of the root causes (what is the probability of X given that we have seen Y and Z?).

Figure 4: Another example model and data set with 22 inputs and 15 different root causes

The ATSIG system scales well although in general probabilistic reasoning has exponential complexity (poor scaling properties). The reason is that by using certain constraints on the model structure, the complexity is kept linear. Figure 4 shows a more realistic model than the one in Figure 3. This model is built for satellite control
diagnostics, and it has 15 possible root causes and 22 inputs.
Figure 5 shows the flow of data in the ATSIG system. The “Model” (the knowledge base) specifies to the ATSIG reasoning engine how the actual symptom readings (from alarms, events, and measurements) are coupled to root causes. The ATSIG reasoning engine produces diagnoses to the end users, and once the end user closes a case (by marking what the actual problem was), the model is updated according to this new piece of knowledge in an “adaptation” step using standard statistical calculations. This means that new knowledge is continuously captured and added to the model, which will improve over time, and this is also the direct solution to Requirement 4 in the previous section.

Figure 5: The ATSIG system data flow

The ATSIG project is still in “Phase 1”, where the system has been designed and the first available prototype has been demonstrated to a number of operators. In order to verify that Requirement 1 is met (diagnosis performance at least as good as a human expert), a large-scale trialing campaign is planned to be conducted in “Phase 2”, running until spring 2008.

Bayesian Networks
This section presents the fundamental technology of the ATSIG project, namely Bayesian networks. Bayesian networks have been used in a range of diagnostic applications ranging from medical decision support systems to printer and PC troubleshooting (Hewlett-Packard and Microsoft) to locomotive troubleshooting (General Motors). Wirtek has also recently used this technology in TheCure, a system for troubleshooting the radio access part of GSM and UMTS networks.
A Bayesian network models the random variables of a domain. Figure 6 shows a small Bayesian network, where three disease variables are present (Measles, Mumps, and Rubella). In addition, two symptom variables are also present (Fever and Spots). For simplicity, all variables could be assumed to have two states, “yes” and “no”. The model also contains causal links. A causal link means that the state of a variable has a causal impact on the state of another variable – e.g. Measles would have a causal impact on Fever and Spots whereas Mumps only impacts Fever.

Figure 6: A Bayesian network for diagnosing child diseases

In addition to the graphical structure, the Bayesian network has one (conditional) probability table per variable. The rule is that it should be the conditional probability of the variable given each of its parents (incoming links). Thus, in Figure 6 there are 5 probability tables, and e.g. the table of Spots will be: P(Spots | Measles, Rubella) (the conditional probability of Spots given Measles and Rubella). For variables without incoming links, the probability table is simply the prior probability of each state of the variable – e.g. P(Measles) = [yes: 0.01; no: 0.99].
Once the structure and conditional probabilities are specified, the Bayesian network is ready to be used by using a combination of Bayes’ rule and properties of conditional independencies. Basically, any observation of a set of variables can be fed into the Bayesian network and updated posterior probabilities can be computed for all other variables. Figure 7 shows an example where both Fever and Spots have been observed to be “yes”. The effect is that the probabilities of Measles and Rubella become high (the reason why Rubella scores highest is that Fever was set to have a slightly higher correlation with Rubella, compared to Measles).

Figure 7: The diagnostic example with Fever and Spots observed to be “yes”

Benefits of ATSIG to the Satcom Operator
The current ATSIG prototype solution has now been demonstrated to a number of Satcom operators, both to validate the proposed solution and to reveal additional requirements of the final solution. The initial feedback has been very positive.
Bo Hjorth Jensen, the head of technology in Emperion, a provider of broadband network solutions via satellite mainly active in Europe, Middle-East, and Africa, says: “A system that can both capture knowledge and automate our processes will bring a lot of value to my organization. We will be less sensitive to the right people being at work, and we will be able to satisfy customers even better than today.”
Based on the operator feedback, the following list of main benefits has been identified:
1. Saving time and resources
2. Decreased time to solve problems
3. Better customer satisfaction
4. Less sensitivity to human knowledge
5. Knowledge is captured and stored

Summary
Although the ATSIG project is still not completed, it has already, in the prototyping phase, provided demonstration results that indicate a great potential of using probabilistic reasoning technology/Bayesian networks for automation of the troubleshooting process. And not only will the toolset help speed up and standardize the troubleshooting process in the operator organization, it will also secure knowledge within the organization and not leave the operator suffering if key personnel go on vacation or decide to leave.
In Phase 2 of the project (autumn 2007 – spring 2008), the ATSIG toolset needs to be thoroughly validated in cooperation with operators, and the usability and integration properties must be further developed in order to provide an “off-the-shelf” troubleshooting product for Satcom operators.
The project progress can be followed at www.atsig-project.org, and further information can be acquired by contacting Lars Moltsen from Wirtek.

Lars Moltsen (+45) 25 21 46 35 or [email protected]
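
The Fever/Spots diagnosis described in the Bayesian Networks section can be illustrated with a short Python sketch. It is not the ATSIG implementation. Only P(Measles = yes) = 0.01 comes from the article. The other priors, the noisy-OR form of the symptom tables, the assumption that Fever depends on all three diseases, and every strength and leak value are assumptions chosen so that Fever is slightly more strongly linked to Rubella than to Measles. Posteriors are computed by plain enumeration over the disease states, which is feasible only because the network is tiny.

```python
from itertools import product

DISEASES = ("Measles", "Mumps", "Rubella")

# Prior P(disease = yes). Only the Measles value is from the article;
# the other two are assumed to be equal to it.
PRIOR = {"Measles": 0.01, "Mumps": 0.01, "Rubella": 0.01}

# Assumed noisy-OR parameters standing in for the conditional probability
# tables of Figure 6: each present disease independently triggers the symptom
# with its "strength"; "leak" covers causes that are not modelled.
SYMPTOMS = {
    "Fever": {"leak": 0.05,
              "strength": {"Measles": 0.80, "Mumps": 0.70, "Rubella": 0.85}},
    "Spots": {"leak": 0.01,
              "strength": {"Measles": 0.90, "Rubella": 0.90}},
}


def p_symptom_yes(symptom, present):
    """P(symptom = yes | disease states) under the assumed noisy-OR model."""
    spec = SYMPTOMS[symptom]
    p_no = 1.0 - spec["leak"]
    for disease, strength in spec["strength"].items():
        if present[disease]:
            p_no *= 1.0 - strength
    return 1.0 - p_no


def posterior(evidence):
    """Return P(disease = yes | evidence) for every disease.

    evidence maps a symptom name to True ("yes") or False ("no").
    Symptoms without a reading are left out and are thereby marginalised.
    """
    weights = {}
    for states in product((True, False), repeat=len(DISEASES)):
        present = dict(zip(DISEASES, states))
        weight = 1.0
        for disease in DISEASES:
            weight *= PRIOR[disease] if present[disease] else 1.0 - PRIOR[disease]
        for symptom, observed in evidence.items():
            p_yes = p_symptom_yes(symptom, present)
            weight *= p_yes if observed else 1.0 - p_yes
        weights[states] = weight
    total = sum(weights.values())
    return {
        disease: sum(w for states, w in weights.items() if states[i]) / total
        for i, disease in enumerate(DISEASES)
    }


both_observed = posterior({"Fever": True, "Spots": True})
fever_only = posterior({"Fever": True})  # the Spots reading is missing
no_evidence = posterior({})
```

With these assumed numbers, `both_observed` ranks Rubella just ahead of Measles and leaves Mumps unlikely, which matches the qualitative behaviour described for Figure 7. `fever_only` shows that a missing symptom reading does not block the computation, since the unobserved symptom is simply left out of the product.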

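
The “adaptation” step described with Figure 5 is characterised in the article only as “standard statistical calculations”. One common choice, assumed here, is to re-estimate the prior and conditional probabilities from counts of closed cases, with a pseudo-count so that unseen combinations never get probability zero. The class name, method names, root-cause names and the `eirp` symptom key below are hypothetical placeholders, not part of the ATSIG toolset.

```python
from collections import defaultdict


class AdaptiveModel:
    """Count-based estimates of P(root cause) and P(symptom state | root cause)."""

    def __init__(self, root_causes, symptom_states, pseudo_count=1.0):
        self.root_causes = list(root_causes)
        self.symptom_states = dict(symptom_states)  # symptom -> list of states
        self.alpha = pseudo_count                   # assumed smoothing constant
        self.cause_counts = defaultdict(float)
        self.reading_counts = defaultdict(float)    # (symptom, state, cause) -> count

    def close_case(self, actual_cause, readings):
        """Record a solved case: the confirmed root cause plus its readings."""
        self.cause_counts[actual_cause] += 1
        for symptom, state in readings.items():     # missing readings are just absent
            self.reading_counts[(symptom, state, actual_cause)] += 1

    def prior(self, cause):
        """Smoothed relative frequency of a root cause among closed cases."""
        total = sum(self.cause_counts.values())
        return (self.cause_counts[cause] + self.alpha) / (
            total + self.alpha * len(self.root_causes))

    def conditional(self, symptom, state, cause):
        """Smoothed estimate of P(symptom = state | cause)."""
        states = self.symptom_states[symptom]
        seen = sum(self.reading_counts[(symptom, s, cause)] for s in states)
        return (self.reading_counts[(symptom, state, cause)] + self.alpha) / (
            seen + self.alpha * len(states))


model = AdaptiveModel(["cause_a", "cause_b"], {"eirp": ["low", "normal", "high"]})
uniform_prior = model.prior("cause_a")  # no cases closed yet
for readings in ({"eirp": "low"}, {"eirp": "low"}, {"eirp": "normal"}):
    model.close_case("cause_a", readings)
model.close_case("cause_b", {})         # case closed without any reading
```

After these four closed cases the estimated prior of `cause_a` rises from 1/2 to 4/6, and P(eirp = low | cause_a) becomes 3/6. This is the sense in which the model improves as cases are closed.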
