Atsig Article Quantum Aug07
Today most of the troubleshooting process is manual. ATSIG, a research project headed by
Danish software provider Wirtek in cooperation with Siemens Austria, has now been defined
under a cooperation agreement with the European Space Agency to develop a concept for
automating troubleshooting processes.

Running a Satcom network is quite a challenge: a large number of pieces of equipment, many of them in extremely remote locations, have to work together to satisfy the customer, who expects a highly reliable end-to-end data connection. However, equipment can break down or be incorrectly set up, a thunderstorm can cause poor signal quality to or from a ground station, and external radio sources can interfere with the carriers. These examples illustrate the complexity that the Satcom operator faces.

The Troubleshooting Process

The Satcom troubleshooting process is similar to troubleshooting in many other domains. It can be decomposed into the following three parts:
1. Fault Detection: Identifying that some part of the network has a problem (without yet knowing what the problem is).
2. Diagnosis Creation: Investigating the problem to identify the root cause (what is the problem?).
3. Solution Deployment: Once the root cause of the problem has been identified, a solution has to be deployed onto the system.

Automation in the Satcom operator troubleshooting process is typically limited to the Fault Detection part, in which alarm systems and automated scripts normally help monitoring staff detect when the network behavior is abnormal. The rest of the process relies on human experience and knowledge to couple alarms, measurements, and other pieces of information in order to identify possible root causes and to carry out solution deployment. The level of automation is shown in Figure 1.

Figure 1: Current Satcom operator troubleshooting process

The Diagnosis Creation task has not been successfully automated because of system complexity: it is not possible to use typical diagnostic solutions like rule-based systems or decision trees, since the level of uncertainty is simply too high for these “deterministic” techniques to work. Too often, an input/symptom is not available, or due to randomness it is “on” when it would normally be “off”. In such cases, a decision tree will either get stuck (missing value) or follow a wrong branch and come up with a wrong conclusion (random behavior).

The ATSIG project aims to automate the Diagnosis Creation task by using a technique from the artificial intelligence (AI) domain known as Bayesian networks. This technique is very robust towards missing input data and has a built-in method for handling randomness. It has proven quite useful for other complex diagnostic purposes, including medical diagnostics and the troubleshooting of computers, printers, terrestrial radio networks, and huge electrical locomotives. The resulting solution in ATSIG will increase the level of automation in the troubleshooting process to also cover Diagnosis Creation, as shown in Figure 2.

Figure 2: Troubleshooting process using the ATSIG auto-diagnosis concept

Solution Requirements

In order to automate the Diagnosis Creation task, it is necessary to analyze current processes and define a number of requirements that the solution should meet. The following list has been identified:
1. The system shall be able to diagnose a selected subset of root causes with a similar (or better) quality compared to a human troubleshooter given the same input.
   This is the basic requirement of an automated diagnosis system. Given some input in the form of a set of symptom readings, the system shall be able to provide an output which is at least as good as what the human troubleshooter could achieve with the same input. However, it makes sense to limit the number of possible faults/root causes that the system can handle to a subset covering somewhere in the order of 90-99% of all troubleshooting cases. The reason is that the remaining root causes are rare, so the time spent modeling them in the system may not be worth the effort, as long as the system is able to report that the problem is something other than the covered root causes.
2. The system shall replicate human knowledge transfer.
   Imagine a human troubleshooter who is working to solve a troubleshooting case and has already analyzed the problem based on the immediately available symptom readings. Then, for some reason, he needs to go home and hands over the case to a colleague. The handover will typically consist of a statement like: “My analysis shows that the most likely problem is X because of …, but it could also be Y due to …”. The automated diagnosis system should be able to replicate this type of communication.
3. Human experts shall be enabled to store knowledge about diagnosis creation.
   Since the key knowledge about how to identify the root causes is today in the brains of the Satcom operator troubleshooting engineers, the system should provide an easy means for capturing such knowledge in a knowledge base/model.
4. New knowledge shall be stored and used to improve accuracy.
   Once the automated diagnosis system has been put into use, a lot of new cases will become available where both the symptom readings and eventually the actual root cause will be known. Such data is new knowledge/experience, which should be used to improve the knowledge base.

The proposed solution

In April 2007, the first prototype of the ATSIG system was available, and some trials and demo sessions were conducted to validate functionality against requirements.

Figure 3 shows the main diagnosis output window of the ATSIG system, where a diagnosis has been computed for a carrier at the Aalborg site where a “Low EIRP” alarm has indicated that some problem is present. In order to diagnose the problem, the ATSIG system has retrieved three “symptom readings”, shown in the “Evidence” panel. From these, four specific root causes plus “Other Problem” and “No Fault” are ranked by their computed probability.

The example clearly demonstrates how the ATSIG project has met Requirement 2 of “replicating human knowledge transfer”: where human language is rich and lets you stress that the problem “is VERY likely to be X because of …”, ATSIG replicates this by quantifying belief with probabilities and, in addition, monitoring the inputs (symptom readings). This also means that in some cases (like the presented example) more work is required by a human engineer to determine exactly which one, out of two or three root causes with high probability, is the true fault. This can be avoided by specifying many inputs with a high degree of information on specific root causes. A good set of inputs will automatically increase the average certainty of the diagnoses, such that the amount of manual interaction is kept at a minimum. The number of symptoms in this example (three) is much too low to provide value in real-life cases.

The knowledge base behind this example is defined using a special Model Maintenance Tool, where a troubleshooting expert has specified:
1. A set of root causes and a set of symptoms/inputs.
2. For each symptom:
   a. SQL scripts for automatically retrieving the actual symptom readings/observations of the specific case
   b. A mapping of symptom values to states (e.g. “low”, “normal”, “high”).
3. In order to compute the probabilities for the diagnosis:
   a. The prior probability of the root causes (looking at all solved cases, how frequent each root cause is)
   b. The conditional probability of each symptom given each related root cause

By enabling the user to specify this “domain model” we meet Requirement 3 in the previous section. In the ATSIG tool, the information is converted to a Bayesian network, which is a graphical representation of causal and probabilistic relations, and which allows for accurate computation of posterior probabilities of the root causes (what is the probability of X given that we have seen Y and Z?).
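
The diagnosis computation described in this article (prior probabilities for the root causes, conditional probabilities for each symptom state, and a ranked list of posterior probabilities) can be illustrated with a minimal Python sketch. It is not the ATSIG implementation. It assumes the simplest possible network structure, a single root-cause node whose symptoms are conditionally independent given the cause, and every name and number in it (the root causes other than “Other Problem” and “No Fault”, the symptoms tx_power and rx_quality, and all probabilities) is hypothetical. The posterior is obtained as P(cause | evidence) ∝ P(cause) × Π P(symptom state | cause), and a symptom without a reading is simply left out of the product, which is how this structure tolerates missing inputs where a decision tree would get stuck.

```python
# Hypothetical domain model: all names and numbers are illustrative only
# and are not taken from the ATSIG knowledge base.

# Prior probability of each root cause (must sum to 1).
PRIORS = {
    "Amplifier Fault": 0.10,
    "Rain Fade": 0.20,
    "Other Problem": 0.05,
    "No Fault": 0.65,
}

# P(symptom state | root cause), one table per symptom.
CONDITIONALS = {
    "tx_power": {
        "Amplifier Fault": {"low": 0.80, "normal": 0.20},
        "Rain Fade":       {"low": 0.10, "normal": 0.90},
        "Other Problem":   {"low": 0.50, "normal": 0.50},
        "No Fault":        {"low": 0.05, "normal": 0.95},
    },
    "rx_quality": {
        "Amplifier Fault": {"low": 0.60, "normal": 0.40},
        "Rain Fade":       {"low": 0.90, "normal": 0.10},
        "Other Problem":   {"low": 0.50, "normal": 0.50},
        "No Fault":        {"low": 0.05, "normal": 0.95},
    },
}


def diagnose(evidence):
    """Return (root cause, posterior) pairs, most probable first.

    `evidence` maps a symptom name to its observed state. Symptoms that
    are absent or set to None are skipped, which marginalizes them out
    in this simple single-cause structure.
    """
    scores = {}
    for cause, prior in PRIORS.items():
        score = prior
        for symptom, state in evidence.items():
            if state is None:  # reading not available
                continue
            score *= CONDITIONALS[symptom][cause][state]
        scores[cause] = score

    # Normalize so that the posteriors sum to 1.
    total = sum(scores.values())
    posteriors = {cause: s / total for cause, s in scores.items()}
    return sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # One reading is available, the other is missing.
    for cause, p in diagnose({"tx_power": "low", "rx_quality": None}):
        print(f"{cause:16s} {p:.2f}")
```

With only one of the two readings available, the sketch still produces a full ranking. Supplying the second reading sharpens it, which mirrors the point made above that a richer set of inputs increases the certainty of the diagnosis.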
QUANTUM
AUGUST 2007 23