A New Approach For Testing Voice Quality


sQLEAR Q&A
Content
Why sQLEAR?
What is sQLEAR?
What techniques does sQLEAR use?
What is the sQLEAR learning and evaluation process?
Which parameters does sQLEAR use?
How does sQLEAR work in the field?
Does sQLEAR work for different languages?
What is the accuracy of sQLEAR?
How can sQLEAR be compared with other solutions?
What are the differences versus other voice QoE solutions?
What are the similarities with other voice QoE solutions?
More questions?
WHITE PAPER

Why sQLEAR?
To meet the needs of today’s evolving mobile networks,
there is a growing need for flexible, real-time, automated
QoE-centric service evaluation, troubleshooting, and
optimization. This has been driven by a number of
factors. First, the volume of 4G subscribers has grown
dramatically. Second, the range of 4G services has also
increased. Finally, 5G network rollout is underway, bringing
significantly increased network complexity, an even greater
number and a larger variety of devices as well as more
service diversity.

It is well known that mobile video consumption has exploded. Today, video services represent about 70% of mobile data usage, and they are expected to grow further with 5G. Despite the outstanding success of mobile video services, it is often overlooked that voice services still deliver approximately 70% of Mobile Network Operator (MNO) revenue. However, this valuable revenue is challenged by two intertwined trends: the sustained expansion of OTT voice services, and the fast increase in the number of VoLTE subscriptions, along with VoLTE support on new types of devices such as smartwatches and Cat-M1-capable Internet of Things (IoT) chipsets. In addition, VoLTE technology will enable the 5G voice solution, while representing the base for interoperable consumer and enterprise communication services on different devices across LTE, Wi-Fi and 5G.

The continued strong competition from OTT voice applications is predicted to grow exponentially with 5G deployments, presenting a persistent threat to MNOs’ voice service revenue. The MNOs’ most powerful counter to this threat is the expansion of their VoLTE services. The GSA report of April 2018 states that VoLTE has now been launched in more than 145 networks in over 70 countries across all regions. The findings presented in the Ericsson Mobility Report of June 2018 show that, at the end of 2017, VoLTE subscriptions exceeded 610 million. Ericsson’s findings also project that the number of VoLTE subscriptions will reach 5.4 billion by the end of 2023, accounting for around 80 percent of combined LTE and 5G subscriptions. Figure 1 shows the VoLTE subscription prediction per region.

Figure 1. Prediction on VoLTE subscriptions expansion

As a result, MNOs must not only support the expected growth in video traffic, but they must also protect their existing voice revenues and eventually increase these through the expansion of VoLTE services to consumer and enterprise communications supported on a variety of new devices. The performance and quality of voice services as experienced by users is a significant factor in ensuring that they continue to use MNO voice services rather than OTT alternatives.

Maintaining the voice service Quality of Experience (QoE), reducing subscriber churn to OTT providers, and growing voice revenue through VoLTE expansion, while optimizing CAPEX/OPEX, are therefore key concerns for MNOs.

Obtaining accurate, easy-to-implement and controlled voice QoE predictors, as well as securing the ability to act on these in real time, are thus crucial for enabling cost-efficient, optimized network operations that will meet and maintain customer expectations and demands.

Infovista’s sQLEAR voice QoE predictor is a new and unique solution, specifically designed to answer these concerns and goals, thus benefiting MNOs and regulators alike. It provides cost-effective and accurate evaluation of voice quality trends and enables predictions of future QoE delivery. In addition, it can be used to perform monitoring, benchmarking, and troubleshooting of MNO voice services. sQLEAR ensures operational efficiency for MNOs through effective troubleshooting of IP networking and the underlying transport layers.


What is sQLEAR?
The voice QoE evaluation algorithms that are standardized by ITU-T and currently available can be perceptual or parametric. Perceptual algorithms are based on human perception and cognition models, and are either intrusive or non-intrusive. Perceptual intrusive algorithms (ITU-T P.863) use test (or reference) speech samples sent over the network as well as the resulting degraded speech sample in order to provide a QoE score, which is an estimate of the subjectively perceived voice quality as defined by MOS (Mean Opinion Score, ITU-T P.800, P.800.1, P.800.2). Perceptual non-intrusive algorithms (ITU-T P.563) do not require a reference speech sample; they use only the resulting degraded speech sample in order to estimate MOS. The parametric voice QoE algorithms (ITU-T P.564) use only network parameters to estimate MOS, and therefore these solutions are non-intrusive. In addition to the different models and approaches used by these algorithms, it is important to mention that only ITU-T P.863 can be used for evaluating VoLTE scenarios; neither P.563 nor P.564 has been designed for VoLTE.

sQLEAR has been coined as an abbreviation of Speech Quality (by machine) LEARning. It is an algorithm which predicts the impact on the perceived voice quality that results from IP transport and underlying transport (packet-based radio and core network parameters), as well as from the codec and jitter buffer in the end-user voice client (with consideration of codec/client parameters). In addition, the sQLEAR algorithm is machine learning based and uses the speech reference sample(s) as input, but not the resulting degraded speech samples. sQLEAR is therefore the first parametric intrusive voice QoE evaluation algorithm. The parametric intrusive approach gives sQLEAR superior accuracy compared to existing parametric non-intrusive (ITU-T P.564) and perceptual non-intrusive (ITU-T P.563) solutions, as well as competitive performance against intrusive perceptual solutions (ITU-T P.863).

sQLEAR does not use the degraded speech sample, and therefore does not use the audio path for QoE prediction. This confers the advantage of avoiding device-specific degradations and characteristics, such as background noise, automatic gain control, voice enhancement techniques, and frequency response. Therefore, sQLEAR enables operators to focus cost-efficiently on network performance rather than on device-specific performance.

The sQLEAR output is expressed in terms of MOS (as defined by ITU-T P.800.1, MOS-LQO), and it represents the first outcome of ongoing activities in the P.VSQMTF (Voice Service Quality Monitoring and Troubleshooting Framework) work item of ITU-T Study Group 12.

Designed for current and future voice services, sQLEAR can be used for the evaluation of VoLTE services that employ the High Definition (HD) Enhanced Voice Service (EVS) codec and client, including the Channel Aware (CA) and Inter-Operability (IO) modes. The IO mode ensures backwards compatibility with the AMR codec.

What techniques does sQLEAR use?
sQLEAR is based on several key factors. These include
the transmitted speech reference; transport protocol
information, such as jitter and packet loss; and codec
information, which includes rate and channel-aware mode
(EVS codec case). The prediction algorithm uses deep
packet inspection (DPI) to obtain relevant information. This
means that the impact of the network on voice QoE can
be determined without the necessity of recording actual
speech content. The time characteristics of the reference
signal are used to identify the importance of individual
sections of the bitstream in regard to speech quality. This
offers the advantage of being able to take into account the
real voice signal after the jitter buffer.
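
As a rough illustration of the kind of transport information mentioned above, the sketch below derives a packet loss ratio and a jitter estimate from a captured RTP stream. It is not the sQLEAR feature set, which this paper does not specify. The function names and the millisecond time base are assumptions made for the example, and the jitter formula is the running estimate defined in RFC 3550 (which works in RTP timestamp units rather than milliseconds).

```python
from typing import List, Tuple


def loss_ratio(seq_numbers: List[int]) -> float:
    """Fraction of packets missing between the lowest and highest sequence
    number seen. Sequence-number wrap-around is ignored in this sketch."""
    expected = max(seq_numbers) - min(seq_numbers) + 1
    return 1.0 - len(set(seq_numbers)) / expected


def interarrival_jitter(packets: List[Tuple[float, float]]) -> float:
    """Running jitter estimate in the style of RFC 3550.

    Each packet is (send_time_ms, arrival_time_ms), in sending order.
    The result is in milliseconds.
    """
    jitter = 0.0
    for (s_prev, r_prev), (s_cur, r_cur) in zip(packets, packets[1:]):
        # Difference between arrival spacing and sending spacing.
        d = (r_cur - r_prev) - (s_cur - s_prev)
        # Exponential smoothing with gain 1/16, as in RFC 3550.
        jitter += (abs(d) - jitter) / 16.0
    return jitter
```

A stream whose packets all arrive with the same spacing they were sent with yields a jitter of zero, while bursts of late packets push the estimate up.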


Figure 2. sQLEAR concept: the reference voice sample, the RTP/IP packet stream, and EVS codec/client information feed a machine learning MOS predictor, which outputs a network-centric QoE prediction.

sQLEAR uses state-of-the-art machine learning to build a model that describes the speech quality perceived by users based on all of these information resources (Figure 2).

The machine learning techniques offer three clear advantages. First, the complexity of the inter-dependencies between all network/codec/client parameters, as well as their significance in impacting the speech quality, is better described and processed by machine learning algorithms than by the multi-dimensional optimization techniques required for the estimation of new coefficients of multi-variable non-linear functions, which are generally used for parametric voice QoE evaluation algorithms. Second, whenever changes arising from the introduction of new codecs/clients need to be accounted for, machine learning techniques are more flexible and quicker to tune. This provides a significant advantage from the perspective of implementing the algorithm and ensuring operational efficiency. Third, there is no need for additional calibration to the MOS scale using first or third order polynomials, because the machine learning based algorithm “learns” the precise MOS scale that it needs to predict.

What is the sQLEAR learning and evaluation process?
sQLEAR has been developed using a simulator that is based on 3GPP standardized code for both the EVS VoIP client and the EVS voice codec. The simulations are performed at the IP level, and are based on knowledge obtained from the analysis of real-life field data, collected during a significant number and broad diversity of drive tests in different locations and conditions, and in a number of MNO networks. The standardized EVS VoIP client ensures that all devices with embedded EVS exhibit the same behavior. As a result, sQLEAR is completely transparent to and independent of the devices used in testing.

Figure 3 depicts the simulation chain. A reference audio file is injected into the simulation process, and coded with different EVS codec settings for bandwidth, codec rate and channel-aware mode. The resulting VoIP file contains the coded audio packets together with an ideal arrival time that increases by 20 ms for each packet. Network errors, in the form of jitter and packet loss patterns, are applied to the coded audio to simulate degradations that may occur in an all-IP network.

To simulate the jitter and packet loss behavior of a radio and core network, jitter files are created by using a combination of simulations and drive test data. A large set of databases, spanning approximately 120,000 samples and covering a broad range of conditions that generate voice degradations across the entire voice quality range, has been generated. These conditions include:

• “Live (drive test) data modulated with simulations”, to broaden the range of conditions (e.g. randomizing the live degradation’s position and amplitude);

• “Gilbert burst packet loss and burst jitter up to 30%”, to mimic error cases that are observed in live drive test data, for example during handover, when packets are buffered and then suddenly released;

• “Gilbert severe burst jitter to 70%”, to improve the learning and testing of large jitter cases, which are the most difficult to predict;

• “Random packet loss and random jitter”, to handle reordering of packets; and

• “Manually designed packet loss”, to simulate mobile devices that move in and out of coverage, which results in long and short runs of consecutive packet loss.

By applying the jitter files and simulating network degradations, EVS frames are removed when there is packet loss, and the arrival times of the frames are changed according to the jitter file. The new jittered VoIP file is then submitted to the EVS jitter buffer. It is decoded and time scaled, which produces a degraded audio file. Finally, the degraded audio file is graded using ITU-T P.863, which compares it to the original reference, resulting in a MOS score.

Figure 3. Simulation chain (reference audio, EVS coding, network (radio and core) applying packet loss and jitter, EVS jitter buffer with decoding and time scaling, ITU-T P.863, MOS)

As a result, each simulated jitter file that describes a network condition has a corresponding degraded audio file and an associated MOS score. The 120,000 samples represent the databases used for sQLEAR learning and evaluation, with a 50%-50% split, as recommended by current academic research in machine learning (see “Handbook of Statistical Analysis and Data Mining Applications”, Elsevier, 2009).

sQLEAR uses a combination of two categories of machine learning algorithm: bagged decision trees and support vector machines (SVM). These proved to provide the best performance (for correlation and prediction error) when compared to ITU-T P.863, a point addressed under the question of accuracy later in this paper.

Which parameters does sQLEAR use?
As a machine learning-based technique, sQLEAR takes as inputs features which are aggregated from basic network parameters. sQLEAR uses machine learning both for the creation and selection of features, and for the QoE prediction itself, which in turn is based on the selected features.

There are two sources of features. The first is the information derived from the RTP stream generated by the simulated jitter buffer implementation. The second is statistical measures built from the RTP stream, which proved through extensive testing to have a significant impact on the accuracy of the algorithm in comparison to ITU-T P.863.

There are a number of factors that play an important role in the MOS estimation process. These include speech content and frequency, and the duration of silence as well as its distribution within voice samples. These have a significant impact on the performance of sQLEAR and, as a result, on the accuracy of QoE prediction. Therefore, to improve sQLEAR performance further, audio reference-based features are also used.

These features are weighted based on the location of the feature (“position-based feature”) in the reference voice sample. The position is described either by a feature giving the position of, for example, a dip in packet loss, or by weighting the number of frame erasures. The most successful weighting function proved to be the rms (root mean square) of each 20 ms voice frame in the reference voice. It should be noted that information on the reference voice is used, and not the recording.
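
The rms weighting of frame erasures described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the sQLEAR feature definition: the function names, the normalization by total rms, and the representation of the reference as a plain list of samples are all hypothetical. Only the 20 ms frame length and the use of per-frame rms of the reference voice come from the text.

```python
import math

FRAME_MS = 20  # frame length named in the text


def frame_rms(samples, sample_rate_hz, frame_ms=FRAME_MS):
    """Return the rms of each complete frame of the reference signal.
    A trailing partial frame is ignored."""
    frame_len = sample_rate_hz * frame_ms // 1000
    n_frames = len(samples) // frame_len
    rms = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return rms


def weighted_erasure_ratio(rms, erased):
    """Share of the summed reference rms that falls on erased frames.

    erased[i] is True when frame i was lost. The normalization is an
    assumption of this sketch. Returns 0.0 for an all-silent reference.
    """
    total = sum(rms)
    if total == 0:
        return 0.0
    return sum(w for w, lost in zip(rms, erased) if lost) / total
```

With this kind of weighting, an erasure that falls on a silent frame of the reference contributes nothing, while an erasure on a loud speech frame contributes in proportion to its rms.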

The weighting process also does not involve any perceptual processing of the voice sample. Consequently, sQLEAR does not depend on the audio path.

In order to ensure that the jitter files are independent of codec and reference voice sample, and to simplify feature creation, sQLEAR performs two pre-processing operations: DTX cleaning and the addition of codec information (e.g. rate, mode). The output of the pre-processing is a new, “DTX cleaned” jitter file, which also contains the codec audio payload size and Channel Aware mode data (see Figure 4).

The DTX cleaning pre-processing handles the fact that DTX periods, which occur during silence, do not impact the perceived voice quality. Consequently, the pre-processing operation creates a new jitter file with no packet loss and no jitter during DTX periods, which greatly simplifies feature creation.

The addition of codec information is the second pre-processing step. The codec information is added to the “DTX cleaned” jitter file. It consists of the audio payload size and the Channel Aware mode indication. The packet payload size indirectly provides the codec rate and indicates whether DTX was used. This information is given for every packet, since it can change from one packet to the next.

To summarize, the pre-processing handles codec-specific operations, such as DTX, codec rate, and Channel Aware mode, and consequently ensures that the jitter files are independent of both codec and reference voice sample.

How does sQLEAR work in the field?
As a QoE predictor of MOS, sQLEAR has to meet ITU-T requirements for test set-up, run-time, and/or measurement [P.863.1]. However, in order to ensure the best performance of the predictor against ITU-T P.863, some specifics need to be considered.

Reference voice samples. The machine learning-based QoE predictor has been trained using one or more reference files. Therefore, during run-time, when the algorithm is deployed, the same reference file(s) must be injected into the network under test. The measurement methodology and the requirements for reference file(s) follow the ITU-T P.863/P.863.1 specifications.

Pre-processing during run-time. The pre-processing phase is performed in the same way as in simulations, as described above. In addition, during run-time, the pre-processing synchronizes the reference voice and the IP/RTP stream to ensure that the position-based features reflect the reference voice sample(s). This is performed by correlating the pattern of DTX and voice frames with the reference voice sample, which is stored at the receiving side. It should be noted that no recorded voice is needed.

Measurement procedure. sQLEAR uses the same test set-up as ITU-T P.863, but without the need for the recorded degraded voice sample, as has already been explained. The recorded speech can nevertheless be saved for further off-line analysis, if needed or desired.

The test set-up and run-time / measurement scheme is illustrated in Figure 5.
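
The run-time synchronization step, which correlates the pattern of DTX and voice frames in the received stream with the stored reference, could look roughly like the sketch below. The paper does not describe the actual correlation method, so the brute-force agreement search, the function names, and the payload-size threshold used to tell DTX frames from voice frames are all assumptions made for illustration.

```python
from typing import List, Sequence


def activity_from_payload_sizes(payload_bytes: Sequence[int],
                                dtx_max_payload_bytes: int) -> List[int]:
    """Mark each received frame as voice (1) or DTX/silence (0).

    The threshold is a hypothetical parameter: frames whose payload is
    no larger than it are treated as DTX frames.
    """
    return [1 if size > dtx_max_payload_bytes else 0 for size in payload_bytes]


def best_alignment(reference_active: Sequence[int],
                   received_active: Sequence[int],
                   max_shift: int) -> int:
    """Return the frame shift in [-max_shift, max_shift] for which the
    received voice/DTX pattern agrees best with the reference pattern.

    A positive result means the received stream lags the reference.
    Ties are resolved in favour of the smallest shift.
    """
    best_shift, best_score = 0, -1
    for shift in range(-max_shift, max_shift + 1):
        score = 0
        for i, ref in enumerate(reference_active):
            j = i + shift
            if 0 <= j < len(received_active) and received_active[j] == ref:
                score += 1
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift
```

Because only the voice/DTX pattern is compared, a search of this kind needs the packet stream and the stored reference, but no recorded audio.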

Figure 4. Jitter file pre-processing process: the jitter file (packet loss, jitter) is pre-processed into a “DTX cleaned” jitter file (no packet loss or jitter during DTX, payload size added to each packet) before it is applied to the VoIP file of EVS coded packets, which yields a VoIP file with some packets missing and changed time stamps.
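
The two pre-processing operations shown in Figure 4 can be sketched as below. The jitter file format is not given in this paper, so the JitterEntry record, its field names, and the per-packet input lists are hypothetical. The sketch only mirrors the two steps named in the text: clearing loss and jitter during DTX periods, and attaching payload size and Channel Aware mode to every packet.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class JitterEntry:
    """One packet of a jitter file (hypothetical layout)."""
    lost: bool                   # packet dropped by the network
    delay_ms: float              # extra delay relative to the ideal arrival time
    payload_bytes: int = 0       # filled in by pre-processing
    channel_aware: bool = False  # filled in by pre-processing


def dtx_clean(entries: Sequence[JitterEntry],
              is_dtx: Sequence[bool],
              payload_sizes: Sequence[int],
              ca_flags: Sequence[bool]) -> List[JitterEntry]:
    """Return a new 'DTX cleaned' jitter file with codec information added."""
    cleaned = []
    for entry, dtx, size, ca in zip(entries, is_dtx, payload_sizes, ca_flags):
        if dtx:
            # Step 1: degradations during silence do not affect perceived
            # quality, so loss and jitter are cleared for DTX packets.
            entry = replace(entry, lost=False, delay_ms=0.0)
        # Step 2: attach the codec information to every packet.
        cleaned.append(replace(entry, payload_bytes=size, channel_aware=ca))
    return cleaned
```

The input entries are left untouched and a new list is returned, which keeps the original jitter file available for comparison.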


Figure 5. sQLEAR run-time scheme: the reference audio is sent through the network and devices; the received IP stream (arrival time, sequence number, size) is pre-processed (DTX-clean, synchronize) into a DTX cleaned jitter file, from which features are calculated and passed to the pretrained ML model to obtain the predicted MOS.
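
As an illustration of the first stage of the run-time scheme, the sketch below turns a captured IP stream into per-packet loss and delay values. It assumes, hypothetically, that a jitter file can be represented as one (lost, delay_ms) pair per sent packet, and that delay is measured against an ideal schedule of one packet every 20 ms anchored at the earliest relative arrival. The actual sQLEAR jitter file format is not described in this paper.

```python
from typing import Dict, List, Tuple

FRAME_MS = 20.0  # ideal spacing between consecutive packets


def build_jitter_file(arrivals_ms: Dict[int, float],
                      n_sent: int,
                      frame_ms: float = FRAME_MS) -> List[Tuple[bool, float]]:
    """Map sequence number -> arrival time (ms) to (lost, delay_ms) entries.

    Sequence numbers are assumed to run from 0 to n_sent - 1; a sequence
    number missing from arrivals_ms is treated as a lost packet.
    """
    if not arrivals_ms:
        return [(True, 0.0)] * n_sent
    # Anchor the ideal schedule at the packet that arrived earliest
    # relative to its nominal send slot.
    baseline = min(t - seq * frame_ms for seq, t in arrivals_ms.items())
    entries = []
    for seq in range(n_sent):
        if seq in arrivals_ms:
            delay = arrivals_ms[seq] - seq * frame_ms - baseline
            entries.append((False, delay))
        else:
            entries.append((True, 0.0))
    return entries
```

For example, a stream in which packet 2 never arrives and packet 3 arrives 15 ms late produces a lost entry followed by an entry with a 15 ms delay.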

Does sQLEAR work for different languages?
sQLEAR has been trained and validated on British and American English. However, as previously mentioned, one of the key strengths of the sQLEAR algorithm is that it can easily be trained and adapted. Therefore, it is a simple process to add different languages. This is a valuable feature when considering MNOs with multi-national footprints, which need to optimize costs at a group level.

sQLEAR does not use the audio path, but rather only the time structure of the reference signal, both for identifying the importance of individual sections of the bitstream in regard to speech quality and for the creation of reference-based machine learning features. The addition of new languages requires only a brief temporal analysis of the reference voice file and the subsequent learning of the algorithm.

What is the accuracy of sQLEAR?
As the first delivery of ongoing work in the ITU-T P.VSQMTF “Voice service quality monitoring and troubleshooting framework for parametric intrusive voice QoE prediction” work item, sQLEAR meets ITU performance requirements. Figure 6 shows a correlation of more than 96% and prediction errors (rmse) lower than 0.26 MOS across all evaluation databases (60,000 samples). These performance values are recognized as describing a high accuracy when considering the large number of learning and evaluation data points and the difficulty created by the severity of the network conditions, and when compared with other existing ITU-T non-perceptual (e.g. P.564) and/or ITU-T perceptual non-intrusive (e.g. ITU-T P.563) voice QoE metrics.

Figure 6. sQLEAR accuracy

How can sQLEAR be compared with other solutions?
It is already well known in the technical community that develops QoE algorithms (and equally well defined by ITU-T) that a direct comparison between two different QoE metrics is not valid, especially when the two algorithms are designed based on different approaches and/or describe different voice quality aspects.

This is also the case for sQLEAR, which is based on a completely different approach from any other available voice QoE measurement technique, such as ITU-T P.863, ITU-T P.563 and ITU-T P.564. It is not possible to compare directly an sQLEAR score, which is independent of the audio path and therefore transparent to the device’s voice frequency characteristics, with an ITU-T P.863 (or an ITU-T P.563) score, which works on the voice signal (audio path) and thus embeds the device’s performance. Neither can an sQLEAR score be directly compared with an ITU-T P.563 or ITU-T P.564 score, since these two solutions have not been designed for VoLTE scenarios, much less for HD voice (EVS codec).
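
Although individual scores from different metrics should not be compared directly, the agreement between two series of scores over a whole database, as in the correlation and rmse figures quoted for sQLEAR, can be quantified. The sketch below uses the plain Pearson correlation coefficient and the plain root mean square error. This paper does not state which exact variants were used in the sQLEAR evaluation, so these definitions and the function names are assumptions.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean square error between predicted and reference scores."""
    n = len(predicted)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)
```

Here `predicted` would hold the scores of the predictor under evaluation and `reference` the scores it is evaluated against, one pair per sample in the evaluation database.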


However, because both P.863 and sQLEAR support VoLTE service evaluation, the quality trends provided by sQLEAR and P.863, when determined from a large number of samples collected using the same voice references and identical network conditions, are expected to be the same with a high statistical confidence level. It should be noted, though, that test devices which exhibit a particularly strong voice frequency characteristic could be significantly penalized or favored by perceptual models, such as ITU-T P.863. These effects are outside the network and therefore do not represent the network-centric voice quality performance, which is what sQLEAR is designed to predict. Consequently, in these special scenarios, depending on the strength of the device’s voice frequency characteristic, expected and rightful differences can be detected between sQLEAR and ITU-T P.863.

What are the differences versus other voice QoE solutions?
By leveraging network/codec/client parameters and the reference speech sample, sQLEAR is neither solely speech-based, like perceptual intrusive and non-intrusive algorithms (e.g. ITU-T P.863, P.563), nor solely parametric and non-intrusive, like ITU-T P.564. In addition, sQLEAR has been designed for the evaluation of VoLTE service quality, while ITU-T P.563 and P.564 have not.

sQLEAR is uniquely based on machine learning, which ensures faster and easier adaptation to new environments, such as new codecs/clients and network parameters. This can be accomplished without new and costly subjective training sequences, which are required by perceptual intrusive models (e.g. ITU-T P.863).

sQLEAR avoids device-specific degradations caused by the audio path of a mobile device and focuses on the packet-based radio and core network. In addition, it does not use recorded speech, since that would also reflect the specific device used as a measurement unit, instead of network performance. Therefore, sQLEAR predicts speech quality efficiently and effectively under the following circumstances and conditions:

• Independently of device acoustical characteristics (unlike P.863);

• Without the need for tuning and calibration for each device (unlike P.863), which eliminates costly implementation time; and

• From the network- and client-based error concealment perspective (one of the unique characteristics of sQLEAR), which is the most cost-efficient means to enable network optimization for high quality voice services, since it renders speech quality scores comparable between different device models.

As previously mentioned, the sQLEAR measurement procedure is the same as for ITU-T P.863, in the sense that it sends a reference speech sample to the system under test and predicts voice QoE from the combination of the output from the device and the sent reference sample. However, the output is different. In the case of sQLEAR, the output from the device is the RTP stream, while in the case of ITU-T P.863, it is the recorded audio.

What are the similarities with other voice QoE solutions?
sQLEAR and ITU-T P.863 both belong to the class of intrusive voice quality evaluation algorithms, since they send test stimuli through the network under test.

Both ITU-T P.863 and sQLEAR support VoLTE services evaluation and are part of ITU work. sQLEAR is based on the ongoing ITU work item, ITU-T P.VSQMTF “Voice service quality monitoring and troubleshooting framework for intrusive parametric voice QoE prediction”.

More questions?
For more questions contact us at:

www.infovista.com

and watch the space for our forthcoming “sQLEAR Implementation Guide” to find out more about sQLEAR learning, evaluation and performance, detailed database description and pre-processing procedures, machine learning based feature selection, and much more!

Thank you!

About Infovista

Infovista, the leader in modern network performance, provides complete visibility and unprecedented
control to deliver brilliant experiences and maximum value with your network and applications. At the core
of our approach are data and analytics, to give you real-time insights and make critical business decisions.
Infovista offers a comprehensive line of solutions from radio network to enterprise to device throughout
the lifecycle of your network. No other provider has this completeness of vision. Network operators
worldwide depend on Infovista to deliver on the potential of their networks and applications to exceed user
expectations every day. Know your network with Infovista.

© Infovista - All rights reserved.
