Speech Quality Testing Solution (MOS) Whitepaper
This document describes the development and evolution of speech quality testing
technologies in telecommunications networks, and focuses on two objective
testing methods: PESQ and POLQA.
ITU: International Telecommunication Union
Speech quality testing on different networks with a unified standard can be a challenge.
In a GSM network, RxQual (based on BER) is used to evaluate the speech quality; in a CDMA
network, FER is used. In addition, even within the same network, a single RxQual or FER
value cannot represent the true speech quality. A professional, general-purpose speech
quality testing method is therefore required to perform direct comparative testing across
different networks.
Based on the test subject, speech quality testing can be divided into two categories:
subjective testing and objective testing.
According to ITU-T P.800 and ITU-T P.830, about 40 to 60 trained listeners
are required to compare the reference signal and the degraded signal perceptually,
based on detailed criteria. Each listener scores the degraded signal according to
the MOS scoring standard (a five-point scale from 1 to 5), and the scores are averaged
to obtain the final MOS value.
Very clear
No delay
Clear
Little noise
Unclear
A certain amount of noise
A certain amount of distortion
Unclear
Serious distortion
The subjective testing result is the most reliable, and this method can be used to evaluate
network performance and speech quality with any speech coding mode. However, its
disadvantages are obvious. Factors such as the evaluation environment and the listeners
must be strictly controlled, and the speech material must be carefully selected;
otherwise, the final result may be affected. All this makes the test time-consuming,
laborious, difficult to organize, and poorly repeatable. As a result, a more efficient and
repeatable method is required in actual testing, that is, an objective testing method.
In practice, objective testing compares parameters of the reference and degraded speech
signals in the time and frequency domains, and the test result is calculated
by hardware or software. Objective testing methods such as PAMS and PSQM were
introduced during research into objective speech quality testing. However, these
methods have significant limitations: the test result is affected by the particular speech
codec, and in some cases the result differs considerably from the MOS value obtained in
subjective testing.
In ITU-T P.862 (2001), the core speech quality testing method was upgraded to the PESQ
algorithm, which integrates the advantages of previous algorithms. The PESQ test result is
very close to the MOS value in the subjective testing, and PESQ algorithm is widely
Later, with the development and evolution of new communication technologies, the POLQA
algorithm was developed to support new speech codecs and super-wideband speech, and to
handle the time factor in VoIP. Compared with previous algorithms, introduction of the POLQA
algorithm to the unified and complex communication networks will achieve significant
makes POLQA algorithm applicable to any speech quality testing scenarios. POLQA
algorithm includes two modes: NB (Narrow Band) and SWB (Super Wideband),
Figure 2.1 shows the evolution of ITU-T recommendations for speech quality testing. The
3.1 Introduction
Figure 3.2 shows the entire PESQ algorithm structure. The model begins by level
aligning both signals to a standard listening level. They are filtered (using an FFT)
with an input filter to model a standard telephone handset. The signals are aligned in
time and then processed through an auditory transform similar to that of PSQM. The
transformation also involves equalizing for linear filtering in the system and for gain
variation. Two distortion parameters are extracted from the disturbance (the difference
between the transforms of the signals), and are aggregated in frequency and time and
mapped to a prediction of subjective MOS. Generally, the greater the difference
between the degraded signal and the reference signal, the lower the speech quality
score.
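As a rough illustration of the first step only (level alignment), the sketch below scales two signals to a common RMS level. The function names, the target level, and the sample values are assumptions made for illustration; the actual PESQ level alignment is specified in ITU-T P.862 and is not reproduced here.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_align(samples, target_rms):
    """Scale samples so that their RMS equals target_rms (hypothetical target)."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # pure silence cannot be scaled to a non-zero level
    gain = target_rms / current
    return [s * gain for s in samples]


# Hypothetical signals: the degraded signal has the same shape at a lower level.
reference = [0.0, 0.5, -0.5, 0.25, -0.25]
degraded = [0.0, 0.1, -0.1, 0.05, -0.05]

aligned_ref = level_align(reference, target_rms=0.1)
aligned_deg = level_align(degraded, target_rms=0.1)
```

After alignment, a pure level difference between the two signals no longer contributes to the measured disturbance, so later stages compare only the remaining differences.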
Figure 3.2 PESQ algorithm structure (block labels: system under test, degraded signal,
level align, input filter, time align and equalise, auditory transform, disturbance
processing, identify bad intervals, cognitive modelling, prediction of perceived speech
quality)
PSQM and measuring normalizing blocks (MNB) were recommended only for
narrowband codec assessment and were known to produce inaccurate predictions
with certain types of codec, background noise, and end-to-end effects such as filtering
and variable delay. The scope of PESQ is therefore much wider. In addition,
PESQ provides significantly higher correlation with subjective opinion than the
models described in P.861, namely PSQM and MNB. Results indicate that it gives accurate
predictions of subjective quality in a very wide range of conditions, including those with
background noise, analogue filtering, and/or variable delay.
Table 3 Correlation of different speech quality testing methods
Type            Corr. Coeff.  PESQ   PAMS   PSQM   PSQM+  MNB
Mobile Network  average       0.962  0.954  0.924  0.935  0.884
Mobile Network  worst-case    0.905  0.895  0.843  0.859  0.731
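For readers unfamiliar with the metric in Table 3, the following sketch shows how a correlation coefficient between objective scores and subjective MOS values can be computed. It assumes the Pearson coefficient is meant, which the table does not state explicitly, and the score lists are hypothetical values, not data from the source.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-condition scores (not taken from the whitepaper).
subjective_mos = [1.5, 2.0, 3.0, 4.0, 4.5]
objective_score = [1.8, 2.1, 2.9, 3.8, 4.4]

r = pearson(subjective_mos, objective_score)
```

A value close to 1 means the objective method ranks and spaces the test conditions in much the same way as human listeners.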
According to related ITU-T information, the PESQ algorithm provides very accurate
predictions and was applicable to all network technologies known at that time (such as
GSM, CDMA, and 3G).
PESQ algorithm is the most sophisticated and accurate speech quality testing method,
and the test result obtained from this method mostly conforms to users' subjective
perceptions.
There are three kinds of PESQ speech quality testing values:
(value range: 1.0 to 5.0, where 1.0 represents the lowest quality)
The value of PESQ SCORE is calculated directly by the algorithm, whereas the value of
PESQ MOS is a subjective mean opinion score. If the speech quality is poor, the value
of PESQ SCORE is always higher than the value of PESQ MOS, which is
unreasonable. For this reason, ITU introduced PESQ LQ, whose value is closer to the
subjective value. In other words, PESQ SCORE is the ideal value calculated by the
Based on simulation and actual tests, Figure 3.3 shows ideal PESQ values under
various network conditions and codecs. These results assume
transmission without errors or packet loss. In real networks, how closely the test
results approach these values depends on the test environment.
Figure 3.3 Typical PESQ score under various network conditions
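The relationship between the raw PESQ score and a listening-quality value described earlier in this section can be illustrated with the mapping function standardised in ITU-T P.862.1. This is a sketch under the assumption that P.862.1 is the mapping of interest; the source does not state which formula its PESQ LQ value uses, and the function name is chosen here for illustration.

```python
import math


def pesq_raw_to_mos_lqo(raw_score):
    """Map a raw P.862 score (about -0.5 to 4.5) to MOS-LQO per ITU-T P.862.1."""
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))


low = pesq_raw_to_mos_lqo(-0.5)   # bottom of the raw score range
poor = pesq_raw_to_mos_lqo(2.0)   # a poor-quality raw score
high = pesq_raw_to_mos_lqo(4.5)   # top of the raw score range
```

With this function, a poor raw score such as 2.0 maps to a lower value than the raw score itself, which matches the observation above that the raw score overstates quality when the speech quality is poor.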
4.1 Introduction
PESQ itself covers a very wide range of applications, such as fixed and wireless
network data testing, POTS (Plain Old Telephone Service), VoIP, and 3G. Compared to
PESQ, POLQA makes a variety of improvements to suit scenarios to which PESQ is
not applicable.
The major improvements of POLQA are listed as follows:
network technologies
Applicable to speech enhanced system (such as VQE and VED) that uses
bandwidth
Supporting direct comparison between AMR (in GSM/CDMA) and EVRC (in
Applying POLQA to today's complex, unified networks will give a significant boost in
The telecom industry is now beginning the evolution from narrow-band telephony to
wideband speech transmission. Wideband codecs are ready. Current
voice codec development targets so-called super-wideband audio (up to
14,000 Hz) or even higher ('full-band'), up to approx. 24,000 Hz. However, for
human speech the perceived difference between super-wideband and full-band is
negligible.
Speech quality testing faces corresponding bandwidth problems. In
traditional telephony scenarios, the expectation is set to a perfect narrow-band voice
signal. A signal that is close or identical to such a signal is scored subjectively by
human listeners with a high quality value (usually a MOS-LQ of around 4.5 on a
five-point scale). Within a super-wideband scenario the situation is different. The
expectation of excellent quality is a perfect super-wideband speech signal. Since the
same five-point scale is used, such a perfect super-wideband signal is also
subjectively scored close to excellent in the range of 4.5. Obviously, a narrow-band
signal in that super-wideband context will not fulfil the expectation of high quality due to
its band limitation. Consequently, it will be scored lower in this context.
Since the range of the scores is the same but the meaning is different depending on
the context, the two are named as different scales: narrow-band or super-wideband.
Broadly the main difference is that narrow-band signals will be scored lower in a
super-wideband context than in narrow-band experiments, since the band-limitation is
scored as degradation. Hence, scores given on the two different scales must not be
POLQA uses an advanced psycho-acoustic model for emulating the human perception
and transforming the sound into an internal neuronal representation. POLQA, as a full
reference approach, compares the input or high quality reference signal and the
associated degraded signal under test. This process is shown in Figure 5. POLQA
takes into account masking effects of the human hearing and uses the concept of
idealization of both input signals in multiple steps. This ensures that only the relevant
perfect speech information is used for comparison and any unwanted signal
components are discarded.
Figure 4.1 shows the POLQA algorithm structure. The model first performs time
alignment of the reference signal and the degraded signal, which is used to estimate the
delay and sample rate differences between the two signals. Once the correct delay is
determined and the sample rate differences have been compensated, the signals and
the delay information are passed on to the core model, which calculates the
perceptibility as well as the annoyance of the distortions and maps them to a MOS
scale.
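The delay estimation step can be pictured with a simple cross-correlation search. The sketch below is only an illustration of that general idea with hypothetical sample values and function names; it is not the POLQA alignment procedure, which also handles sample rate differences and time-varying delay.

```python
def estimate_delay(reference, degraded, max_lag):
    """Return the lag in samples (0..max_lag) with the highest cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        overlap = min(len(reference), len(degraded) - lag)
        if overlap <= 0:
            break  # the degraded signal is too short for this lag
        score = sum(reference[i] * degraded[i + lag] for i in range(overlap))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical signals: the degraded copy is attenuated and delayed by 3 samples.
reference = [0.0, 1.0, -1.0, 0.5, -0.5, 0.25, 0.0, 0.0]
degraded = [0.0, 0.0, 0.0] + [0.8 * s for s in reference]

delay = estimate_delay(reference, degraded, max_lag=5)
```

Once the lag is known, the degraded signal can be shifted so that both signals are compared sample by sample on the same time axis.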
Figure 4.2 shows typical values to be expected from POLQA. These were confirmed by
subjective auditory experiments. In an actual network, how closely the MOS values
approach the following values depends on environmental factors.
Figure 4.2 Typical values to be expected from POLQA
Released Pilot Pioneer test tool with POLQA support at the end of 2012
As an integrated test platform, Pilot Pioneer can be upgraded to the version with MOS
test functions just by adding a separate audio MOS box.
Note: In the remainder of this document, MOS value refers to the speech quality testing
score based on PESQ or POLQA unless stated otherwise, and MOS box refers to the
independent hardware system used by Dingli to test speech quality.
In addition to providing the latest speech quality test tool, Dingli also focuses on
speech quality optimization in practice. Dingli’s solution analyzes and explores the
exact impact of wireless environment factors on the MOS values to provide the most
accurate and credible theoretical and practical reference information for network
optimization. The main research content includes:
Impact of C/I (Carrier/Interference) on the MOS value within the same coverage
Impact of signal strength on the MOS value when the C/I is good
The Dingli MOS box is an accessory specially designed for wireless network speech
quality testing. Users may use Pilot Pioneer and the MOS box with different test terminals
to complete speech quality testing for various networks in various scenarios. In
practice, the MOS box has a variety of technical advantages. See Table 4.
Table 4 Technical advantages of Dingli MOS box
Item                    Description
Compact, easy to carry  Dimensions: 45 (L) x 13 (H) x 38 (W) cm
                        Standard weight: 5 kg
                        Power: built-in battery or external power supply
Protective material     Aluminum alloy material, anti-compression and anti-seismic
                        Plastics and protective film composition for mobile phone slots
MOS testing includes three test solutions based on Pilot Pioneer, Pilot RCU, and Pilot
Walktour respectively, and one analysis solution based on Pilot Navigator. All three
test solutions support both the PESQ and POLQA algorithms. Users only need to select
PESQ or POLQA during configuration.
1. Mobile-to-Mobile
This mode supports speech quality and benchmarking tests for operators, with a
maximum of four networks simultaneously. The test terminals can be freely
combined across network technologies (such as 2G1C, 3W1C, and so on).
Users may initiate a call from one mobile phone to another. After the call is
connected, the calling party replays a speech sample, which is sent to the called
party through the base station. The called party records the
speech and compares it with a standard speech sample to obtain the uplink
value of the calling party (which is also the downlink value of the called party). Then, the
called party replays the speech sample, which is returned to the calling party through
the base station. The calling party records the speech and compares it with
a standard speech sample to obtain the uplink value of the called party (which is also the
downlink value of the calling party). Users may alternate the test terminals and
repeat the test in a continuous loop.
©Dingli (27/7/2013) DL1AMOSWP Rev1 19 / 37
Note: The above description is valid for PESQ. For POLQA, Pilot Pioneer currently
supports a POLQA score only for the calling party.
2. Mobile-to-Land
Users may conduct mobile-to-land MOS test based on network type, or customize
the solution by defining the terminal type and quantity according to network type.
Currently, this test mode supports a maximum of four networks simultaneously.
Note: For POLQA, Pilot Pioneer currently supports only 8 kHz speech sampling
and the downlink POLQA score.
Figure 5.2 Mobile-to-Land
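For the mobile-to-mobile PESQ cycle described earlier in this section, the way each recording maps to uplink and downlink values can be summarised in a few lines. The function name, dictionary layout, and score values below are hypothetical and only restate the rule given in the text; they do not describe the Pilot Pioneer software.

```python
def label_cycle_scores(score_recorded_by_called, score_recorded_by_calling):
    """Assign the two scores of one test cycle to uplink/downlink per party.

    A recording made by the called party measures the calling party's uplink
    (and the called party's downlink); the reverse holds for a recording made
    by the calling party.
    """
    return {
        "calling": {
            "uplink": score_recorded_by_called,
            "downlink": score_recorded_by_calling,
        },
        "called": {
            "uplink": score_recorded_by_calling,
            "downlink": score_recorded_by_called,
        },
    }


# Hypothetical PESQ MOS values from one cycle.
cycle = label_cycle_scores(score_recorded_by_called=3.9,
                           score_recorded_by_calling=3.6)
```

Each cycle therefore yields four labelled values from only two recordings, because one party's uplink is always the other party's downlink.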
Dingli Pilot Fleet (RCU) supports automatic speech quality testing. It can
integrate multiple MOS test modules internally to perform single-network MOS testing
or multi-network MOS benchmarking. It offers a variety of end-to-end testing
methods, such as from a remote test module to a fixed-line phone, from one test module
to another, and between test modules in different cities. This in
turn allows users to test the pure uplink or downlink MOS value, the user's perceptual MOS
value, and the long-distance speech quality MOS value.
The testing process is almost the same as that from an RCU test module to a
server. The difference is that the calls are between two RCU test modules or two test
modules in the same RCU. This solution also supports multi-channel MOS
testing.
Note: When the POLQA algorithm is used, this mode supports a maximum of two
dual-core RCU test modules dialing each other.
1. Mobile-to-Mobile
As shown in Figure 5.7, Dingli Pilot Walktour supports calling from one mobile phone to
another to perform MOS testing. Using the software kernel and MOS algorithm
integrated in the mobile phone, users may use one mobile phone to replay the
speech sample and the other to record the voice, and so perform speech quality testing.
Note: In this mode, the iOS Walktour POLQA score can be viewed only when
analyzed with Pilot Navigator, not on the mobile phone; Android Walktour
supports the POLQA score only on the mobile phone of the calling party.
2. Mobile-to-Land
Users may conduct mobile-to-land MOS test based on network type, or customize
the solution by defining the terminal type and quantity according to network type.
Note: In this mode, iOS Walktour supports the POLQA score only when analyzed
with Pilot Navigator, not on the mobile phone. Android Walktour supports
the POLQA score only on the mobile phone of the calling party.
MOS value in the EFR mode > MOS value in the FR mode > MOS value in the HR
mode
The following is a group of speech quality testing results (from a mobile phone to a
fixed-line phone) without environmental interference.
Uplink PESQ MOS value: value in the EFR mode (max 4.20) > value in the
Downlink PESQ MOS value: value in the EFR mode (max 4.255) > value in
the FR mode (max 3.940) > value in the HR mode (max 3.728)
In commercial networks, because of network interference and other factors, the test
results obtained with different coding rates differ considerably from the results
obtained in an ideal environment. Table 6 shows example average values from an
actual speech quality test in an urban environment.
Table 6 Impact of HR in GSM Network on PESQ result
PESQ Result              HR Rate = 0%  HR Percentage = 40%  HR Percentage = 100%
Uplink PESQ MOS value    3.588         3.428                3.331
Downlink PESQ MOS value  3.418         3.325                3.259
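To make the trend in Table 6 explicit, the short sketch below computes how far the average PESQ MOS value falls as the HR share rises. The numbers are taken from Table 6; the dictionary layout and function name are assumptions for illustration.

```python
# Average PESQ MOS values from Table 6, keyed by HR percentage.
table6 = {
    "uplink": {0: 3.588, 40: 3.428, 100: 3.331},
    "downlink": {0: 3.418, 40: 3.325, 100: 3.259},
}


def mos_drop(direction, hr_percent):
    """Drop in average PESQ MOS relative to the 0% HR case, rounded to 3 places."""
    scores = table6[direction]
    return round(scores[0] - scores[hr_percent], 3)


uplink_drop_full_hr = mos_drop("uplink", 100)
downlink_drop_full_hr = mos_drop("downlink", 100)
```

In this single test the uplink loses more than the downlink when all calls use HR, although both directions degrade steadily as the HR share increases.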
When the GSM RxQual Sub value range is [0, 2], the PESQ MOS value
When the CDMA FFER value range is [0, 3%], the PESQ MOS value range is
[3, 4.1].
Generally, if the RxQual Sub/FFER value is high, the MOS value is low. However, if
the MOS value is low, the RxQual Sub/FFER value is not necessarily high (the MOS
value may be affected by other factors).
A good network environment is the basic element ensuring good wireless
communication. Low C/I and C/A (Carrier/Adjacent) values result in high BER,
which decreases the call quality or triggers dropped calls. High BER caused by
network frequency interference has been a major concern for network
optimization. From the user's perceptual aspect, transient BER does not affect the
listening experience, while continuous BER causes frame loss and seriously impairs
listening.
The following conclusions can be drawn from practical tests:
In EFR mode, when the downlink RXQUAL value is greater than 4.8, the
downlink PESQ MOS value is lower; when the downlink RXQUAL value is
greater than 5.4, the downlink PESQ MOS value is lower than 3.3; when the
downlink RXQUAL value is greater than 6, the downlink PESQ MOS value is
In FR mode, when the downlink RXQUAL value is greater than 5.1, the
downlink PESQ MOS value is lower; when the downlink RXQUAL value is
greater than 5.6, the downlink PESQ MOS value is lower than 3.3; when the
downlink RXQUAL value is greater than 6, the downlink PESQ MOS value is
In HR mode, when the downlink RXQUAL value is greater than 4.8, the
greater than 5.2, the downlink PESQ MOS value is lower than 3.3; when the
downlink RXQUAL value is greater than 6, the downlink PESQ MOS value is
Handover has serious impact on the PESQ MOS value. In addition, when the
In EFR mode, if the handover occurs once every six seconds in transferring
uplink and downlink voice, the average uplink and downlink PESQ MOS
value is 1 lower than the maximum value. If the handover occurs twice every
six seconds in transferring uplink and downlink voice, the average uplink and
uplink and downlink voice, the average uplink and downlink PESQ MOS
value is 1 lower than the maximum value. If the handover occurs twice every
six seconds in transferring uplink and downlink voice, the average uplink and
downlink PESQ MOS value is 1.5 lower than the maximum value.
PESQ MOS value (the PESQ MOS value may dropped close to 1).
6. Impact of Signal Strength on MOS Value
When signal strength changes and BER / FER is not greater than 0, the RXQUAL
Theoretically, the parameters affecting PESQ MOS values also have an impact on the
POLQA test results. POLQA research is currently ongoing. This section describes the
POLQA test interface and parameters in Pilot Pioneer.
Figure 5.8 shows the POLQA test interface in Pilot Pioneer.
The reference waveform and degraded waveform are displayed in the upper part of the
interface, and the POLQA test results are displayed in the lower part of the interface,
including Library Version, Processing Mode (NB/SWB), Mean
Delay, Minimum Delay, Maximum Delay, and so on. For detailed information, see
Table 8.
Table 8 POLQA parameters in Pilot Pioneer
Parameter Description
KHz respectively.
Level Reference (dBov)        The level of the reference signal in dBov (averaged over
                              the entire signal).
Pause Level Reference (dBov)  The silence level of the reference signal in dBov,
                              measured similar to P.56.
Pause Level Degraded (dBov)   The silence level of the degraded signal in dBov,
                              measured similar to P.56.
Record file                   The recorded degraded signal file, along with its location.
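The dBov levels in Table 8 express signal power relative to the overload point of the digital system. The sketch below shows that calculation for a level averaged over the entire signal. It assumes 16-bit PCM with a full-scale value of 32768 and uses hypothetical test signals; it does not implement the P.56 speech level measurement mentioned for the pause levels, and it is not taken from Pilot Pioneer.

```python
import math

FULL_SCALE = 32768.0  # assumed overload point for 16-bit PCM samples


def level_dbov(samples, full_scale=FULL_SCALE):
    """Mean signal power relative to the overload point, in dBov."""
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence has no finite level
    return 10.0 * math.log10(mean_square / (full_scale * full_scale))


# Hypothetical test signals: a full-scale square wave and a full-scale sine wave.
square_wave = [FULL_SCALE, -FULL_SCALE] * 50
sine_wave = [FULL_SCALE * math.sin(2.0 * math.pi * k / 100.0) for k in range(1000)]

square_level = level_dbov(square_wave)
sine_level = level_dbov(sine_wave)
```

By this definition a full-scale square wave sits at 0 dBov and a full-scale sine wave at about -3 dBov, so real speech levels are always negative numbers.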
Table 9, Table 10, and Table 11 describe the mean value and excellent value ratio
(excellent value range: 3.0 to 4.5) of three live networks in different scenarios in a
single test.
Table 9 Testing result of Operator A
Scenario          PESQ-LQ Mean Value  3.0-4.5 Percentage
Urban DT Voice    3.27                88.17%
Highway DT Voice  3.25                86.00%
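The two statistics reported in these tables, the mean value and the share of samples in the excellent range, can be computed as sketched below. The sample scores are hypothetical, and treating both range limits as inclusive is an assumption; the source does not say how the boundaries are handled.

```python
def summarize_scores(scores, low=3.0, high=4.5):
    """Return (mean value, percentage of scores within [low, high])."""
    if not scores:
        raise ValueError("scores must not be empty")
    mean_value = sum(scores) / len(scores)
    in_range = sum(1 for s in scores if low <= s <= high)
    return mean_value, 100.0 * in_range / len(scores)


# Hypothetical PESQ-LQ samples from one drive test (not data from the source).
samples = [3.4, 3.1, 2.7, 3.8, 3.3]

mean_value, excellent_pct = summarize_scores(samples)
```

Reporting both figures is useful because the mean alone can hide a tail of poor calls, while the percentage shows how many samples actually reached the excellent range.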
4. For further study. Factors, technologies and applications for which PESQ has
not currently been validated