Test Battery for Opus ML Codec Extensions
draft-lechler-mlcodec-test-battery-02
This document is an Internet-Draft (I-D).
Anyone may submit an I-D to the IETF.
This I-D is not endorsed by the IETF and has no formal standing in the
IETF standards process.
| Field | Value |
|---|---|
| Document type | Active Internet-Draft (candidate for mlcodec WG) |
| Authors | Laura Lechler, Kamil Wojcicki |
| Last updated | 2025-11-18 (Latest revision 2025-11-06) |
| RFC stream | Internet Engineering Task Force (IETF) |
| Intended RFC status | (None) |
| Additional resources | Mailing list discussion |
| WG state | Call For Adoption By WG Issued |
| Document shepherd | (None) |
| IESG state | I-D Exists |
| Consensus boilerplate | Unknown |
| Telechat date | (None) |
| Responsible AD | (None) |
| Send notices to | (None) |
Machine Learning for Audio Coding                             L. Lechler
Internet-Draft                                               K. Wojcicki
Intended status: Informational                             Cisco Systems
Expires: 10 May 2026                                     6 November 2025
Abstract
This document proposes a methodology and data for the evaluation of
machine learning (ML) codec extensions, such as deep audio redundancy
(DRED), within the Opus codec (RFC 6716).
About This Document
This note is to be removed before publishing as an RFC.
Status information for this document may be found at
https://datatracker.ietf.org/doc/draft-lechler-mlcodec-test-battery/.
Discussion of this document takes place on the Machine Learning for
Audio Coding Working Group mailing list (mailto:mlcodec@ietf.org),
which is archived at https://mailarchive.ietf.org/arch/browse/mlcodec/.
Subscribe at https://www.ietf.org/mailman/listinfo/mlcodec/.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 10 May 2026.
Lechler & Wojcicki Expires 10 May 2026 [Page 1]
Internet-Draft MlCodecTestBattery November 2025
Copyright Notice
Copyright (c) 2025 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
1.1. Listening Test Methods
1.1.1. MUSHRA-1S
1.1.2. DCR
1.1.3. DRT
1.1.4. Crowdsourcing Adaptations
2. Proposed Crowdsourced Listening Test Battery
2.1. Speech Quality Evaluation
2.1.1. Clean Speech Test Vectors
2.1.2. Real-World Degradation Test Vectors
2.1.3. Simultaneous Talker Test Vectors
2.1.4. Packet Loss Scenarios
2.2. Speech Intelligibility Evaluation
2.2.1. Clean Speech Test Vectors
2.2.2. Noisy Test Vectors
2.3. Example Results
3. Objective Evaluation
4. Conventions and Definitions
5. Security Considerations
6. IANA Considerations
7. References
7.1. Normative References
7.2. Informative References
Authors' Addresses
1. Introduction
The IETF machine learning for audio coding (mlcodec) working group
aims to leverage current and future opportunities presented by ML
codecs to enhance the Opus codec [RFC6716] and its extensions,
including by improving speech coding quality and robustness to packet
loss. Effective evaluation of codec extensions (such as DRED), in
both standalone and redundancy settings, is a crucial factor in
achieving those objectives. It supports reproducibility for existing
extensions (for instance, by enabling validation of whether a
retraining pipeline matches baseline model performance) and enables
benchmarking of future improvements against previously established
baselines.
However, as outlined in subsequent sections, effective evaluation of
generative ML models presents numerous challenges and necessitates
specialized subjective and objective evaluation methods. This
document proposes a crowdsourced subjective test battery, along with
associated test datasets, to address the unique requirements for
accurate and reproducible evaluations of ML codecs. The proposed
test battery covers both speech quality and intelligibility,
including tests in clean, noisy, and reverberant conditions, and
incorporates real-world audio data. The methodology leverages
crowdsourced listeners [CROWDSOURCED-DRT] to enable rapid and
scalable assessments, while controlling the variability associated
with non-lab-based measurements.
In the era of generative ML models, reference-based objective metrics
face additional limitations, while non-intrusive methods struggle
with generalization (see, e.g., [URGENT2025] and
[CROWDSOURCED-MUSHRA]). Consequently, the use of human listeners, the
gold standard in both quality and intelligibility assessment, is
particularly important.
The generative nature of ML codecs also implies that speech
intelligibility could be significantly improved and/or degraded by
such algorithms. For example, human perception for some phoneme
categories could be enhanced, while confusions might be introduced
for others, including hallucinations of incorrect phonemes even at
high overall perceived quality. Such confusions may not be easily
detected in quality tests, highlighting a pressing need for highly
diagnostic phoneme-category, or even phoneme-level, intelligibility
assessment methods.
The subsequent sections present the methodology, key considerations,
and further motivation underlying the proposed test battery,
addressing the challenges and requirements discussed above.
1.1. Listening Test Methods
1.1.1. MUSHRA-1S
MUSHRA-1S [MUSHRA-1S], a variant of the well-established MUSHRA
(multiple stimuli with hidden reference and anchor) methodology for
assessing quality [ITU-R.BS1534-3], is proposed for testing and
benchmarking of ML codecs in clean, non-reverberant conditions. It
differs from MUSHRA in two respects. First, the test is adapted to a
crowdsourced, non-expert listener base, as described in
[CROWDSOURCED-MUSHRA]. Particularly for generative models, which may
cause hallucinations, a reference-based listening test is preferable
[URGENT2025]. Second, one system under test is assessed at a time, in
the context of a fixed reference and anchor. Testing one system at a
time has several advantages: test conditions can be added without
limit within the quality range spanned by the anchor and reference;
context effects from other conditions within the same test are
avoided; the difficulties associated with merging results across
multiple tests are avoided; and the task is simpler for participants,
which reduces listener fatigue, particularly among non-expert
listeners. As such, MUSHRA-1S is similar to absolute category rating
(ACR) tests, which can be used to calculate a mean opinion score
(MOS), in that it is simple and easily extendable. At the same time,
it is more stable than ACR, due to the fixed range of expected audio
quality, bounded by the anchor and reference. Reference-less MOS
scores have been demonstrated to suffer from range-equalizing biases
[COOPER2023], with other samples presented within the same test
defining the range of expectation of what constitutes "good" or "bad"
speech quality. The drawback of MUSHRA-1S, compared to a traditional
MUSHRA test, is slightly decreased sensitivity to very small
differences between similar methods, which may only be detectable in
direct comparisons.
1.1.2. DCR
The degradation category rating (DCR) approach is used to produce a
degradation mean opinion score (DMOS) [ITU-T.P800]. Although it is
typically used with a high-quality reference, the test is also
capable of assessing degradation caused by codecs when tested on
mild-to-moderately impaired real-world data [MULLER2024]. The
approach is more sensitive than absolute category ratings (ACR)
[ITU-T.P800]. An implementation of the test procedure for
crowdsourced tests is available in [ITU-T.P808].
1.1.3. DRT
The diagnostic rhyme test (DRT) [ITU-T.P807] measures speech
intelligibility by presenting minimal pairs where the contrasted
phonemes differ in terms of a specific, controlled phonetic category.
Because its test items belong to classes of distinctive linguistic
features that are acoustically interpretable, the DRT provides
linguistic and acoustic insight, which makes it a useful tool for both
codec analysis and benchmarking. The test is free from context and
memory effects and has high sensitivity. It is therefore
well-suited for a crowdsourced listener audience. Bearing in mind
the principles for crowdsourcing listening tests employed in
[ITU-T.P808], the test was adapted for crowdsourced listening tests
in [CROWDSOURCED-DRT] and test vectors in five languages were
published [DRT-REPO]. The test data was recently adopted by
[LESCHANOWSKY2025].
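As an illustration of how DRT responses might be scored per phonetic
category, the sketch below applies the commonly used correction for
guessing in a two-alternative test, 100 * (right - wrong) / total.
This document does not prescribe a scoring formula, so the formula is
an assumption here, and the function name and the (category, correct)
response layout are hypothetical.

```python
from collections import defaultdict


def drt_scores(responses):
    """Return chance-corrected DRT scores (in percent) per category.

    `responses` is an iterable of (category, is_correct) pairs, one per
    trial (hypothetical layout).
    Assumed scoring rule: 100 * (right - wrong) / total.
    """
    right = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in responses:
        total[category] += 1
        right[category] += int(is_correct)
    # wrong = total - right, hence right - wrong = 2 * right - total
    return {c: 100.0 * (2 * right[c] - total[c]) / total[c] for c in total}
```

Under this rule, a listener who guesses at random scores close to 0 in
a category, while a listener who answers every trial correctly scores
100.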
1.1.4. Crowdsourcing Adaptations
Crowdsourced listening tests benefit from rigorous screening and
quality control. In addition to the specific implementation of
standardized test approaches for crowdsourced listening tests,
[ITU-T.P808] has provided useful guiding principles for the
adaptation of laboratory-based tests to counteract challenges posed
by the comparatively uncontrolled crowdsourcing environment. For
instance, steps of qualification and training are added before the
actual test stimuli are presented and catch trials are included in
the pool of test questions. It is further recommended to assess the
quality of participants' responses across different platforms, such
as Amazon Mechanical Turk, Prolific, and others
[CROWDSOURCED-MUSHRA]. Each platform has a unique set of filters
that can be used to recruit a specific participant pool. The
platform and any filters used should always be reported along with
test results, as absolute results may depend on those settings and
may differ considerably between platforms.
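The catch-trial screening mentioned above can be sketched as follows.
The record layout, the function name, and the 80% pass threshold are
illustrative assumptions; neither this document nor the text above
fixes a particular threshold.

```python
def passing_participants(trials, min_catch_accuracy=0.8):
    """Return the IDs of participants with acceptable catch-trial accuracy.

    `trials` is an iterable of dicts with the hypothetical keys
    'participant', 'is_catch' (bool) and 'correct' (bool).
    Participants who saw no catch trial are excluded.
    """
    hits, seen = {}, {}
    for trial in trials:
        if not trial["is_catch"]:
            continue  # only catch trials are used for screening
        pid = trial["participant"]
        seen[pid] = seen.get(pid, 0) + 1
        hits[pid] = hits.get(pid, 0) + int(trial["correct"])
    return {pid for pid in seen if hits[pid] / seen[pid] >= min_catch_accuracy}
```

Responses from participants outside the returned set would then be
discarded before scores are aggregated.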
2. Proposed Crowdsourced Listening Test Battery
In the literature, evaluations of speech codec quality often focus
solely on clean conditions. However, given the wide range of
potential applications for modern speech codecs, and the unique ways
in which ML codecs may be affected by various types of real-world
distortions, it is important to assess their limitations under
representative real-world scenarios, including challenging listening
conditions.
In addition to clean speech data, the proposed test battery considers
performance evaluation on overlapping speech, reverberant and noisy
speech, speaker consistency and phoneme-level intelligibility. The
current version comprises predominantly English test vectors, but the
extension to include multiple languages is desirable. Some of the
modules of the test battery outlined below for assessing standalone ML
codec performance can also be used, where applicable, to assess the
performance of redundancy schemes under packet loss conditions (e.g.,
Opus+DRED).
The proposed test vectors are publicly available at a sampling rate
of 24 kHz at
https://github.com/cisco/multilingual-speech-testing/tree/main/LRAC-2025-test-data/blind-test-set/track_1.
2.1. Speech Quality Evaluation
2.1.1. Clean Speech Test Vectors
The MUSHRA-1S approach, applied to high-quality clean speech data, is
used to evaluate the overall quality of the system under test. The
reference also allows the listener to assess the correctness of the
linguistic content and the preservation of the speaker
characteristics. In this test, the quality of each codec or extension
is assessed in standalone mode. The diverse test set comprises 100
gender-balanced clean speech files, covering 100 unique speakers, and
includes samples of both adult and child speech. Furthermore, the set
of test vectors covers a diverse range of accents of English.
2.1.2. Real-World Degradation Test Vectors
As speech codecs may be used by a wide variety of applications, it
cannot be ensured that the audio to be compressed constitutes clean
speech in the sense of dry and noise-free high-quality audio. It is
therefore important to assess the codec's resilience to real-world
degradation. For tests where test vectors have impaired quality, DCR
offers an effective way to measure the severity of any additional
degradation introduced by the codec. The test data consists of 90
crowdsourced speech files in mildly impaired real-world scenarios of
noise and reverberation. Of these, 45 files are predominantly
focussed on reverberant speech and 45 on speech in noise. The
reverberation and noise levels are mild to moderate.
2.1.3. Simultaneous Talker Test Vectors
Most applications rely on the codec's ability to preserve
simultaneously occurring speech from multiple talkers. In practice,
however, this can be a challenging task. A listening
test using the DCR methodology offers insights into whether the
presence of overlapping speech leads to degradation, which may occur
in the form of artifacts or speech suppression. The proposed test
set consists of 20 files of conversations between two to three
talkers.
2.1.4. Packet Loss Scenarios
Real-world packet loss traces and/or simulated loss patterns
(including those generated with the packet loss simulator provided by
the working group in Opus) can be used to evaluate the overall quality
of redundancy codecs, such as Opus and DRED working together.
Details TBD.
2.2. Speech Intelligibility Evaluation
2.2.1. Clean Speech Test Vectors
The DRT for evaluating speech intelligibility, adapted for
crowdsourced participants [CROWDSOURCED-DRT], is proposed to be
performed on a subset of the stimuli provided in [DRT-REPO]. The
subset consists of two test vectors, one male and one female talker
sample, for each word pair in the standard DRT word list for English
[ITU-T.P807]. Test vectors for four other languages are also
available in the same collection. Due to listeners' perceptual
sensitivity to the subtle and highly localized cues that distinguish
the two target phonemes, this test is primarily applicable in the
evaluation of standalone codecs, with limited expected utility when
combined with packet losses and redundancy schemes.
2.2.2. Noisy Test Vectors
To evaluate a codec's resilience to noise in terms of speech
intelligibility, the proposed evaluation battery for ML codecs
contains noisy counterparts of the clean speech test vectors
described in Section 2.2.1. Speech-shaped noise (SSN) is
used as a stationary additive masker in which intelligibility can be
evaluated. While the presence of noise may lead to particularly
severe codec distortion in some models, even the presence of well-
preserved noise can help to distinguish the intelligibility of high-
quality models that demonstrate a ceiling effect in clean conditions.
The use of stationary noise is essential for the DRT to ensure
uniform effects on the short-term localized perceptual cues. For the
same reason, the noisy version of the test is also geared towards the
evaluation of standalone codecs. The SSN was generated based on
long-term-averaged short-term spectra of a publicly available clean
speech data set [DEMIRSAHIN2020]. The average spectrum was used as a
filter that was convolved with white noise, resulting in SSN.
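The SSN generation described above can be sketched with NumPy as
follows. The frame length, hop size, Hann window, FIR filter design,
and unit-RMS normalization are assumptions made for illustration; the
text only states that the long-term-averaged short-term spectrum was
used as a filter and convolved with white noise. The function names
are hypothetical.

```python
import numpy as np


def long_term_average_spectrum(speech, frame_len=512, hop=256):
    """Average the short-term power spectra and return the magnitude."""
    n_frames = 1 + (len(speech) - frame_len) // hop
    if n_frames < 1:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    power = np.zeros(frame_len // 2 + 1)
    for i in range(n_frames):
        frame = speech[i * hop : i * hop + frame_len] * window
        power += np.abs(np.fft.rfft(frame)) ** 2
    return np.sqrt(power / n_frames)


def speech_shaped_noise(ltas, n_samples, rng):
    """Filter white noise with an FIR filter derived from the spectrum."""
    # Zero-phase impulse response of the magnitude spectrum, centred and
    # tapered so that it can be used as a finite-length filter.
    fir = np.fft.irfft(ltas)
    fir = np.roll(fir, len(fir) // 2) * np.hanning(len(fir))
    white = rng.standard_normal(n_samples + len(fir) - 1)
    ssn = np.convolve(white, fir, mode="valid")  # exactly n_samples long
    return ssn / np.sqrt(np.mean(ssn ** 2))  # normalise to unit RMS
```

In practice the spectrum would be averaged over the whole speech
corpus, and the resulting noise would be scaled to the desired
signal-to-noise ratio before being added to each test vector.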
2.3. Example Results
The results shown in Table 1 below were obtained using the test
methodology described above. Subjective tests were run on the
Prolific crowdsourcing platform. The participants were required to
be native speakers of English, with an approval rate of at least 98%
and at least 110 previous submissions. Only participants without any
self-reported hearing impairments and without a cochlear implant were
invited to participate. Additionally, diagnostic rhyme test studies
were only open to participants who self-reported not having
dyslexia.
+=============+==============+=================+==================+
| Codec | Quality in | Intelligibility | Quality in Real- |
| | Clean Speech | in Clean Speech | World Noise and |
| | (MUSHRA) | (DRT) [95% CI] | Reverberation |
| | [95% CI] | | (DCR) [95% CI] |
+=============+==============+=================+==================+
| Input | 98.3 [+/- | 94.9 [+/- 1.3] | 4.7 [+/- 0.1] |
| | 0.2] | | |
+-------------+--------------+-----------------+------------------+
| Opus v1.5.2 | 85.4 [+/- | 90.0 [+/- 2.0] | 4.3 [+/- 0.1] |
| 9000 bps | 1.7] | | |
| NOLACE | | | |
+-------------+--------------+-----------------+------------------+
| Opus v1.5.2 | 70.2 [+/- | 90.6 [+/- 1.8] | 3.9 [+/- 0.1] |
| 9000 bps | 2.0] | | |
| LACE | | | |
+-------------+--------------+-----------------+------------------+
| Opus v1.5.2 | 56.2 [+/- | 89.0 [+/- 2.0] | 3.3 [+/- 0.1] |
| 9000 bps | 2.3] | | |
+-------------+--------------+-----------------+------------------+
| Opus v1.5.2 | 24.0 [+/- | 86.3 [+/- 2.4] | 3.0 [+/- 0.1] |
| 6000 bps | 0.7] | | |
+-------------+--------------+-----------------+------------------+
| DRED SA | 60.6 [+/- | 90.5 [+/- 2.2] | 3.1 [+/- 0.1] |
| v1.5.2 q0 | 1.5] | | |
| 1772 bps | | | |
+-------------+--------------+-----------------+------------------+
| DRED SA | 62.3 [+/- | 88.1 [+/- 2.5] | 2.7 [+/- 0.1] |
| v1.5.2 q6 | 1.7] | | |
| 957 bps | | | |
+-------------+--------------+-----------------+------------------+
| DRED SA | 41.1 [+/- | 80.9 [+/- 3.3] | 1.8 [+/- 0.1] |
| v1.5.2 q10 | 1.6] | | |
| 423 bps | | | |
+-------------+--------------+-----------------+------------------+
| DRED SA | 61.4 [+/- | 90.4 [+/- 2.0] | 3.2 [+/- 0.1] |
| Candidate_A | 1.8] | | |
| greg189 q1 | | | |
| 1735 bps | | | |
+-------------+--------------+-----------------+------------------+
| DRED SA | 53.0 [+/- | 87.7 [+/- 2.4] | 2.5 [+/- 0.1] |
| Candidate_A | 1.3] | | |
| greg189 q6 | | | |
| 848 bps | | | |
+-------------+--------------+-----------------+------------------+
| DRED SA | 37.5 [+/- | 82.9 [+/- 2.9] | 1.9 [+/- 0.1] |
| Candidate_A | 1.8] | | |
| greg189 q9 | | | |
| 425 bps | | | |
+-------------+--------------+-----------------+------------------+
| DRED SA | 61.4 [+/- | 90.9 [+/- 2.1] | 3.1 [+/- 0.1] |
| Candidate_B | 1.6] | | |
| jm26d q1 | | | |
| 1786 bps | | | |
+-------------+--------------+-----------------+------------------+
| DRED SA | 50.4 [+/- | 88.9 [+/- 2.4] | 2.5 [+/- 0.1] |
| Candidate_B | 1.4] | | |
| jm26d q6 | | | |
| 868 bps | | | |
+-------------+--------------+-----------------+------------------+
| DRED SA | 36.8 [+/- | 84.8 [+/- 2.7] | 1.9 [+/- 0.1] |
| Candidate_B | 1.7] | | |
| jm26d q9 | | | |
| 456 bps | | | |
+-------------+--------------+-----------------+------------------+
Table 1
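Each cell in Table 1 reports a mean score together with the half-width
of a 95% confidence interval. This document does not state how the
intervals were computed; the sketch below assumes the common normal
approximation (1.96 standard errors of the mean over the individual
ratings). The function name is hypothetical and the ratings are
placeholders, not real data.

```python
import math
import statistics


def mean_with_ci95(ratings):
    """Return (mean, half_width) of a normal-approximation 95% interval."""
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(n)  # standard error
    return mean, 1.96 * sem


# Placeholder DCR ratings on the 1-5 degradation scale (not real data).
mean, half_width = mean_with_ci95([4, 5, 4, 3, 5, 4, 4, 5])
print(f"{mean:.2f} [+/- {half_width:.2f}]")
```

With small numbers of ratings, a Student's t quantile would be a more
conservative choice than the fixed factor of 1.96.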
3. Objective Evaluation
Objective metrics are often used during the development of speech
codecs, with expert evaluations conducted towards the end of the
development lifecycle. While effective for traditional DSP-based
codecs, well-established reference-based metrics, such as
PESQ [ITU-T.P862], often fail to accurately evaluate generative
methods. For instance, PESQ has been empirically shown to have an
underestimation bias for generative models which may have high output
quality but for which the output may also considerably differ from
the reference [CROWDSOURCED-MUSHRA].
At present, research into alternative metrics is flourishing, with
various innovative methods being proposed, such as non-intrusive
DNN-based metrics (e.g., [UTMOS]), metrics with non-matched references
(e.g., [SCOREQ]), and composite-score metrics (e.g., [UNI-VERSA]).
While recent correlation
investigations, e.g., [URGENT2025], are promising, it is too early to
include such metrics in this proposal, as it is yet to be seen which
metrics can demonstrate both good accuracy and generalization to a
variety of generative models and test vectors. Further insights in
this area are of potential value for rapid, accessible and
inexpensive evaluation of ML codecs. Hence, we propose to
investigate which objective metrics are effective predictors of
listener responses for the test battery components, and under which
conditions.
4. Conventions and Definitions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
5. Security Considerations
TBD
6. IANA Considerations
This document has no IANA actions.
7. References
7.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/rfc/rfc2119>.
[RFC6716] Valin, JM., Vos, K., and T. Terriberry, "Definition of the
Opus Audio Codec", RFC 6716, DOI 10.17487/RFC6716,
September 2012, <https://www.rfc-editor.org/rfc/rfc6716>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
7.2. Informative References
[COOPER2023]
Cooper, E. and J. Yamagishi, "Investigating Range-
Equalizing Bias in Mean Opinion Score Ratings of
Synthesized Speech", INTERSPEECH 2023, pages 1104--1108,
n.d., <https://www.isca-archive.org/interspeech_2023/
cooper23_interspeech.pdf>.
[CROWDSOURCED-DRT]
Lechler, L. and K. Wojcicki, "Crowdsourced Multilingual
Speech Intelligibility Testing", ICASSP 2024,
DOI 10.1109/ICASSP48485.2024.10447869, n.d.,
<https://ieeexplore.ieee.org/document/10447869>.
[CROWDSOURCED-MUSHRA]
Lechler, L., Moradi, C., and I. Balic, "Crowdsourcing
MUSHRA Tests in the Age of Generative Speech Technologies:
A Comparative Analysis of Subjective and Objective Testing
Methods", INTERSPEECH 2025, n.d.,
<https://arxiv.org/abs/2506.00950>.
[DEMIRSAHIN2020]
Demirsahin, I., Kjartansson, O., Gutkin, A., and C.
Rivera, "Crowdsourced high-quality UK and Ireland English
Dialect speech data set.", LREC 2020, pages 6532--6541,
ISBN 979-10-95546-34-4, n.d.,
<https://www.aclweb.org/anthology/2020.lrec-1.804>.
[DRT-REPO] Cisco Systems, "Multilingual Speech Testing - Speech
Intelligibility DRT", n.d., <https://github.com/cisco/
multilingual-speech-testing/tree/main/speech-
intelligibility-DRT>.
[ITU-R.BS1534-3]
ITU-R, "Method for the subjective assessment of
intermediate quality level of audio systems",
ITU-R Recommendation BS.1534-3, October 2015.
[ITU-T.P800]
ITU-T, "Methods for subjective determination of
transmission quality", ITU-T Recommendation P.800, August
1996.
[ITU-T.P807]
ITU-T, "Subjective test methodology for assessing speech
intelligibility", ITU-T Recommendation P.807, February
2016.
[ITU-T.P808]
ITU-T, "Subjective evaluation of speech quality with a
crowdsourcing approach", ITU-T Recommendation P.808, June
2021.
[ITU-T.P862]
ITU-T, "Perceptual evaluation of speech quality (PESQ): An
objective method for end-to-end speech quality assessment
of narrow-band telephone networks and speech codecs",
ITU-T Recommendation P.862, February 2001,
<https://www.itu.int/rec/T-REC-P.862>.
[LESCHANOWSKY2025]
Leschanowsky, A., Lakshminarayana, K.K., Rajasekhar, A.,
Behringer, L., Kilinc, I., Fuchs, G., and E.A.P. Habets,
"Benchmarking Neural Speech Codec Intelligibility with
SITool", INTERSPEECH 2025, DOI 10.48550/arXiv.2506.01731,
n.d., <https://arxiv.org/abs/2506.01731v1>.
[MULLER2024]
Muller, T., Ragot, S., Gros, L., Philippe, P., and P.
Scalart, "Speech quality evaluation of neural audio
codecs", INTERSPEECH 2024, pages 1760--1764, n.d.,
<https://www.isca-archive.org/interspeech_2024/
muller24c_interspeech.pdf>.
[MUSHRA-1S]
Lechler, L. and I. Balic, "MUSHRA-1S: A scalable and
sensitive test approach for evaluating top-tier speech
processing systems", Preprint 2025, n.d.,
<https://arxiv.org/abs/2509.19219>.
[SCOREQ] Ragano, A., Skoglund, J., and A. Hines, "SCOREQ: Speech
Quality Assessment with Contrastive Regression",
NeurIPS 2024, pages 105702--105729, n.d.,
<https://proceedings.neurips.cc/paper_files/paper/2024/
file/bece7e02455a628b770e49fcfa791147-Paper-
Conference.pdf>.
[UNI-VERSA]
Shi, J., Shim, H.J., and S. Watanabe, "Uni-VERSA:
Versatile Speech Assessment with a Unified Network",
DOI 10.48550/arXiv.2505.20741, 2025,
<https://doi.org/10.48550/arXiv.2505.20741>.
[URGENT2025]
Saijo, K., Zhang, W., Cornell, S., Scheibler, R., Li, C.,
Ni, Z., Kumar, A., Sach, M., Fu, Y., Wang, W.,
Fingscheidt, T., and S. Watanabe, "Interspeech 2025 URGENT
Speech Enhancement Challenge", INTERSPEECH 2025, n.d.,
<https://arxiv.org/abs/2505.23212>.
[UTMOS] Saeki, T., Xin, D., Nakata, W., Koriyama, T., Takamichi,
S., and H. Saruwatari, "UTMOS: UTokyo-SaruLab System for
VoiceMOS Challenge 2022", INTERSPEECH 2022, pages 4521--
4525, n.d., <https://www.isca-
archive.org/interspeech_2022/saeki22c_interspeech.pdf>.
Authors' Addresses
Laura Lechler
Cisco Systems
United Kingdom
Email: llechler@cisco.com
Kamil Wojcicki
Cisco Systems
Australia
Email: kamilwoj@cisco.com