Machine Learning for Audio Coding (mlcodec) Working Group

IETF 118 Prague, CZ
Tuesday, November 7, 2023
17:00 - 18:00 Prague time
08:00 - 09:00 Pacific Time
Room: Berlin 3/4

Meeting link: https://meetecho.ietf.org/client/?session=31749
Notes: https://notes.ietf.org/notes-ietf-118-mlcodec

Chairs: Greg Maxwell, Mo Zanaty
Notetakers: Emily Heron
Chat Scribes: Jonathan Lennox

Agenda

Administrivia (Chairs) 5 min
Note Well
Agenda Bash

No agenda bashing

Deep REDundancy draft-valin-opus-dred (Jean-Marc Valin) 15 min
Status update

Start at 5:02

Recap from last session
Goal: Make Opus robust to long bursts of packet loss
Proposal: code large amounts of redundant audio

Proposed Format:
Use extension code 32 - include more than just one byte
Offset: position of redundancy in packet
Decode until fewer than 8 bits remain

  Mo: Individual comment: is there any coupling between Opus frame sizes and this redundancy, or are they independent?
  A: Completely independent. The redundancy is encoded in chunks of 40 milliseconds.
  M: So no coupling?
  A: Not simple, but no.

Normative Aspects:
Have a normative specification for the part that converts bits into
features.

Implementation Update:
Improved quality from vocoder
Complexity reduced from 10% to 3% CPU for high loss
Weights down from 17 MB to 4 MB

Tim Terriberry: Of the 4 megabytes, what fraction is the
bits-to-feature decoder?
A: About one megabyte. Needs to be smaller.

Open Questions

  1. Should there be a maximum duration allowed?

    1. technically we could do up to approx. 10 min
    2. proposal: no hard limit
  2. What are the lowest and highest useful bitrates?

    1. currently support 10 to 100 kb/s for 1 second of redundancy

Jonathan Lennox: General question: I am concerned about what happens if
I splice two streams. Is there a way to splice redundancy history?
A: Interesting use case, had not thought of this before

JM will consider how splicing will work. Can we actually merge the
redundancy and get redundancy for talkers A & B? This is what he needs
to think through.

Jonathan Lennox: You said that the frame sizes don't matter. If you are
encoding at a smaller frame size, there is a small gap in time between
the redundancy block and the current block, which is indicated by the
offset time, and PLC should conceal it.

Mark Harris 17:22
The Security Considerations should highlight the danger of potentially
including earlier audio that was intended to be cut out, perhaps
confidential information, that can be decoded from DRED.

A: If it was in the redundancy, it was already included in the Opus
packet.

Tim: For quantizer slope, would it be useful to specify a floor higher
than the minimum?
A: Open to that, it would be more bits. Would welcome feedback.
T: At 10 minutes of redundancy at high bitrates, you are going to hit
the minimum with any non-zero slope. There will be some period of time
between 1 second and 10 minutes where you might want to stop higher in
bit rate than the minimum.
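
Tim's point can be illustrated with a minimal sketch. It assumes a
hypothetical linear model in which an abstract quality level drops by a
fixed slope per 40 ms chunk until it reaches a floor. The function
names, the level scale, and the example values are assumptions made for
illustration; this is not the quantizer rule from the draft.

```python
import math

CHUNK_MS = 40  # chunk duration stated in the discussion above


def chunks_for(duration_s: float) -> int:
    """Number of 40 ms chunks needed to cover a redundancy duration."""
    return int(round(duration_s * 1000 / CHUNK_MS))


def quality_level(chunk_index: int, start: float, slope: float, floor: float) -> float:
    """Hypothetical model: the level falls linearly with chunk age, clamped at floor."""
    return max(floor, start - slope * chunk_index)


def first_chunk_at_floor(start: float, slope: float, floor: float) -> int:
    """Index of the first chunk whose level has reached the floor (slope > 0)."""
    return math.ceil((start - floor) / slope)


if __name__ == "__main__":
    one_second = chunks_for(1)      # 25 chunks
    ten_minutes = chunks_for(600)   # 15000 chunks
    # Illustrative values only: start level 15, slope 0.125 per chunk.
    for floor in (0.0, 4.0):
        idx = first_chunk_at_floor(15.0, 0.125, floor)
        print(f"floor={floor}: reached at chunk {idx} ({idx * CHUNK_MS / 1000} s), "
              f"level at 10 min = {quality_level(ten_minutes, 15.0, 0.125, floor)}")
```

With these made-up values the level reaches the minimum after a few
seconds, long before the 15000th chunk at 10 minutes, which is why a
floor above the minimum could matter for long redundancy durations.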

Speech coding enhancements draft-buethe-opus-speech-coding-enhancement
(Jan Buethe) 20 min
Status update

Start 5:28

Opus Speech Coding Enhancement
Focus on quality today
Gold standard for evaluation: subjective listening test
Very costly
Metrics under consideration:
PESQ
WARP-Q
MOC
NOMAD
Comparison to listening test results (MOS)
Not perfect, but reasonable.
Detecting Degradation
Goal: Distinguish good from bad enhancement models
All four metrics seem capable of separating good models from bad
models.
NOMAD compares favorably to the other metrics but is difficult to
standardize.
WARP-Q and MOC are easier to standardize
Next Steps:
Algorithm Development
Standardization

Happy to take opinions

Questions?

Tim Terriberry: Table slide: do the metrics disagree about whether there
is an improvement at high bit rates?
A: Yes. I believe the metrics are incorrect. Best to have a listening
test at higher bit rates. Led to believe it is a shortcoming of the
metric.

Mo: Some people created composite metrics. Do you see that here?
A: No, these are standalone. Something to look into.

Jean-Marc Valin: The signal is a high-passed version that has gone
through the filter Opus uses internally. That is not the only change the
Opus encoder makes to the original signal. Take some care as to what the
correct reference signal should be.
A: The reason for taking this signal is that you get phase shifts. Will
re-run.
Tim Terriberry: Are you degrading the original input? If the Opus
encoder is doing enhancement to speech, are these methods that get closer
to the original input undoing the enhancements?
A: Have to check.
Tim Terriberry: Should there be, though? Will you be penalized if you
don't undo what the Opus encoder did?

JM: Only exception should be the high pass filter.

Opus extension mechanism draft-ietf-mlcodec-opus-extension (Timothy
Terriberry) 15 min
Status update

Start: 5:44

Draft Status: Published as WG draft

Updates since SF
Reserved ID 127
Quoted text from RFC 5576 "media-level format parameters MUST NOT be
carried over blindly"
Q: Jonathan Lennox: What frame is the DRED associated with?
A: It would be useful for it to be on the first frame

Q: Jean-Marc: The goal of the frame separator is not for DRED, but for
some of the extensions we are planning.

Clarified that support for extension IDs 0 and 1 does not need to be
explicitly signaled via a=fmtp

Two Future Extension Mechanisms?
ID=0, L=0
ID=127

Changes Not Made
Did not split out IANA registration for L=0 and L=1 modes for IDs
2...31
Did not switch to QUIC varint for extension IDs
Did not reserve "unsafe" extension IDs

Mo: Not specifically enamored with QUIC varints; just in general there
are many variable-length codings that are popular.

Worth considering other use case proposals re: QUIC varint for extension
IDs
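
For readers unfamiliar with the encoding under discussion, here is a
minimal sketch of the QUIC variable-length integer format as defined in
RFC 9000, Section 16: the top two bits of the first byte select a total
length of 1, 2, 4, or 8 bytes, and the remaining bits carry the value in
network byte order. The function names are illustrative, and nothing
here implies how the Opus extension draft would apply such a coding to
extension IDs.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a QUIC varint (RFC 9000, Section 16)."""
    if value < 0:
        raise ValueError("varint values must be non-negative")
    if value < 1 << 6:
        return value.to_bytes(1, "big")                   # prefix 00
    if value < 1 << 14:
        return (value | (0b01 << 14)).to_bytes(2, "big")  # prefix 01
    if value < 1 << 30:
        return (value | (0b10 << 30)).to_bytes(4, "big")  # prefix 10
    if value < 1 << 62:
        return (value | (0b11 << 62)).to_bytes(8, "big")  # prefix 11
    raise ValueError("value does not fit in 62 bits")


def decode_varint(data: bytes) -> Tuple[int, int]:
    """Decode one varint from the start of data; return (value, bytes consumed)."""
    if not data:
        raise ValueError("empty input")
    length = 1 << (data[0] >> 6)   # 1, 2, 4, or 8 bytes
    if len(data) < length:
        raise ValueError("truncated varint")
    value = data[0] & 0x3F         # drop the two length bits
    for byte in data[1:length]:
        value = (value << 8) | byte
    return value, length


if __name__ == "__main__":
    for n in (37, 15293, 494878333):
        encoded = encode_varint(n)
        print(n, encoded.hex(), decode_varint(encoded))
```

The values in the usage loop are taken from the worked examples in
RFC 9000, Appendix A.1.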

Questions?

Waiting for feedback, nothing currently in queue. Currently milestones
say that this is going to the IESG in Dec, which seems soon. Need
readers for document. Feedback needed.

Done 6:00pm