Minutes IETF123: mlcodec: Thu 15:00
minutes-123-mlcodec-202507241500-00
| Meeting Minutes | Machine Learning for Audio Coding (mlcodec) WG |
|---|---|
| Date and time | 2025-07-24 15:00 |
| Title | Minutes IETF123: mlcodec: Thu 15:00 |
| State | Active |
| Other versions | markdown |
| Last updated | 2025-08-01 |
Machine Learning for Audio Coding (mlcodec) Working Group
Meeting minutes for MLCODEC @ IETF 123 in Madrid, Spain
17:00-19:00 UTC+2 Thursday, July 24, 2025, main floor room Patio 1
Chairs: Greg Maxwell, Mo Zanaty
Area Director: Orie Steele
Note Taker: Benson Muite
____________________
Administrivia - Chairs, 5 min
Note well, new area director, agenda bash, draft status
Opus extension mechanism - Timothy Terriberry, 10 min
draft-ietf-mlcodec-opus-extension
- 04 released yesterday
- Ready for last call?
MZ: No objections to last call. Will start last call this week.
Deep REDundancy - Jean-Marc Valin, 20 min
draft-ietf-mlcodec-opus-dred
after slide 2
ML: Will the new weights be the final weights?
JMV: Expect them to be similar to the final weights, but they are not final.
Jonathan Lennox: How will the version id change?
JMV: The version id changes; the experimental portion of the id will be
removed in the final version.
after slide 3
q?: Has computational complexity decreased?
JMV: Yes, one multiply per weight.
q?:
JMV: More sparsity in the GRU and extra layers. The convolutional layers
have reduced dimensionality, achieved by adding an additional layer
before them.
q?: Is this captured in the draft or on the list?
JMV: OK, will capture what caused the reduction.
after slide 4
MZ: Will the draft incorporate these vectors, or wait until they are
finalized before incorporating?
JMV: Draft has some of this content, but needs more work with respect to
the test vectors. Will update the draft.
MZ: Will you describe the test vectors in more detail? Reference the repo
for now, but will need to update to the final location for the test
vectors file.
JMV: Need to figure out where to publish final test vectors file.
after slide 5
Timothy Terriberry: How do the two models compare against the vocoder
thresholds?
JMV: Both pass with different test vectors, as the new models have a
different format.
Yusuf Isik: When you compare the models, are there subjective quality
tests? If these are not final, are any major changes planned?
JMV: Using objective metrics to compare the models. Should be good
enough for the current purpose, and we trust this because the original
tests also included qualitative tests.
MZ: Are the tests binary pass/fail?
JMV: Yes, pass/fail. They tell you how close you are to the reference
quantizer. They also compare two decoders: the reference-decoded output
and your own decoded output.
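A binary pass/fail check of this kind might be sketched as below. The function name, the RMS metric, and the threshold value are illustrative assumptions, not taken from the draft or the reference implementation:

```python
import numpy as np

def passes_test_vector(decoded, reference, max_rms_error=0.01):
    """Illustrative pass/fail test-vector check: compare an
    implementation's decoded output against the reference decoder's
    output and pass only if the RMS error stays under a threshold
    (the metric and threshold value here are hypothetical)."""
    decoded = np.asarray(decoded, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    rms = np.sqrt(np.mean((decoded - reference) ** 2))
    return bool(rms <= max_rms_error)
```

An exact-match decoder trivially passes; any decoder whose output drifts beyond the threshold fails, which is the binary behavior described above.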
JMV: Upcoming changes - looking for extra sparsity. Will need to figure
out how to do the final model. Contact us if interested in testing.
q?: More feedback from listening tests on the original models?
JMV: Good to verify that what should be the final model is good. Will
document how the final model performs.
after slide 6
Timothy Terriberry: In the original Opus draft, handling packet loss was
non-normative. OK to have this optional.
JMV: OK. Comparing with the original, but want it optional.
MZ: Not recommending, but saying if you want to, you can.
?: Provide a tool to detect errors.
JMV: Misalignments should show up. MAY or SHOULD; still looking for
suggestions.
Timothy Terriberry: Thinks SHOULD is the right level. Some
implementations might not need to handle packet loss.
JMV: If you make a very good implementation, you should be able to
detect this.
Timothy Terriberry:
JMV: Definitely not MUST.
MZ: Guidance
JMV: Probably SHOULD unless get more feedback.
after slide 7
MZ: Chairs will take a decision on where to publish weights. Will ask a
few others and look for an archival location. Length makes using base64
in text infeasible.
JMV: Both weights and test vectors. For Opus they were in the
proceedings, which was not ideal.
MZ: The proceedings is not an ideal place. The Opus method is not great
for archiving. When is good enough good enough? Maybe discuss after the
final presentation.
Timothy Terriberry: Proceedings url was not as permanent as expected.
Speech coding enhancements - Jan Buethe, 20 min
draft-ietf-mlcodec-opus-speech-coding-enhancement
MZ: What is your distortion?
JB: Take energy in bands, apply a noise floor,
then look at L2 norm overtime.
MZ: Is it in the draft?
JB: Not in the draft yet; it is in a reference Python file in a branch.
MZ: Plan to add it to the draft pending feedback?
JB: If people are supportive, will add it to the draft.
MZ: What is the purpose?
JB: Illustrate that if you have a bad extension, the test will fail.
MZ: Strictly lower?
JB: Pathological corner case. A SHOULD requirement.
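JB's description of the distortion (energy in bands, a noise floor, then an L2 norm over time) might be sketched roughly as follows. The band layout, the noise floor value, and the log-domain comparison are assumptions here; the actual definition lives in the reference Python file mentioned above:

```python
import numpy as np

def band_distortion(ref_energies, test_energies, noise_floor=1e-4):
    """Rough sketch of the described distortion measure. Inputs are
    per-frame, per-band energy arrays of shape (frames, bands).
    The floor value and log-domain comparison are assumptions."""
    # Apply a noise floor so near-silent bands do not dominate the error
    ref = np.log(np.maximum(ref_energies, noise_floor))
    test = np.log(np.maximum(test_energies, noise_floor))
    # L2 norm of the band-energy error over time
    return float(np.sqrt(np.mean((ref - test) ** 2)))
```

With this shape, an enhancement that wrecks the band energies produces a large distortion and would trip a threshold test, which matches the stated purpose that a bad extension should fail.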
Yusuf Isik: Just for this change, or for any enhancement methodology?
Plausible high frequencies but still good speech. What if you combine
low and high frequencies?
JB: No plan to specify any enhancement algorithms; this should apply to
all of them. Not high-band correlation, but energy correlation.
Comparing the signal in the high band, it would be uncorrelated; looking
at the energy in the high band, they should be correlated.
There can be artifacts if you combine high and low bands; improvement is
not guaranteed. It is a guard rail. Would need to do qualitative tests.
JMV: This new proposal was enhancement-agnostic in the original draft.
Should work for generative methods.
q?: Do you expect to use the same extension for wide and narrow bands?
JMV: Not applicable.
JB: It is clarified for wideband-to-higher-bandwidth extensions.
slide 10 end
MZ: When doing quantization experiments, will you try to sparsify some
layers?
JB: Not much benefit when tried before. Not sure if will try again.
MZ: No 3x?
JB: Expecting about 100 MOPS.
MZ: Licensing for test material? Will it be part of the normative spec?
JB: Tests will be part of the normative spec.
MZ: [CHAIRS] Check CC-NC licensing for the EARS data set.
Scalable quality extension - Jean-Marc Valin, 20 min
draft-valin-opus-scalable-quality-extension
JMV: Is draft ready for adoption?
MZ: No opposition in last meeting to adoption. Will start list
confirmation for adoption. No objections in the room.
Speech quality and intelligibility testing - Laura Lechler, 20 min
draft-lechler-mlcodec-test-battery
Jonathan Lennox: Are you looking at double talk/cross talk?
LL: Not today, but it is at the end of the slides.
slide 8
JMV: Unsure about the real-world condition claim. What are you targeting
in practice? Many software applications use speech enhancement modules.
LL: There might still be some residue of reverberation and noise even
with speech enhancement modules. There should be a way to
MZ: Thought JM tried this. Was that work irrelevant?
JMV: Should measure it. Should not care too much. But if it can be
improved for free, would do so.
?: Original sound. Curious. DRED does not do great at recovering it.
LL: In a conference call, you may want some of the noise and reverb
preserved to give a more authentic feel of the room.
?: Was training DRED to dereverberate; the difference sounded bad.
JMV: If you train Opus to remove noise, when DRED gets added you get a
choppy signal. Do not need to remove it in DRED. Not a good result.
slide 11
MZ: Meeting slides have audio as well.
JMV: The characterization of DRED as a generative model is not quite
right. The autoencoder is entirely deterministic. The vocoder is close
to generative, but is very limited in the amount of hallucination due to
the training process and the size of the model.
slide 13
JMV: Is this a lower bound on the word error rate (WER)?
LL: WER?
JMV: If one were to test with full sentences, should this give similar
rankings to a word error rate?
LL: Yes, but this should be more sensitive than WER tests.
JMV: What is the threshold at which conversation breaks down?
LL: No threshold; it depends on the purpose. For DRED I would lower the
threshold. If there is a model, it depends very much on the use case. No
cut-off recommendations.
JMV: At what point do people give up?
LL: That would be interesting to look at.
Yusuf Isik: May want to look at details to see where a product is good
and where it fails.
LL: Going there.
Kamil Wojcicki: Way more sensitive. Very diagnostic. Completely ignores
context and natural conversation issues.
slide 14
JMV: You subtracted two matrices?
LL: Yes
Questions:
Kamil Wojcicki: There are use cases where intelligibility is important,
e.g. digits, names. There are cases where context can help recover
meaning.
JMV: Not a generative error; if the bit rate is too low, you would still
get the same or a similar error.
KW: FARGAN is trained in an adversarial fashion.
JMV: Small adversarial component. Mostly due to feature quantization.
"G" vs "D": pitch voicing got quantized too coarsely.
MZ: Not going to see hallucinations between yes and no. Things that may
be close, "can" or "can't", could be close enough.
JMV: You get those confusions even without a codec. Have variable bit
rate within the redundancy packet to maximize . Entirely deterministic
quantization issues.
KW:
JMV: Compare pure PLC and Opus; the feature predictor is not generative,
it is predictable. The vocoder synthesizes what it has been told to
synthesize.
?: Intelligibility problems. Would be interesting to compare with
FARGAN.
JMV: Have not observed this with FARGAN. The impact of adversarial
training is small. It gives a good vocoder.
Jan Buethe: Very hard to translate approach to what would happen when
using DRED. Very low bitrate is used rarely. Results are interesting
though. For communication, absolute category rating might be more
applicable.
LL: Agree, providing tools to help improve and point out difficulties.
JB: "Test Battery"?
LL: Methods to compare new codecs. Provide new tools, not pass/fail
tests for codecs. Looking for input.
JB: Hard to put things in linear order.
LL: Always tradeoffs.
Jonathan Lennox: How coarse?
JMV: Even at 2 kb/s the quantizer introduces more error than the
vocoder.
JMV: Could do other bit rates, but that defeats the purpose of
redundancy.
KW: Thought a standalone version may be useful. How do you compare
whether you have reached a similar level of quality? Packet loss
scenarios are the next thing we want to examine, and the settings to use
in such tests.
LL: Let us know if have other codecs to evaluate.
JMV: What is the purpose of the document? Do you want it to go through
IETF process? Characterization of DRED?
LL: Not decided yet. Not expecting a forced requirement. A
recommendation. Tools for people to use.
MZ: Interesting independent evaluation. Does not need a new charter
item. But interesting work. Informing other work. Not a WG deliverable.
JMV: Could be a section in the DRED document.
?: DRED document is not normative on behavior.
Yusuf Isik: These test-vector-based assessments might not be sufficient.
They still put responsibility on the researcher to evaluate. If using a
crowdsourcing platform, different groups may have trouble reproducing
the work.
MZ: Binary tests in test vectors. This work helps qualitative
improvements.
JMV: They have orthogonal goals.
LL: Input on specifics for packet loss testing.