Minutes IETF120: mlcodec: Fri 16:30
minutes-120-mlcodec-202407261630-00
| Meeting Minutes | Machine Learning for Audio Coding (mlcodec) WG |
|---|---|
| Date and time | 2024-07-26 16:30 |
| Title | Minutes IETF120: mlcodec: Fri 16:30 |
| State | Active |
| Other versions | markdown |
| Last updated | 2024-08-05 |
Machine Learning for Audio Coding (mlcodec) Working Group
MLCODEC @ IETF 120 in Vancouver, BC, Canada
09:30-11:30am PDT Friday, July 26, 2024 in room Plaza B, 2nd floor
Chairs: Greg Maxwell, Mo Zanaty
Area Director: Murray Kucherawy
Note Taker(s): Jonathan Lennox
Join Meeting: https://meetings.conf.meetecho.com/ietf120/?session=33199
Onsite Tool : https://meetings.conf.meetecho.com/onsite120/?session=33199
Take Notes : https://notes.ietf.org/notes-ietf-120-mlcodec
Chat Room : https://zulip.ietf.org/#narrow/stream/mlcodec
Add Calendar: https://datatracker.ietf.org/meeting/120/session/33199.ics
Agenda
Administrivia - Chairs, 5 min
Note well, agenda bash, draft status
Opus extension mechanism - Timothy Terriberry, 25 min
draft-ietf-mlcodec-opus-extension
- Repeat These Extensions:
Jonathan: The idea looks good. I didn't quite follow the algorithm, but
since you're the one who has to implement it I don't care that much.
Mo: Is the position of the RTE extension significant?
Tim: The position is significant, you repeat all extensions you've seen
so far in the current frame
Jonathan: This is an audio codec; implementation complexity to make
things smaller is the name of the game. Just make sure you have a full
test case.
Tim: I have more lines of tests than code for this.
Jean-Marc: Contiguous extensions aren't needed, only issue is
implementation complexity, I want to take time to evaluate that, but
otherwise I have no issue, it makes sense.
Tim: There are some tricky corner cases
Jean-Marc: Can you describe the tricky cases?
Tim: E.g. if there's padding before an RTE you don't want to repeat the
padding
Jean-Marc: What happens if there's more than one RTE in the same frame?
Should we disallow that?
Tim: You can say an encoder MUST NOT, but the decoder still has to
handle it; currently I only repeat extensions after the most recent RTE.
It'd be silly to do it multiple times in the same frame but it doesn't
hurt anything.
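The behavior Tim describes can be sketched roughly as follows. This is a minimal illustration drawn from the discussion, not the draft's actual encoding: the extension identifiers ("RTE", "PAD"), the tuple representation, and the reset-after-the-most-recent-RTE rule are all assumptions.

```python
def expand_rte(frame_extensions):
    """Illustrative sketch of 'Repeat These Extensions' (RTE): an RTE marker
    stands for every non-padding extension seen since the most recent RTE in
    the current frame. "RTE" and "PAD" are placeholder identifiers."""
    decoded = []
    repeatable = []  # extensions seen since the last RTE in this frame
    for ext_id, payload in frame_extensions:
        if ext_id == "RTE":
            decoded.extend(repeatable)  # repeat what we've seen so far...
            repeatable = []             # ...but only since the most recent RTE
        elif ext_id == "PAD":
            decoded.append((ext_id, payload))  # padding is never repeated
        else:
            decoded.append((ext_id, payload))
            repeatable.append((ext_id, payload))
    return decoded
```

Under this reading, a second RTE in the same frame repeats only the extensions that appeared after the first one, matching Tim's "doesn't hurt anything" remark.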
Jean-Marc: More than one extension of the same type in the same frame,
they both get repeated?
Tim: Yes
Jean-Marc: It's possible we wouldn't use RTE for any of our current
extensions, if the first thing that uses it is five years from now, we
wouldn't want anything to be broken.
Tim: Other than writing good tests I don't have a good solution for
that.
Mo: What was the motivating use case for this?
Tim: Original motivation was an enhancement that adds side information.
From our latest experiments the enhancement information is already not
paying for itself.
Tim: Possibly also dynamic range control metadata. There are certainly
going to be other extensions people want to encode on every frame.
Mo: I'm sensing some concern about complexity, Tim can you write up the
spec text? Then people can evaluate.
Tim: I can do that.
- Extension numbering
Mo: I sent an alternative mapping to the list, but I think your mapping
works. I thought you were still subsetting 8..63?
Tim: 3 bits of that are length
Tim: If everyone wants a 2-byte extension, maybe we will run out.
Jean-Marc: I notice 6 and 7 are reserved, including their length, what
does that mean?
Tim: If you see unsafe extensions, throw the whole extension away; you
don't know how to parse it.
Jean-Marc: That might make sense if there's another extension we need
Mo: Any objections to going with this proposal?
[No objections]
Mo: Tim, go ahead and put this in the spec
Mo: Reminder, we kept this draft open because we wanted to get
information from extension authors, I think we're getting closer to
freezing it
Tim: If you leave it open I'll keep coming up with more problems to
solve
Deep REDundancy - Jean-Marc Valin, 25 min
- Table representation
Tim: I am a fan of fixed point over floating point
Jean-Marc: One thing to keep in mind is that these are trained as
floating point; the fixed point is just dumb rounding.
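Jean-Marc's point can be illustrated with a tiny sketch. The Q15 format and the clipping range here are assumptions for illustration; the draft's actual fixed-point representation may differ.

```python
import numpy as np

def quantize_fixed(weights, frac_bits=15):
    """'Dumb rounding' of float-trained weights to signed fixed point:
    multiply by 2**frac_bits, round, and clip to the int16 range.
    The Q15 format here is an illustrative assumption."""
    scale = 1 << frac_bits
    q = np.clip(np.round(np.asarray(weights) * scale), -scale, scale - 1)
    return q.astype(np.int16)  # the integer q represents q / 2**frac_bits
```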
- Block-sparse weight matrices
Tim: If you already have an accelerator, the complexity is probably not
your top issue.
Jean-Marc: You probably wouldn't want to take a 10x hit, which is why
I've gone conservative on the sparseness
Mo: Is this the result of aggressive quantization? What is the cause of
the sparseness?
Jean-Marc: This is enforced sparseness, not all the connections are
equally important; with the same number of non-zero weights you get
better quality.
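The enforced block sparsity Jean-Marc describes can be sketched as a matrix-vector product over stored blocks only. The block size and the dict-of-blocks layout are illustrative assumptions, not the implementation's actual data structures.

```python
import numpy as np

def block_sparse_matvec(nonzero_blocks, x, n_rows, block=4):
    """Sketch of a block-sparse matrix-vector product: sparsity is enforced
    at training time in fixed-size blocks, so only the stored (non-zero)
    blocks contribute to the output."""
    y = np.zeros(n_rows, dtype=float)
    for (bi, bj), w in nonzero_blocks.items():  # w is a (block, block) array
        y[bi * block:(bi + 1) * block] += w @ x[bj * block:(bj + 1) * block]
    return y
```

The complexity win is that zeroed blocks are skipped entirely, which is why, with the same number of non-zero weights, concentrating them in important connections can buy quality.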
- Normative feature computation
Jan Buethe: I think if they're not normative there should be some
quality test. This may already be necessary because we don't specify the
encoder. I'm skeptical about whether you can change the features,
especially pitch.
Jean-Marc: The question of features is tightly tied to what we say for
the vocoder. If we don't freeze the vocoder we need to say something
there.
Jan: If you start changing these features you need to make sure it works
with all vocoders. I think the same thing is already true for not having
a mandatory encoder.
Jean-Marc: The definition of the features plus the vocoder has to be
sufficient for interoperability, the question is how we shift them.
Mo: Do you have a recommendation on any of these issues?
Jean-Marc: About sparsity, I think we should keep sparse decoders. How
we specify them I don't have a strong opinion - I would tend to lean
toward matching what the implementation has; that means we're less
likely to have errors.
Mo: If you have an opinion you should state it on the list, I think
there's not enough experts to make a well-informed decision on this,
people will have a better time starting with an opinion and digging into
whether they agree with it.
Jean-Marc: I'll write the draft based on my own design decisions, then
people can bash it.
Mo: Just note where you have design decisions so people can evaluate it.
Where are you leaning on whether the vocoder is normative?
Jean-Marc: Definitely leaning toward not specifying the vocoder
algorithm. Still experimenting with specifying requirements, taking test
sequences and seeing whether the output is within certain boundaries.
Mo: So test conditions or evaluation criteria?
Jean-Marc: Yes
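The kind of requirement Jean-Marc is experimenting with could look roughly like this. The per-sample bound representation is purely an assumption for illustration; how bounds would actually be expressed is exactly what is still being explored.

```python
import numpy as np

def conforms(decoder_output, lower, upper):
    """Sketch of a non-bit-exact conformance check: decode a normative test
    sequence and verify every output sample falls within specified bounds,
    rather than matching a reference vocoder exactly."""
    out = np.asarray(decoder_output)
    return bool(np.all((out >= lower) & (out <= upper)))
```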
Jonathan: Should the weights file be a MIME media type?
Jean-Marc: This is more like source code, it's never expected to be
distributed on the wire.
Greg: So implementations would change the format?
Jean-Marc: The format is designed for the current implementation, but
other implementations would probably want to change it.
Mo: Before publishing the weights, should we specify the model and
training set that these weights came from?
Jean-Marc: All of the data and scripts are publicly available, some of
the options are not, ideally it should be repeatable but there's a lot
of random stuff in the training. I'm not sure what's the best way to do
it. There's a way to verify that the weights are a local minimum for
that training data.
Mo: I think if you could document where that came from and how to
reproduce it, I think it'd give a lot more confidence from the
community.
Greg: As an individual, in AI right now, there's a big debate about what
counts as open source. This isn't a case where we have terabytes of
training data. It'd be good if it was reproducible, even if it won't be
bit-exact.
Jean-Marc: The initialization is random, and it depends on the exact
versions of the tools. There are three ways we could go:
- An IETF-trusted system to run the training on.
- Show that continuing training on this data, the gradient is zero.
This doesn't prove it's not just a local minimum though.
- Have multiple groups train on the same data and show we get the same
value; not sure if that's possible. There's a lot of random state.
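The second option, checking that the published weights are a stationary point of the training loss, might be sketched like this. The `grad_fn` interface, the tolerance, and the toy loss in the usage example are all illustrative assumptions.

```python
import numpy as np

def is_stationary(grad_fn, weights, data, tol=1e-5):
    """Verify that the gradient of the training loss at the published
    weights is numerically zero, i.e. training has converged to a
    stationary point. As noted above, this doesn't rule out that the
    weights are merely one local minimum among many."""
    g = np.asarray(grad_fn(weights, data))
    return float(np.linalg.norm(g)) <= tol
```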
Jan: There are a lot of blogs about making network training repeatable;
it's notoriously difficult, random seeds don't get passed to
subprocesses, etc. The best we can do is have a Docker image: you train
it, you get within epsilon in the L2 distance.
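Jan's acceptance criterion could be expressed very simply; the hard part is choosing epsilon, and the value below is an arbitrary placeholder.

```python
import numpy as np

def retraining_matches(w_ref, w_new, eps=1e-3):
    """Sketch of the check Jan describes: instead of bit-exactness, accept a
    retrained model whose weights are within epsilon of the reference in L2
    distance. The value of eps here is an arbitrary placeholder."""
    diff = np.asarray(w_ref, dtype=float) - np.asarray(w_new, dtype=float)
    return float(np.linalg.norm(diff)) <= eps
```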
Jean-Marc: Who would be interested in helping to make a training model
that's repeatable?
[Mo, Greg raise hands]
Jean-Marc: The current models have been trained with a single NVIDIA
4090 GPU and a single machine with 192 GB of RAM, on 100 GB of training
data.
Mo: So you can do it with a beefy workstation, you don't need a cloud or
an entire company.
Jean-Marc: This is literally done on the machine in my office.
Jean-Marc: Not everything is perfectly automated, some are multiple
command lines where you check if everything has converged, no trivial
way to automate it.
Mo: This is not in the current draft?
Jean-Marc: Training procedure is not documented, no.
Mo: Can we get that somewhere?
Jean-Marc: Yes. All the code is in a Git repo, I can provide the hash
and repo.
Greg: At this point we're more interested in participation than
specification. Other people can try.
Mo: If multiple people try and also agree it's intractable, then we can
decide how to move forward from there.
Jean-Marc: Two separate questions, how do we specify weights, and how do
we decide they're not dodgy?
Mo: Not so much dodgy but understandable, can people understand how this
was done? If weights are reproducible, they're just an artifact, don't
need to be specified.
Jean-Marc: There's a difference between we need these weights and we
need this behavior.
Greg: You're already accepting that if you say people can use different
precisions of the weights?
Jean-Marc: There's a continuum. We don't want bit-exactness, but saying
you can retrain totally different weights, we'd need to be very
confident in our test vectors.
Mo: So three decision points: first, are we ok with specifying weights
as a binary blob with no way to reproduce them; second, how do we
specify them, as normative weights or as a procedure for generating
them; and third, if they are a binary blob, how does the IETF publish
that if it's too big for an RFC?
Mo: Can you provide that information for people who want to get started?
Jean-Marc: Sure. Most of it is already there in Git, just need some
command lines.
Jonathan: People also might want to evaluate whether the training data
is well-chosen.
Jean-Marc: The training data is an open-source set, represents many
languages, represents what we could find in free training data.
Jan: What you can do for the data, you can save all the steps that you
do, save every batch you put into your model, people could trace back
where the data came from. Could validate within floating-point
precision.
Jean-Marc: That would be hundreds of terabytes.
Cullen: I'm glad to help figure out how to get whatever we do through
the IETF process.
Mo: We're setting a precedent for how any standards body specifies ML
anything. I don't see a lot of threat vectors on this, but imagine if
you were training an ML model for good or bad URLs.
Jean-Marc: Is there precedent at IETF when a random value is needed?
Cullen: You see this in security protocols, we take some string that's
hard to manipulate, then feed it through a common hash algorithm. But in
general if we can get that this is not too dubious, various people have
validated, I think we can get it through the process. But this is
setting a lot of precedent, we're going to have a lot of hardship.
Mo: There's a lot of randomness, just from e.g. asynchronous processes,
that wouldn't be reproducible. But I'm hopeful we can at least get
close.
Speech coding enhancements - Jan Buethe, 25 min
draft-buethe-opus-speech-coding-enhancement
Jonathan: Is this standardizing anything?
Jan: Just to make it conformant.
Jan: There was a test to add side information, but it didn't pay for
itself. It might be interesting to add for music
Jean-Marc: For speech, we can't do lower than 6 kbps. NoLACE has been
working so well that at 9 kbps it's close to transparent. The range
where it's relevant is 6-8 kbps, and we have 1 kbps of overhead.
Mo: Will there be anything for us to standardize if you're of the
opinion that a non-signalling-based method works reasonably well?
Jan: The conformance requirement is a standard. You should have a
safeguard that you don't put something in that doesn't work.
Jean-Marc: There are a few things we want to guard against, e.g. people
deploying anything without validating, another would be systems relying
on a modified encoder and interpreting things differently; we just want
to set a boundary on what is ok to do and what is not.
Tim: Have you thought about an extension that would turn this off on a
frame-by-frame basis?
Jan: I thought about it, but I don't see the use case. Maybe also
because I haven't found a problem with the enhancement methods yet.
Jean-Marc: I see a few issues with turning it on and off; you spend
quite a few bits signaling it. I'm not sure it's a good idea unless
there's something reasonably compelling.
Greg: Have you done more music testing? That seems likely to be the area
where there's a problem.
Jan: Music is unlikely to be coded by the speech codec, but I can run
the SQAM CD test cases through it.
Jean-Marc: Music is going to be bad at 6 kbps whatever you do.