Network Working Group                                     V. Sviridenko
Internet-Draft                                                S. Ikonin
Intended status: Standards Track                               D. Yudin
Expires: February 09, 2012                                   SPIRIT DSP
                                                        August 09, 2011


                           IPMR Speech Codec
                         draft-spiritdsp-ipmr-01.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on February 09, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.










Sviridenko, et al.     Expires February 09, 2012               [Page 1]


Internet-Draft             IPMR Speech Codec                August 2011


Abstract

   This document describes IPMR, a scalable, variable-bit-rate,
   adaptive multi-rate speech and audio codec designed for use in
   IP-based networks.  The codec is suitable for real-time
   communications such as telephony and voice and video conferencing.
   Four different sampling frequencies are supported for encoding the
   audio input signal.  Adaptation to network characteristics is
   provided through control of bitrate, packet rate, packet loss
   resilience and use of discontinuous transmission (DTX).
   IP-MR supports different profiles for input signal content, which
   should be specified during codec initialization: Speech, Audio or
   Auto-detection.  In Auto-detection mode the codec recognizes the
   type of input content and switches to the appropriate Speech or
   Audio mode automatically.



Table of Contents

   1. Introduction ...................................................3
   2. Technical Requirements .........................................4
     2.1. Voice/Audio Quality ........................................4
     2.2. Sampling Rate ..............................................4
     2.3. Adaptive Multi Rate ........................................4
     2.4. Bitrate Scalability ........................................4
     2.5. Packet Loss Resilience .....................................4
     2.6. Delay ......................................................4
     2.7. DTX ........................................................5
   3. IP-MR Codec Description ........................................5
   4. Algorithm Overview .............................................8
     4.1. Coding profiles ............................................8
     4.2. Mixed CELP/MDCT codec ......................................9
     4.3. Scalable CELP-based encoder ...............................11
     4.4. Scalable CELP-based decoder ...............................13
     4.5. Scalable MDCT-based encoder ...............................14
     4.6. Scalable MDCT-based decoder ...............................16
   5. Security Considerations .......................................19
   6. Informative References ........................................20
   7. IANA Considerations ...........................................21
   Authors' Addresses ...............................................22

1.  Introduction

To ensure high-quality audio transmission over IP, a codec has to
overcome a set of problems and obstacles. The best codec should work
over a wide range of bitrates with relatively small delay, should
deliver high-quality speech even under packet loss and poor network
conditions, and should provide wideband quality (a must for today's
business-level communication) as well as ultra-wideband quality for
next-generation applications. This document describes the IP-MR codec,
a scalable, variable-bit-rate, adaptive multi-rate speech and audio
codec designed for use in IP-based networks.

2. Technical Requirements
We agree with some of the technical requirements described in [SILK]
and include them in this section. The Internet Wideband Speech/Audio
Codec must be optimized for real-time communications over the
Internet, and must have the flexibility to adjust to the environment it
operates in. Below is a list of the main requirements for the codec.

2.1. Voice/Audio Quality
The codec should provide a quality/bitrate trade-off that is
competitive with other state-of-the-art codecs. At low bitrates it
should deliver good speech quality in any language. At high bitrates
the quality should be excellent for any audio signal, including music,
under standard conditions.

2.2. Sampling Rate
Audio bandwidth is determined by the codec sampling frequency: 8 kHz
for narrowband voice (PSTN) and 16 kHz for wideband. Wideband speech
sounds much more natural and comfortable, and wideband codecs are
better suited to IP communication. However, sometimes there is not
enough bandwidth to allow a 16 kHz sampling frequency, and the codec
must be able to switch to 8 kHz. Moreover, the codec should support
ultra-wideband (20 kHz and more) for next-generation high-end quality.

2.3. Adaptive Multi Rate
The codec should offer a set of bitrates with fine enough granularity
to fit different channel capacities. The bitrates should be
adjustable in real time. The codec should be capable of running at
bitrates starting from 6 kbps.

2.4. Bitrate Scalability
The codec should have a bitrate scalability feature (an embedded or
layered bitstream structure) so that voice traffic can be reduced
during transmission without re-encoding. This is necessary for dynamic
congestion control, multicast and conferencing applications. On the
other hand, the price of scalability is lower compression efficiency
and higher computational complexity at the same bitrate. It is
therefore desirable that the scalability feature can be switched off
when it is not needed.

2.5. Packet Loss Resilience
The codec should be capable of running with little error propagation,
meaning that the decoded signal after one or more packet losses is
close to the decoded signal without packet losses after no more than
two additional packets. The codec should have a packet loss resilience
that is adjustable in real-time, where a lower packet loss resilience
setting improves the quality/bitrate trade-off.

2.6. Delay
For comfortable conversation the codec must have an algorithmic delay
of no more than 50 ms.



2.7. DTX
The codec should be capable of using Discontinuous Transmission (DTX)
where packets are sent at a reduced rate when the input signal contains
only background noise.

3.  IP-MR Codec Description
The IP-MR codec is a scalable, variable-bit-rate, adaptive multi-rate
speech and audio codec designed for use in IP-based networks. The codec
is suitable for real-time communications such as telephony and voice
and video conferencing.

Sampling rate
IP-MR supports three sampling rate modes: 8, 16 and 32 kHz.

Speech/Audio modes
IP-MR supports different profiles for input signal content, which
should be specified during codec initialization: Speech, Audio or
Auto-detection. In Auto-detection mode the codec recognizes the type
of input content and switches to the appropriate Speech or Audio mode
automatically.

Voice Quality
The Mean Opinion Score (MOS) for the speech quality of this codec is
about 3.7-4.4 (for clean speech), depending on the current mode and
average bit rate. At higher bitrates the codec achieves FM quality on
generic audio content.

Algorithmic delay
The frame length is 20 ms. Algorithmic delay varies from 35 to 50 ms
depending on the coding profile.

Adaptive Multi Rate
Depending on the sampling rate, IP-MR has 8 or 10 bitrate modes between
6 and 120 kbps, which can be changed in real time according to the
current network conditions.

+--------------------------------------------------------------------+
|Sampling |   Coding    | Frame |Algorith.| Number | Avg. Bit Rates  |
|  Rate   |   profile   | size  |  Delay  |of Rates|for active speech|
+--------------------------------------------------------------------+
|         |   Speech/   |       |         |        |                 |
|         |     Auto-   |       |         |        |                 |
|         |  -detection |       | 35 ms   |        |                 |
|         |    with     |       |         |        |                 |
|         |     short   |  20   |         |        |                 |
|         |     delay   |       |         |        |                 |
| 8 kHz   |-------------|       |---------|    8   |   6 - 50 kbps   |
|         |    Audio/   |  ms   |         |        |                 |
|         |     Auto-   |       | 50 ms   |        |                 |
|         | -detection  |       |         |        |                 |
|         |    with     |       |         |        |                 |
|         | long delay  |       |         |        |                 |
|--------------------------------------------------------------------|
|         |     Speech/ |       |         |        |                 |
|         |     Auto-   |       |         |        |                 |
|         |  -detection |       | 36.875  |        |                 |
|         |    with     |       |  ms     |        |                 |
|         | short delay |  20   |         |        |                 |
| 16 kHz  |-------------|       |---------|   10   |   6 - 70 kbps   |
|         |    Audio/   |  ms   |         |        |                 |
|         |   Auto-     |       |  50 ms  |        |                 |
|         | -detection  |       |         |        |                 |
|         |  with long  |       |         |        |                 |
|         |  delay      |       |         |        |                 |
|--------------------------------------------------------------------|
|         |    Speech/  |       |         |        |                 |
|         |   Auto-     |       |         |        |                 |
|         | -detection  |       | 37.8125 |        |                 |
|         |    with     |       |   ms    |        |                 |
|         | short delay |  20   |         |        |                 |
|  32 kHz |-------------|       |---------|  10    |   6 - 120 kbps  |
|         |    Audio/   |  ms   |         |        |                 |
|         |     Auto-   |       |  50 ms  |        |                 |
|         | -detection  |       |         |        |                 |
|         |  with long  |       |         |        |                 |
|         |    delay    |       |         |        |                 |
+--------------------------------------------------------------------+
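
The delays in the table can be cross-checked by converting them to
sample counts. The short sketch below is illustrative only: the names
are hypothetical and the values are copied from the table.

```python
# Illustrative arithmetic only; values are taken from the table above.
FRAME_MS = 20.0

# sampling rate (kHz) -> (short-delay profile ms, long-delay profile ms)
ALGORITHMIC_DELAY_MS = {
    8: (35.0, 50.0),
    16: (36.875, 50.0),
    32: (37.8125, 50.0),
}

def ms_to_samples(duration_ms, rate_khz):
    """Convert a duration in milliseconds to a sample count."""
    return duration_ms * rate_khz

for rate_khz, (short_ms, long_ms) in ALGORITHMIC_DELAY_MS.items():
    print(rate_khz, "kHz:",
          "frame =", ms_to_samples(FRAME_MS, rate_khz), "samples,",
          "short delay =", ms_to_samples(short_ms, rate_khz), "samples,",
          "long delay =", ms_to_samples(long_ms, rate_khz), "samples")
```

Each entry works out to a whole number of samples; for example,
36.875 ms at 16 kHz is 590 samples and 37.8125 ms at 32 kHz is 1210
samples.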

Variable Bit Rate
The encoder's bit rate varies continuously with the actual speech
content (voiced/unvoiced, pauses, stationary/non-stationary voiced,
etc.). Because the encoding adapts to the actual characteristics of
the speech, the IP-MR codec reduces traffic while keeping coding
efficiency. All average bitrates are specified for active speech and
do not take inter-speech (silence) regions into account.



Bitrate Scalability

The coded frame has a layered (embedded) structure. It consists of
multiple coding layers: a base (or core) layer and several enhancement
layers, which are coded independently. Only the core layer is mandatory
for decoding understandable speech; the upper layers provide quality
enhancement. The enhancement layers may be omitted, and the remaining
base layer can still be meaningfully decoded without notable
artifacts. This makes the bit stream scalable and allows the bit rate
to be reduced during transmission without re-encoding.

Bitrate scalability provides additional possibilities for congestion
control. An intermediate network node may modify the IP-MR payload by
dropping some of the layers during transmission to meet the available
bandwidth. When the payload is forwarded with modified content, at
least the base layer must be preserved in the payload delivered to the
receiving side; this guarantees meaningful speech decoding without
resorting to the packet loss concealment procedure.
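
As an illustration of how an intermediate node might thin a frame, the
sketch below keeps the core layer and as many enhancement layers as fit
a byte budget. The frame representation (a list of layer payloads, core
first) and the function name are assumptions made for this example and
are not the IP-MR payload format. The sketch also assumes that a higher
layer is only useful when all lower layers are present.

```python
# Hypothetical in-memory view of one coded frame: a list of layer
# payloads, core layer first. This is NOT the IP-MR payload format.

def drop_layers(layers, max_bytes):
    """Keep the core layer plus as many enhancement layers as fit."""
    if not layers:
        raise ValueError("a frame must contain at least the core layer")
    kept = [layers[0]]                  # the core layer is always kept
    used = len(layers[0])
    for layer in layers[1:]:
        if used + len(layer) > max_bytes:
            break                       # assumed: no gaps between layers
        kept.append(layer)
        used += len(layer)
    return kept

frame = [b"C" * 15, b"1" * 10, b"2" * 10, b"3" * 10]
print([len(layer) for layer in drop_layers(frame, 30)])
```

Note that the core layer is forwarded even when it alone exceeds the
budget, since dropping it would turn the frame into a lost frame.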

--+--------+--------+--------+--------+--------+--------+--------+--
  | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
--+--------+--------+--------+--------+--------+--------+--------+--

  <---- p(n-1) ---->
           <----- p(n) ----->
                     <---- p(n+1) ---->
                               <---- p(n+2) ---->
                                        <---- p(n+3) ---->
                                                 <---- p(n+4) ---->


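In the scheme shown above, each packet p(n) carries frame f(n)
together with a copy of the previous frame f(n-1). Because of the
scalable nature of the IP-MR codec, however, there is no need to
duplicate the whole previous frame; only its core layer needs to be
retransmitted. This reduces redundancy overhead while keeping
efficiency.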
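
The redundancy scheme pictured above, with only the core layer of the
previous frame repeated, might be sketched as follows. The packet
layout (a dictionary with "current" and "redundant_core" fields) and
the function names are hypothetical and do not describe the actual
IP-MR payload format.

```python
# Hypothetical sketch: frames are lists of layer payloads, core first.
# The packet layout below is an assumption, not the IP-MR RTP format.

def build_packet(frames, n):
    """Packet p(n): all layers of f(n) plus the core layer of f(n-1)."""
    packet = {"seq": n, "current": frames[n], "redundant_core": None}
    if n > 0:
        packet["redundant_core"] = frames[n - 1][0]
    return packet

def recover(packets_received, n):
    """Return the layers available for frame n, or None if it is lost."""
    if n in packets_received:
        return packets_received[n]["current"]
    nxt = packets_received.get(n + 1)
    if nxt is not None and nxt["redundant_core"] is not None:
        return [nxt["redundant_core"]]   # core-only quality
    return None                          # fall back to concealment

frames = [[b"core%d" % i, b"enh%d" % i] for i in range(4)]
received = {n: build_packet(frames, n) for n in (0, 1, 3)}  # p(2) lost
print(recover(received, 2))
```

In this example packet p(2) is lost, but frame f(2) can still be
decoded at core-layer quality from the copy carried in p(3).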

Moreover, the speech bits encoded in the core layer are divided into
six classes (A to F) according to their perceptual sensitivity to
errors. Class A contains the most perceptually significant bits. The
bits of this class should be delivered to the decoder in order to
fully exclude error propagation. Class F contains the least
significant bits. Together, classes A to F contain all encoded
parameters of the first (core) encoding layer. These parameters are
sufficient to synthesize speech with near "toll quality".

Using these classes as the introduced redundancy makes it possible to
smoothly adjust the trade-off between overhead and robustness against
packet loss.
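
One way to picture this trade-off is to repeat only a prefix of the
classes (A, then A+B, and so on) that fits a given redundancy budget.
The sketch below is an assumption about how such a choice could be
made; the class sizes are invented for the example, since they are not
given in this section.

```python
# Hypothetical class sizes in bits; the real sizes are not given here.
CLASS_ORDER = "ABCDEF"

def classes_for_budget(class_bits, budget_bits):
    """Pick the longest prefix A, A+B, ... that fits the bit budget."""
    chosen, used = [], 0
    for name in CLASS_ORDER:
        if used + class_bits[name] > budget_bits:
            break
        chosen.append(name)
        used += class_bits[name]
    return chosen, used

example_bits = {"A": 20, "B": 20, "C": 20, "D": 20, "E": 20, "F": 20}
for budget in (0, 45, 120):
    print(budget, classes_for_budget(example_bits, budget))
```

A larger budget protects more classes and gives more robustness; a
budget of zero disables the redundancy altogether.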

DTX
The IP-MR codec supports a Discontinuous Transmission mode for silence
compression. During silence intervals the codec bitrate can be reduced
to 0.3 kbps.

4.  Algorithm Overview

4.1. Coding profiles
IP-MR supports different profiles for the type of input signal
content: Speech, Audio or Auto-detection. In Auto-detection mode the
codec recognizes the type of input content and switches to the
appropriate Speech or Audio mode automatically. At a high level the
encoder consists of three basic modules (see Figure 1).

   - Speech/Music detector: automatically classifies the input content
     as speech or music to select the appropriate coding model.
   - CELP-based speech coder: implements a source-filter model and is
     oriented towards speech content.
   - MDCT-based audio coder: used for general-purpose audio coding.

               +-------------------+
               |Predefined Speech/ |
               |       Audio       |
               |      Profile      |
               +----------+--------+
                          |
                         \|/
               +----------+-------+
  input signal |       Speech/    |
---------------+  Music detector  |
               +---+---------+----+
                  S|        M|
                  P|        u|
                  e|        s|
                  e|        i|
                  c|        c|
                  h|         |
                   |         |
    +..............|.........|..........+
    .             \|/       \|/   coder .
    . +------------+--+   +--+-----+    .
    . |   CELP/MDCT   |   | MDCT   |    .
    . +--------+------+   +----+---+    .
    +..........|...............|........+
               |               |
              \|/             \|/
        +------+---------------+--+
        |        Bitstream        +--->
        +-------------------------+

      Figure 1 High level encoder structure

Depending on the type of input signal (speech or music), different
coding models are used. The type of input signal can be detected
automatically in Auto-detection mode or specified as a predefined
setting during codec initialization. Speech content is coded by the
mixed CELP/MDCT-based model. General audio content is coded by the
pure MDCT-based model.
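
The selection logic described above can be sketched as a small
dispatch function. All names below are hypothetical, and the classifier
passed in is a placeholder rather than the IP-MR speech/music detector.

```python
# Hypothetical names; the classifier is a placeholder, not the IP-MR
# speech/music detector.
SPEECH, AUDIO, AUTO = "speech", "audio", "auto"

def select_coding_model(profile, frame, classify=None):
    """Return which coding model handles this frame."""
    if profile == AUTO:
        if classify is None:
            raise ValueError("Auto-detection needs a classifier")
        content = classify(frame)        # expected to return SPEECH or AUDIO
    elif profile in (SPEECH, AUDIO):
        content = profile                # predefined at initialization
    else:
        raise ValueError("unknown profile: %r" % (profile,))
    return "CELP/MDCT" if content == SPEECH else "MDCT"

print(select_coding_model(SPEECH, frame=[0] * 160))
print(select_coding_model(AUTO, [0] * 160, classify=lambda f: AUDIO))
```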

The decoder performs the inverse operations. First, the compressed
frame goes to the CELP decoder, which extracts the core and extension
layers. Then both the rest of the bitstream and the reconstructed
signal go to the MDCT decoder, which restores the residue and
generates the joint output.


              +----------+  Rest of compressed   +--------+
 Compressed   |          |        data           |        |
   frame      |  CELP    +---------------------->+  MDCT  |
------------->+          |    Reconstructed      |        |
              | decoder  |       signal          |decoder +--OUTPUT->
              |          +---------------------->+        |
              +----------+                       +--------+

                Figure 2 High level decoder structure

In fact, CELP and MDCT are two different decoders and can therefore
work simultaneously. Parallel processing requires only two modules to
be moved out of the decoder structure (see Figure 2): bitstream
demultiplexing and signal mixing.

                           +---------+
                           |   CELP  |      +---------+
                        +->+ decoder +----->+         |
 Compressed            /   +---------+      |   MDCT  |
   frame      +-------+                     |         +--Output-->
------------->| DEMUX |                     | decoder |
              +-+---+-+    +---------+      |         |
                       \   |   MDCT  +----->+         |
                        +->+ decoder |      +---------+
                           +---------+

       Figure 2 High level decoder structure (parallel)


Note that demultiplexing is simple to implement because the size of
the CELP portion of the stream can be calculated without decoding.
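
A minimal sketch of such a demultiplexer is shown below. How the CELP
size is actually derived is not described here, so it is passed in as a
function; the toy rule used in the demonstration (first byte holds the
length) is purely an assumption for the example.

```python
# Hypothetical sketch. How the CELP size is derived from the frame is
# not described here, so it is passed in as a function.

def demux(frame, celp_size_of):
    """Split a compressed frame into its CELP and MDCT regions."""
    n = celp_size_of(frame)
    if not 0 <= n <= len(frame):
        raise ValueError("CELP size out of range")
    return frame[:n], frame[n:]

def toy_celp_size(frame):
    """Toy rule for the demo only: the first byte stores the length."""
    return 1 + frame[0]

celp_part, mdct_part = demux(bytes([3, 10, 11, 12, 20, 21]), toy_celp_size)
print(celp_part, mdct_part)
```

Because the split needs no decoding, the two parts can be handed to
the CELP and MDCT decoders at the same time.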

4.2. Mixed CELP/MDCT codec

The mixed CELP/MDCT codec is composed of two independent codecs, one
CELP-based and one MDCT-based. The first processes the source signal
and feeds the residue to the second. To provide flexible and
transparent coupling between the codecs, the corresponding sampling
rate conversion and frame synchronization procedures are applied.


The resulting bitstream is naturally constructed from two contiguous
regions belonging to the CELP and MDCT codecs respectively. The CELP
codec bitstream has a layered structure (core + extensions), while the
MDCT codec generates a byte-scalable stream.

The next figure shows an example of encoding 16 kHz source material
when the CELP-based encoder operates at an 8 kHz sampling rate.

                                                   Core layer
                  +------------+   +------------+     params
-Input speech-+-->| Downsample +-->|   Scalable +--------------+
 FS=16 kHz    |   |   to 8 kHz |   | CELP-based |              |
              |   +------------+   |  Encoder   +---+          |
              |                    +--+---------+   |          |
              |                       |             |          |
                                 Synth Speech       |          |
              |                       |         Enhancement    |
              |                       |           layers       |
              |                       |           params       |
              |                      \|/            |         \|/
              |            +----------+---------+   |   +------+-----+
              |            | Upsample to 16 kHz |   |   | Core layer |
              |            +-----+--------------+   |   +------------+
              |                  |                  |   | Ext.layer 1|
              |                 \|/                 |   +------------+
              +---------------->(-)                 +-->+ Ext.layer 2|
                                 |                      +------------+
                                 |                      | Ext.layer 3|
                                 |                      +------------+
                            Residual                    |            |
                                 |                      |            |
                                \|/                     |  Scalable  |
            +--------------------+--+                   |  bitstream |
            |      Scalable         |    Scalable       |            |
            |  MDCT-based Encoder   +---bitstream------>|            |
            +-----------------------+                   +------------+

  Figure 3 Structural block diagram of mixed CELP/MDCT encoder
                               (16kHz mode)

First, the input signal is down-sampled to 8 kHz and encoded by the
Scalable CELP-based encoder, which packs the quantized parameters into
a layered bitstream. The difference between the up-sampled synthesized
signal and the original source goes to the Scalable MDCT-based
encoder, which forms the rest of the bitstream.
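
The data flow of Figure 3 can be sketched as follows. The resamplers
and the "core codec" are deliberately crude stand-ins chosen for
brevity (pair averaging, sample repetition and coarse rounding); they
are assumptions for illustration and do not reflect the actual IP-MR
algorithms.

```python
# Toy stand-ins only: the resamplers and "codec" below are placeholders
# chosen for brevity and do not reflect the IP-MR algorithms.

def downsample2(x):
    """16 kHz -> 8 kHz by averaging sample pairs (crude, no real filter)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]

def upsample2(x):
    """8 kHz -> 16 kHz by repeating each sample (crude)."""
    return [v for v in x for _ in range(2)]

def toy_core_codec(x, step=0.25):
    """Placeholder for CELP encode + local synthesis: coarse rounding."""
    return [round(v / step) * step for v in x]

def mixed_encode(frame16k):
    low = downsample2(frame16k)
    synth8k = toy_core_codec(low)
    synth16k = upsample2(synth8k)
    residual = [s - y for s, y in zip(frame16k, synth16k)]
    return synth8k, residual          # residual would go to the MDCT coder

frame = [0.1 * (i % 8) for i in range(320)]       # 20 ms at 16 kHz
synth, residual = mixed_encode(frame)
print(len(synth), len(residual))
```

Adding the residual back to the up-sampled synthesized signal restores
the input, which is why the decoder can mix the two decoded signals to
produce the joint output.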

The CELP-based and MDCT-based codecs are described in more detail below.

4.3. Scalable CELP-based encoder

The scalable CELP-based coder applied to speech coding consists of the
core (base layer) encoder and three enhancement encoders. Figure 4
shows the structure of the core encoder.

The Core Encoder codes speech in a "base frequency bandwidth" (up to
4 kHz) with speech quality close to "toll quality" and forms a coded
bit stream at the minimum average bit rate (about 6.0 kbps). The
current bit rate is driven by the information content of the input
speech and can vary from 4.3 kbps up to 10.35 kbps.

The Core Encoder performs LPC analysis and pitch detection, and
estimates the parameters of the pitch predictor and excitation by the
"analysis-by-synthesis" method on a subframe-by-subframe basis.
The subframe length is 5 ms.
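
For orientation, the figures quoted above translate into the following
per-frame quantities. This is plain arithmetic; the function name is
hypothetical, and the bit counts show what a 20 ms frame would hold if
it were coded at exactly the stated rate.

```python
# Illustrative arithmetic from the figures quoted above.
FRAME_MS = 20
SUBFRAME_MS = 5
CORE_RATE_KHZ = 8

def bits_per_frame(rate_bps):
    """Bits in one 20 ms frame at the given bit rate (bits per second)."""
    return rate_bps * FRAME_MS // 1000

print("subframes per frame:", FRAME_MS // SUBFRAME_MS)
print("samples per subframe:", SUBFRAME_MS * CORE_RATE_KHZ)
for rate_bps in (4300, 6000, 10350):
    print(rate_bps, "bps ->", bits_per_frame(rate_bps), "bits per frame")
```

So a core-layer frame spans four subframes of 40 samples each and
carries between 86 and 207 bits, with about 120 bits on average.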

The encoded parameters and bits are separated into six sensitivity
classes, from Class A to Class F, so that they can be given additional
protection against packet losses.

Class A contains the most perceptually significant bits. The bits of
this class should be delivered to the decoder in order to fully
exclude error propagation.

Class F contains the least significant bits. Together, classes A to F
contain all encoded parameters of the first (core) encoding layer.
These parameters are sufficient to synthesize speech with "toll
quality".

                                                                |
                                                           Input Speech
                                                            Fs=8 kHz
                                      +--------------+          |
                                      | LPC Analyzer +<---------+
                                      +------+-------+          |
                                             |                  |
        +------Codebook memory--+           LPC                 |
        |         vector update |           \|/                 |
       \|/                      |    +-------+-------+          |
    +---+------+                |    | LPC Quantizer +-LSFs->   |
    | Adaptive +--Pitch->       |    +------------+--+          |
+-->| Codebook |                |                 |             |
|   +------+---+                |                QLPC           |
|          |                    |                \|/            |
|          |                    |             +---+--------+    |
|          +-------------->(+)--+-Excitation->+ LPC-filter |    |
|                          /|\                +----+-------+    |
|         +-----------------+                      |            |
|  +------+---+                                  Synth.         |
+->|   Fixed  +                                  Speech         |
|  | Codebook +-Pulse information                  |            |
|  +----------+                                    |            |
|                                                 \|/           |
| +-------------+                                 (-)<----------+
+-+  Error      |                                  |
  |Minimization |                                  |
  |  Control    |                                  |
  +-------+-----+                                  |
         /|\                                       |
          |                                        |
          |       +------------+                   |
+---------+---+   | Perceptual |                   |
|    Error    |   | Weighting  +<------------------+
| Calculation +-->+   Filter   |                   |
+------+------+   +------------+                   |
                                              Residual 1
                                                   |
                                                  \|/


       Figure 4 Structural block diagram of CELP-based Core Encoder

      |
Pulse information                                             |
from previous layer                       |               Residual
      |                                   |                  of
     \|/                                  |           previous layer
+-----+------------+                      |               (Fs=8 kHz)
| Adaptive Pulse-  |                    QLPC                   |
| Position Control |                 from core layer           |
+------+-----------+                      |                    |
       |                                  |                    |
      \|/                                \|/                   |
+------+---------+     Enhancement  +-----+------+            \|/
| Fixed Codebook +----  Layer   --->+ LPC-filter +----------->(-)
+---+------------+    Excitation    +------------+             |
   /|\                                                         |
    | +--------------+  +-------------+  +------------+        |
    | |    Error     |  |   Error     |  | Perceptual |        |
    +-+ Minimization +<-+ Calculation +<-+ Weighting  +<-------+
      |   Control    |  +-------------+  |  Filter    |        |
      +--------------+                   +------------+    Residual of
                                                         current layer
                                                              \|/


      Figure 5 Structural block diagram of CELP-based Extension Encoder

The difference between the input speech and the speech synthesized by
the Core Encoder is passed on to extension coding. Each subsequent
Extension Encoder codes the residual delivered from the previous layer
and forms its own additional coded bit stream. The full bit stream is
therefore the sum of the base and extension bit streams. The number of
layers used for coding, which corresponds to the number of bit streams
in the sum at the encoder's output, can be changed "on the fly".

Each CELP Extension Encoder uses the results of the previous layer's
encoding and estimates additional excitation by the
"analysis-by-synthesis" method on a subframe-by-subframe basis
(Figure 5). There are three CELP Extension Encoders in total.
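
The layered principle (each layer codes what the previous layer left
over) can be illustrated with a toy multi-stage quantizer. Uniform
scalar quantizers stand in for the CELP core and extension encoders
here, and the step sizes are arbitrary; none of this is the actual
IP-MR excitation search.

```python
# Toy multi-stage coder: uniform scalar quantizers stand in for the
# CELP core and extension encoders. Step sizes are arbitrary.
STEPS = [0.5, 0.25, 0.125, 0.0625]      # core + 3 extension layers

def encode_layers(signal, steps=STEPS):
    """Each layer quantizes the residual left by the previous layer."""
    residual = list(signal)
    layers = []
    for step in steps:
        indices = [round(r / step) for r in residual]
        layers.append(indices)
        residual = [r - i * step for r, i in zip(residual, indices)]
    return layers

def decode_layers(layers, steps=STEPS):
    """Sum the contributions of however many layers were received."""
    out = [0.0] * len(layers[0])
    for indices, step in zip(layers, steps):
        out = [o + i * step for o, i in zip(out, indices)]
    return out

def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

signal = [0.37, -0.81, 0.12, 0.66, -0.29]
layers = encode_layers(signal)
for n in range(1, len(layers) + 1):
    print(n, "layer(s): max error",
          max_error(signal, decode_layers(layers[:n])))
```

Decoding only the first n layers still gives a valid signal, and in
this toy the error after layer n is bounded by half of that layer's
step size.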

4.4. Scalable CELP-based decoder
The decoder dequantizes the parameters of each encoding layer,
reconstructs the total excitation as the sum of the adaptive codebook
and fixed codebook (core and enhancement) contributions, and
synthesizes speech using the LPC filter. The reconstructed speech is
post-filtered and output to a 160-sample buffer (20 ms at 8 kHz).
Figure 6 shows the structure of the CELP-based decoder.
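
The excitation summation and LPC synthesis steps can be sketched as
below. The sketch assumes that the codebook vectors and gains are
already dequantized and that the LPC polynomial is written as
A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ...; both the sign convention and
the function names are assumptions, and the post filter is omitted.

```python
# Sketch under assumptions: vectors and gains are already dequantized,
# and A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ... (sign convention assumed).

def total_excitation(acb_vec, acb_gain, fcb_vecs, fcb_gains):
    """Gain-scaled adaptive contribution plus all fixed contributions."""
    exc = [acb_gain * v for v in acb_vec]
    for vec, gain in zip(fcb_vecs, fcb_gains):
        exc = [e + gain * v for e, v in zip(exc, vec)]
    return exc

def lpc_synthesis(exc, a, history=None):
    """All-pole filter 1/A(z): s[n] = e[n] - sum_k a[k] * s[n-1-k]."""
    past = list(history) if history else [0.0] * len(a)   # newest first
    out = []
    for e in exc:
        s = e - sum(ak * pk for ak, pk in zip(a, past))
        out.append(s)
        past = ([s] + past[:-1]) if past else past
    return out

exc = total_excitation([1.0, 0.0, 0.0], 0.5,
                       [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.5, 1.0])
print(lpc_synthesis(exc, a=[-0.5]))
```

With a single coefficient of -0.5 the unit impulse excitation decays
as 1.0, 0.5, 0.25, which is the expected impulse response of a
one-pole synthesis filter.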

                                                            |
                                                       LSF indices
                                                            |
                                                           \|/
-Acbk gain--------------+                            +------+------+
                       \|/                           |     LPC     |
        +----------+   +++                           | Dequantizer |
-Pitch->| Adaptive |-->+X+-----------+               +------+------+
        | Codebook |   +-+           |                      |
        +----------+                 |                    QLPC
                                     |                      |
-Fcbk 1 gain-------------------+     |                     \|/
                              \|/    |               +------+------+
---Pulse      +------------+  +++   \|/              |LPC Synthesis|
information-->+    Fixed   |->|X+-->(+)--Excitation->+    Filter   |
              | Codebook 1 |  +-+   /|\              +------+------+
              +------------+         |                      |
                     .               |                      |
                     .               |                     \|/
                     .               |               +------+------+
               +------------+        |               | Post Filter |
-Pulse         |  Fixed     |  +-+   |               +------+------+
Information n->+ Codebook n +->+X+->-+                      |
               +------------+  +++                      Synthesized
                               /|\                     Speech 8 kHz
                                |                           |
--Fcbk 2 gain-------------------+                          \|/



     Figure 6 Scalable CELP-based Decoder

The decoder is able to conceal lost frames (a PLC-like function) by
partially reconstructing speech from the speech parameters of the last
received frames. However, to provide the highest robustness to packet
loss, only the classes containing the most significant parameters
should be protected.

4.5. Scalable MDCT-based encoder

The scalable MDCT-based encoder operates on a frame basis in the MDCT
spectral domain. Quantized spectrum samples are written into the
bitstream.

                +------+   +-----------+  +-----------+
--Input signal->+ MDCT +-->+ Quantizer +->+ Bitstream +--Scalable
                +------+   +-----------+  | formatter |  bitstream-->
                                          +-----------+

                    Figure 7 Scalable MDCT-based Encoder

This approach is widely used in modern audio coding algorithms. The
main advantage of the developed compression scheme is its bitstream
formatter unit. It constructs the stream so that any initial part of
the compressed data can be decoded and used for reconstruction. In
other words, each initial part of a compressed frame carries
self-sufficient information about a band-limited signal at a given
level of accuracy.

The bitstream formatter unit operates on bands of eight samples each.
The coding loop iterates over all bands and transmits an update for a
given band. The loop ends when all spectrum bands are fully
transmitted.
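
As an illustration only, the band structure and the loop termination
condition described above can be sketched in Python. The function
names, the round-robin visiting order, and the fixed number of updates
per band are assumptions made for this sketch; the actual band
selection depends on the bits/sample ratio and is not reproduced here.

```python
BAND_SIZE = 8  # samples per band, as stated in the text


def split_into_bands(spectrum, band_size=BAND_SIZE):
    """Split spectrum samples into consecutive bands of band_size samples."""
    if len(spectrum) % band_size != 0:
        raise ValueError("spectrum length must be a multiple of band_size")
    return [spectrum[i:i + band_size]
            for i in range(0, len(spectrum), band_size)]


def coding_order(num_bands, updates_per_band):
    """Return (band, update) pairs in a hypothetical round-robin order.

    The loop ends once every band has received all of its updates,
    mirroring the termination condition of the coding loop.
    """
    sent = [0] * num_bands
    order = []
    while any(count < updates_per_band for count in sent):
        for band in range(num_bands):
            if sent[band] < updates_per_band:
                order.append((band, sent[band]))
                sent[band] += 1
    return order
```

For a 16-sample spectrum, split_into_bands() yields two bands, and
coding_order(2, 2) visits each band twice before the loop ends.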

  +-----------+
 / Spectrum  /
+-----+-----+
      |
     \|/
+-----+------+              +-----------------+
|    Start   +------------>/ numCodedBands=0 /
+-------+----+            +-----------------+
        |
       \|/
   +----+-------------+ no  +------------------+ yes +-----+
+->| chooseCodedBand()+---->+ isAllBandsCoded()+---->+ End |
|  +----+-------------+     +----+-------------+     +-----+
|    yes|                        |no
|      \|/                      \|/
| +-----+-------+   +------------+--+    +-----------------+
| | updateBand()+<--+ startNewBand()+--->+ numCodedBands++ |
| +-----+-------+   +----+----------+    +-----------------+
|       |                .
|       +................+
|       |
|      \|/
| +-----+-------------------+
| | applyCompressionModel() |
| +--------+----------------+
|          |
|         \|/
|  +-------+-----+          +--------------+
+->+ rangeCodec()+--------->+  bits/sample |
   +-----+-------+          +--------------+
        \|/
   +-----+------------+
   | Compressed frame |
   +------------------+

        Figure 8 Spectrum encoding loop

Bandwidth expansion (coding band increment) is based on the actual
bits/sample ratio, which is known to both the encoder and the decoder.
A coding band increment occurs only if the compression rate exceeds a
fixed threshold or all available bands are already fully encoded.
Practical experiments show that once the compression ratio exceeds
1.7 - 2 bits/sample, it is more reasonable to expand the bandwidth
than to update existing bands.
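
The decision rule above can be sketched as follows. This is a minimal
illustration, not the codec's implementation: the function and
parameter names are hypothetical, the ratio is assumed to be the
cumulative number of bits written divided by the number of samples
coded so far, and the threshold is set to the lower end of the
1.7 - 2 bits/sample range quoted in the text.

```python
EXPAND_THRESHOLD = 1.7  # bits/sample; the text reports 1.7 - 2 in practice


def bits_per_sample(bits_written, samples_coded):
    """Running bits/sample ratio; zero before any sample has been coded."""
    if samples_coded == 0:
        return 0.0
    return bits_written / samples_coded


def should_start_new_band(bits_written, samples_coded,
                          started_bands_full, bands_remaining,
                          threshold=EXPAND_THRESHOLD):
    """Decide whether to expand bandwidth instead of updating a band.

    started_bands_full: True if every band started so far is fully coded.
    bands_remaining:    True if at least one band has not been started.
    """
    if not bands_remaining:
        return False  # nothing left to expand into
    if started_bands_full:
        return True   # no existing band can be updated any further
    return bits_per_sample(bits_written, samples_coded) > threshold
```

Because the rule uses only quantities that both sides can count, the
decoder can repeat the same decision without side information.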

The band update procedure is based on a bit-plane data representation.
One bit-plane is issued per band at a time. In terms of binary planes,
this means that each update carries one bit of mantissa for each sample
of the band. The current implementation uses ternary planes instead of
conventional binary planes. This allows the encoder to reduce the
amount of noise introduced when only the top plane is transmitted.

The sign and the sample presence flag together form the top plane of a
particular band, which is transmitted first, when coding of that band
starts. The encoder keeps track of the planes transmitted for each band
and chooses the highest plane not yet transmitted for the next update.
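
The following Python sketch illustrates the conventional binary
bit-plane representation referred to above, not the ternary planes used
by the current implementation, whose details are not given here. The
sign/presence plane is not modeled either; the sketch works on
non-negative integer magnitudes, and all function names are
hypothetical.

```python
def to_bit_planes(magnitudes, num_planes):
    """Return bit-planes, most significant first, one bit per sample."""
    return [[(m >> p) & 1 for m in magnitudes]
            for p in range(num_planes - 1, -1, -1)]


def from_bit_planes(planes, num_planes):
    """Rebuild magnitudes from the planes received so far.

    Planes that were not received are treated as all zeros, so a
    truncated list gives a coarser approximation of the band.
    """
    values = [0] * len(planes[0]) if planes else []
    for index, plane in enumerate(planes):
        shift = num_planes - 1 - index
        values = [v | (bit << shift) for v, bit in zip(values, plane)]
    return values


def next_plane(planes_sent, num_planes):
    """Index of the highest plane not yet transmitted, or None if done."""
    return planes_sent if planes_sent < num_planes else None
```

For the eight-sample band [5, 0, 3, 7, 1, 6, 2, 4] with three planes,
decoding only the top plane gives [4, 0, 0, 4, 0, 4, 0, 4]; each
further plane refines the values until the band is exact.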

The encoder applies different statistical models and compression
schemes to different planes and bands. In practice, only the few top
planes (those following the sign/flag plane) are well suited for
compression, whereas all the others tend to have a random distribution
and effectively cannot be compressed at all. After the compression
scheme is applied, the raw data and the chosen statistical model are
passed to the range codec(1), which writes them into the bitstream.

4.6. Scalable MDCT-based decoder

The decoder performs the same operations as the encoder, but in reverse
order. First, the bitstream reader reconstructs the quantized spectrum
samples from the compressed frame; then the inverse quantizer
reconstructs the MDCT spectrum, and the inverse MDCT transforms the
signal back from the frequency domain to the time domain.

            +-----------+   +-----------+   +---------+
  Scalable  | Bitstream +-->+  Inverse  |   | Inverse +--Reconstructed
-bitstream->+  reader   |   | Quantizer +-->+   MDCT  |     signal  -->
            +-----------+   +-----------+   +---------+

        Figure 9 Scalable MDCT-based Decoder




(1) Range codec is a sort of arithmetic codec providing byte stream
    granularity.

The accuracy and bandwidth of the resulting signal depend on the amount
of available input data. The codec introduces no inter-frame data
dependency except the 50% time-domain overlap required by the MDCT. In
practice, this means that the signal cannot be correctly reconstructed
from the first successfully received compressed frame, but the second
frame will be reconstructed correctly.
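
The effect of the 50% overlap can be demonstrated with a generic
textbook MDCT. The sketch below is not the IPMR transform: the sine
window, the frame size, and all function names are assumptions chosen
for illustration, and no quantization is applied.

```python
import math


def sine_window(half):
    """Sine window of length 2*half; w[n]^2 + w[n+half]^2 = 1."""
    size = 2 * half
    return [math.sin(math.pi * (n + 0.5) / size) for n in range(size)]


def _basis(n, k, half):
    """MDCT cosine kernel for time index n and frequency index k."""
    return math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))


def mdct(block, window):
    """Windowed forward MDCT: 2*half time samples -> half coefficients."""
    half = len(block) // 2
    return [sum(block[n] * window[n] * _basis(n, k, half)
                for n in range(2 * half))
            for k in range(half)]


def imdct(coeffs, window):
    """Windowed inverse MDCT: half coefficients -> 2*half aliased samples."""
    half = len(coeffs)
    return [window[n] * (2.0 / half)
            * sum(coeffs[k] * _basis(n, k, half) for k in range(half))
            for n in range(2 * half)]


def overlap_add(signal, half):
    """Code each 50%-overlapped frame and overlap-add the decoded frames."""
    window = sine_window(half)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * half + 1, half):
        block = signal[start:start + 2 * half]
        frame = imdct(mdct(block, window), window)
        for n, value in enumerate(frame):
            out[start + n] += value
    return out


def max_error(a, b):
    """Largest absolute difference between two equally long sequences."""
    return max(abs(x - y) for x, y in zip(a, b))
```

With two consecutive frames decoded, the samples in the region where
they overlap match the input up to rounding error. The first half of
the first frame has no preceding frame to cancel its time-domain
aliasing, so it differs from the input.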

The bitstream reader decompresses the input stream using the inverse
range codec. Because the encoder and decoder operate synchronously,
each time the decoder runs the inverse range codec it uses exactly the
same context that the encoder used during compression. Stream parsing
ends when no more data is available for the compressed frame. The
following figure shows the spectrum decoding loop.

+------------------+
| Compressed frame |
+---+--------------+
    |
   \|/
 +--+----+          +-----------------+
 | Start +-------> / numCodedBands=0 /
 +---+---+        +-----------------+
     |
    \|/
 +---+---------------+  no           +-----+
 | isDataAvailable() +-------------->+ End |
 +----+--------------+               +-----+
   yes|
     \|/
 +----+----------------+ no +---------------------+     +-----+
 | chooseDecodedBand() +--->+ isAllBandsDecoded() +---->+ End |
 +---+-----------------+    +-----------+---------+     +-----+
  yes|                                  | no
     +----------------------------------+
     |
    \|/
 +---+----------+                +-------------+
 | rangeCodec() +-------------->/ bits/sample /
 |  (inverse)   |              +-------------+
 +----+---------+
      |
     \|/
 +----+-------------------+
 | applyCompressionModel()|
 |       (inverse)        |
 +-----+------------------+
       |
       +.........................+
      \|/                       \|/
 +-----+--------+     +----------+-----+    +-----------------+
 | updateBand() |     | startNewBand() +-->/ numCodedBands++ /
 | (inverse)    |     |   (inverse)    |  +-----------------+
 +--------+-----+     +------+---------+
          |                  |
         \|/                \|/
   +------+------------------+--------+
  /               Spectrum           /
 +----------------------------------+

     Figure 10 Spectrum decoding loop

Although the codec has no lower bitrate limit, the compression scheme
produces an artificial-sounding reconstructed signal if the
transmission rate is lower than 16-24 kbps. At low bitrates, the
presented audio codec is used in combination with the speech codec and
processes the speech codec residual.

5.  Security Considerations

   To Be Defined.

6.  Informative References

   [SILK] SILK Speech Codec Draft, https://developer.skype.com/silk?
          action=AttachFile&do=get&target=draft-vos-silk-00.txt

7.  IANA Considerations

   This document has no actions for IANA.

Authors' Addresses

   Vladimir Sviridenko
   SPIRIT DSP
   Solzhenitsina 27
   Moscow  109004
   Russia

   Phone: +7 495 661 2178
   Email: vladimirs@spiritdsp.com

   Sergey Ikonin
   SPIRIT DSP
   Solzhenitsina 27
   Moscow  109004
   Russia

   Phone: +7 495 661 2178
   Email: s.ikonin@gmail.com

   Dmitry Yudin
   SPIRIT DSP
   Solzhenitsina 27
   Moscow  109004
   Russia

   Phone: +7 495 661 2178
   Email: yudin@spiritdsp.com


Person & email address to contact for further information:
   Yury Morzeev
   morzeev@spiritdsp.com