CODEC                                                         C. Hoene
Internet Draft                                  Universitaet Tuebingen
Intended status: Informational                            June 3, 2011
Expires: December 2011


        Measuring the Quality of an Internet Interactive Audio Codec
                     draft-hoene-codec-quality-01.txt


Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on December 3, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.







Hoene                 Expires December 3, 2011                [Page 1]


Internet-Draft              Codec Quality                    June 2011


Abstract

   The quality of a codec has to be measured by multiple parameters
   such as audio quality, speech quality, algorithmic efficiency,
   latency, coding rates and their respective tradeoffs. During
   standardization, codecs are tested and evaluated multiple times to
   ensure a high quality outcome.

   As the upcoming Internet codec is likely to have unique features,
   there is a need to develop new quality testing procedures to measure
   these features. Thus, this draft reviews existing methods for
   measuring a codec's qualities, proposes a couple of new methods, and
   gives suggestions that may be used for testing the Internet
   Interactive Audio Codec (IIAC).

   This document is work in progress.

Conventions used in this document

   In this document, equations are written in Latex syntax. An equation
   starts with a dollar sign and ends with a dollar sign. The text in
   between is an equation following the notation of Latex Version 2e.
   In the PDF version of this document, as a courtesy to its readers,
   all Latex equations are already rendered.

Table of Contents

   Conventions used in this document ............................... 2
   1. Introduction ................................................. 4
   2. Optimization Goal ............................................ 6
   3. Measuring Speech and Audio Quality ........................... 7
      3.1. Formal Subjective Tests ................................. 7
         3.1.1. ITU-R Recommendation BS.1116-1 ..................... 7
         3.1.2. ITU-R Recommendation BS.1534-1 (MUSHRA) ............ 8
         3.1.3. ITU-T Recommendation P.800 ......................... 8
         3.1.4. ITU-T Recommendation P.805 ......................... 8
         3.1.5. ITU-T Recommendation P.880 ......................... 9
         3.1.6. Formal Methods Used for Codec Testing at the ITU ... 9
      3.2. Informal Subjective Tests ............................... 9
      3.3. Interview and Survey Tests .............................. 9
      3.4. Web-based Testing ...................................... 10
      3.5. Call Length and Conversational Quality ................. 10
      3.6. Field Studies .......................................... 12
      3.7. Objective Tests......................................... 13
         3.7.1. ITU-R Recommendation BS.1387-1 .................... 14
         3.7.2. ITU-T Recommendation P.862 ........................ 14
         3.7.3. ITU-T Draft P.OLQA ................................ 15


   4. Measuring Complexity ........................................ 15
      4.1. ITU-T Approaches to Measuring Algorithmic Efficiency ... 15
      4.2. Software Profiling ..................................... 17
      4.3. Cycle Accurate Simulation .............................. 18
      4.4. Typical run time environments .......................... 19
   5. Measuring Latency ........................................... 19
      5.1. ITU-T Recommendation G.114 ............................. 20
      5.2. Discussion ............................................. 20
   6. Measuring Bit and Frame Rates ............................... 21
   7. Codec Testing Procedures Used by Other SDOs ................. 22
      7.1. ITU-T Recommendation P.830 ............................. 22
      7.2. Testing procedure for the ITU-T G.719 .................. 24
   8. Transmission Channel ........................................ 25
      8.1. ITU-T G.1050: Network Model for Evaluating Multimedia
      Transmission Performance over IP (11/2007) .................. 26
      8.2. Draft G.1050 / TIA-921B ................................ 27
      8.3. Delay and Throughput Distributions on the Global Internet 27
      8.4. Transmission Variability on the Internet ............... 30
      8.5. The Effects of Transport Protocols ..................... 30
      8.6. The Effect of Jitter Buffers and FEC ................... 33
      8.7. Discussion ............................................. 33
   9. Usage Scenarios ............................................. 34
      9.1. Point-to-point Calls (VoIP) ............................ 34
      9.2. High Quality Interactive Audio Transmissions (AoIP) .... 35
      9.3. High Quality Teleconferencing .......................... 35
      9.4. Interconnecting to Legacy PSTN and VoIP (Convergence) .. 36
      9.5. Music streaming......................................... 36
      9.6. Ensemble Performances over a Network ................... 36
      9.7. Push-to-talk like Services (PTT) ....................... 37
      9.8. Discussion ............................................. 38
   10. Recommendations for Testing the IIAC ....................... 38
      10.1. During Codec Development .............................. 38
      10.2. Characterization Phase ................................ 39
         10.2.1. Methodology ...................................... 39
         10.2.2. Material ......................................... 39
         10.2.3. Listening Laboratory ............................. 40
         10.2.4. Degradation Factors .............................. 40
      10.3. Application Developers ................................ 41
      10.4. Codec Implementers .................................... 42
      10.5. End Users ............................................. 42
   11. Security Considerations .................................... 42
   12. IANA Considerations......................................... 42
   13. References ................................................. 43
      13.1. Normative References .................................. 43
      13.2. Informative References ................................ 43
   14. Acknowledgments ............................................ 48



1. Introduction

   The IETF Working Group CODEC is standardizing an Internet
   Interactive Audio and Speech Codec (IIAC). If the codec is to be of
   high quality, it is important to measure the codec's quality
   throughout the entire process of development, standardization, and
   usage. Thus, this document supports the standardization process by
   providing an overview of quality metrics, quality assessment
   procedures, and other quality control issues and gives suggestions
   on how to test the IIAC.

   Quality must be measured by the following stakeholders and in the
   following phases of the codec's development:

   o Codec developers must decide on different algorithms or parameter
      sets during the development and enhancement of a codec. This
      might also include the selection among multiple codec candidates
      that implement different algorithms; however, the CODEC WG bases
      its work on common consensus, not on a competitive selection of
      one of multiple codec contributions. Thus, measuring the quality
      of codecs in order to select one might not be required.
      Besides selection, one is obliged to debug the codec software. To
      find errors and bugs - and programming mistakes are present in
      any complex software - the developer has to test the software by
      conducting quality measurements.

   o Typically, codec standardization includes a qualification phase
      that measures the performance of a codec and verifies whether it
      conforms to predefined quality requirements. In the
      qualification phase, it becomes obvious whether the codec
      development and standardization have been successful. Again, in
      the process of rigorous testing during the qualification phase,
      algorithmic weaknesses and bugs in the implementation may be
      found. Still, in complex software such as the IIAC, correctness
      cannot be proven or guaranteed.

   o Users of the codec need to know how well the codec is performing,
      while manufacturers need to decide whether to include the IIAC
      in their products. Quality measures play an important role in
      this decision process. Also, the numerous quality measurement
      results help developers of VoIP systems to dimension or tune
      their systems to take optimal advantage of a codec. For example,
      during network planning, operators can predict the amount of
      bandwidth needed for high quality voice calls.
      An adaptive VoIP application needs to know which quality is
      achieved with different codec parameter sets to be able to make
      an optimal selection of the codec parameters under varying
      network conditions.
      As suggested in [50], an RTP payload specification for an IIAC
      codec should include a rate control. Similar to the performance
      of the codec, the rate control unit has a big impact on the
      overall quality of experience. Thus, it should be tested well,
      too.

   o Software implementers need to verify whether their particular
      codec implementation, which might be optimized for a specific
      platform, conforms to the standard's reference implementation.
      This is particularly important as some intellectual property
      rights might only be granted if the codec conforms to the
      standard.
      As the IIAC is not required to be bit-exact, which would allow
      simple comparisons of correctness, other means of conformance
      testing must be applied.
      In addition, the standard conformance and interoperability of
      multiple implementations must be checked.
      Last but not least, implementers may implement optimized
      concealment algorithms, jitter buffers, or other algorithms.
      Those algorithms have to be tested, too.

   o Since the success of MP3, end users have been aware of the value
      of a high quality codec. It would make sense to use the IIAC in
      a brand marketing campaign (such as "Intel inside"). A quality
      comparison between the IIAC and other codecs might be part of
      the marketing. Online testing with user participation might also
      raise the awareness level.

   All those stakeholders might have different requirements regarding
   the codec's quality testing procedures. Thus, this document tries to
   identify those requirements and shows which of the existing quality
   measurement procedures can be applied to fulfill those specific
   demands efficiently.




   In the following section we describe the primary optimization goal:
   Quality of Experience (QoE). Next, we briefly list the most common
   methods of performing subjective evaluations of speech and audio
   quality. In Sections 4, 5, and 6, we discuss how to measure
   complexity, latency, and bit and frame rates. Section 7 describes
   how other SDOs have measured the quality of their codecs. Compared
   to previously standardized codecs, the IIAC is likely to have
   different, unique requirements and thus needs newly developed
   quality testing procedures. To support their development, in
   Section 8 we describe the properties of Internet transmission
   paths. Section 9 summarizes the usage scenarios for which the codec
   is going to be used, and finally, in Section 10, we recommend
   procedures on how to test the IIAC.

2. Optimization Goal

   The aim of the Codec WG is to produce a codec of high quality.
   However, how can quality be measured? The measurement of the
   features of a codec can be based on many different criteria. Those
   include complexity, memory consumption, audio quality, speech
   quality, and others. But in the end, it's the users' opinions that
   really count since they are the customers. Thus, one important - if
   not the most important quality measure of the IIAC - shall be the
   Quality of Experience (QoE).

   ITU-T Recommendation P.10/G.100 [22] defines the term "Quality
   of Experience" as "the overall acceptability of an application or
   service, as perceived subjectively by the end-user." The ITU-T
   document G.RQAM [21] extends this definition by noting that "quality
   of experience includes the complete end-to-end system effects
   (client, terminal, network, services infrastructure, etc.)" and that
   the "overall acceptability may be influenced by user expectations
   and context".

   These definitions already give guidelines on how to judge the
   quality of the IIAC:

   o The acceptability and the subjective quality impression of end
      users have to be measured (Section 3).

   o The IIAC codec has to be tested as part of an entire
      telecommunication system. It must be carefully considered whether
      to measure the codec's performance just in a stand-alone setup or
      to evaluate it as part of the overall system (Section 8).





   o The environments and contexts of particular communication
      scenarios have to be considered and controlled because they have
      an impact on the human rating behavior and on quality
      expectations and requirements (Section 9).

3. Measuring Speech and Audio Quality

   The perceived quality of a service can be measured by various
   means. If humans are questioned, those quality tests are called
   subjective. If the tests are conducted by instrumental means (such
   as an algorithm), they are called objective. Subjective tests are
   divided into formal and informal tests. Formal tests follow
   strictly defined procedures and methods and typically include a
   large number of subjects. Informal tests are less precise because
   they are conducted in an uncontrolled manner.

3.1. Formal Subjective Tests

   Formal subjective tests must follow a well-defined procedure.
   Otherwise, the results of multiple tests cannot be mutually
   compared and are not repeatable. Most subjective testing procedures
   have been standardized by the ITU. If applied to codec testing, the
   testing procedures follow the same pattern [26]:

     "Performing subjective evaluations of digital codecs proceeds
     via a number of steps:

        o Preparation of source speech materials, including recording of
          talkers;

        o Selection of experimental parameters to exercise the features
          of the codec that are of interest;

        o Design of the experiment;

        o Selection of a test procedure and conduct of the experiment;

        o Analysis of results."

   The ITU has standardized different formal subjective tests to
   measure the quality of speech and audio transmission, which are
   described in the following.

3.1.1. ITU-R Recommendation BS.1116-1

   The ITU-R BS.1116-1 standard [14] is suited for audio items with
   small degradations (stimuli) and uses a continuous scale from
   imperceptible (5.0) to very annoying (1.0). It is a double-blind
   triple-stimulus test with hidden reference; grading must be done
   for both the degraded sample and the hidden reference. In a 30-
   minute session, 10-15 sample items can be judged. Overall, about 20
   subjects shall rate the items. Testing shall take place with
   loudspeakers in a controlled environment or with headphones in a
   quiet room.

3.1.2. ITU-R Recommendation BS.1534-1 (MUSHRA)

   The ITU-R BS.1534-1 standard [16] defines a method for the
   subjective assessment of intermediate quality levels. Multiple
   audio stimuli are compared at the same time. At most 12, but
   preferably only 8, stimuli plus a hidden reference and an anchor
   are compared and judged. MUSHRA uses a continuous quality scale
   (CQS) ranging from 0 to 100, divided into five equal intervals
   ("bad" to "excellent"). In 30 minutes, about 42 stimuli can be
   tested. Again, 20 test subjects shall rate the items with either
   headphones or loudspeakers.

   The standard recommends using as lower anchor a low-pass filtered
   version with a bandwidth limit of 3.5 kHz. Additional anchors are
   recommended, especially if specific distortions are to be tested.

3.1.3. ITU-T Recommendation P.800

   ITU-T P.800 defines multiple testing procedures to assess the
   speech quality of telephone connections. The most important
   procedure assesses the listening-only speech quality of telephone
   connections. Listeners rate short groups of unrelated sentences.
   The listeners are taken from the normal telephone-using population
   (no experts). They use a typical sending system (e.g. a local
   telephone) that may follow "modified IRS" frequency
   characteristics. The result is given on the listening-quality
   scale, which is an absolute category scale (ACS) ranging from
   excellent=5 to bad=1. Listeners can judge about 54 stimuli within
   30 minutes.

   Other tests described in P.800 measure listening effort, loudness
   preference, conversational opinion and difficulty, detectability,
   degradation, or minimal differences.

3.1.4. ITU-T Recommendation P.805

   The P.805 standard [24] extends P.800 and defines precisely how to
   measure conversational quality. Subjects have to conduct
   conversation tests to evaluate the communication quality of a
   connection. Expert, experienced, or untrained (naive) subjects have
   to do these tests collaboratively in soundproof cabinets.
   Typically, 6 transmission conditions can be tested within 30
   minutes. Depending on the required precision, these tests have to
   be made 20 to 40 times.

3.1.5. ITU-T Recommendation P.880

   To measure time-variable distortion, a continuous evaluation of
   speech quality has been defined in P.880 [31]. Subjects have to
   assess transmitted speech quality consisting of long speech
   sequences with quality/time fluctuations. The quality is rated on a
   continuous scale ranging from Excellent=5 to Bad=1, and the rating
   can be changed dynamically over time while the stimuli are played.
   Stimuli have a length of between 45 seconds and 3 minutes.

3.1.6. Formal Methods Used for Codec Testing at the ITU

   In the past years, new narrowband and wideband codecs have been
   tested using ITU-T P.800 (and ITU-T P.830). For the ITU-T G.719
   standard, which supports audio content in addition to speech, the
   ITU-R BS.1116-1 testing method was applied during the selection of
   potential codec candidates. During the qualification phase, the
   method used was ITU-R BS.1534-1. For the ITU-T G.718 codec, the
   Absolute Category Rating (ACR) following ITU-T P.800 was applied.

3.2. Informal Subjective Tests

   Besides formal tests, informal subjective tests following less
   stringent conditions may be conducted to judge the quality of
   stimuli. However, informal tests cannot be easily verified and lack
   the reliability, accuracy, and precision of formal tests. Informal
   tests are needed if the number of available subjects who are able
   to conduct the tests is low, or if time or money is limited.

3.3. Interview and Survey Tests

   Interview and survey tests are described in ITU-T P.800 [23] and in
   [9]. P.800 states that "if the rather large amount of effort needed
   is available and the importance of the study warrants it,
   transmission quality can be determined by 'service observations'."

   These service observations are based on statistical surveys common
   in social science and marketing research. Typically, the questions
   asked in a survey are structured.




   In addition, according to [23]: "To maintain a high degree of
   precision a total of at least 100 interviews per condition is
   required. A disadvantage of the service-observation method for many
   purposes is that little control is possible over the detailed
   characteristics of the telephone connections being tested."

3.4. Web-based Testing

   With the large-scale proliferation of the Internet, researchers
   have suggested testing speech or audio quality on web sites with
   the help of web site visitors [43]. A current web site that
   compares multiple audio codecs has been set up at SoundExpert.org
   [42]. On this web site, a user can download an audio item that
   consists of a reference item and a degraded item. Then, the user
   must identify the reference and rate the ODG of the degraded item.
   The tests are single-blind as the user does not know which codec he
   is currently rating.

   One can anticipate that the visitors of web sites will use similar
   equipment for testing of audio samples and for conducting VoIP
   calls. Thus, web site testing can be made realistic in a way that
   considers the impact of (typically used) loudspeakers and
   headphones.

   However, currently used web sites lack a proper identification of
   outliers. Thus, the ratings of all users are considered despite the
   fact that they might be (deliberately) faked or that subjects might
   not be able to hear the acoustic difference well. Thus, one can
   expect that web-based ratings will show a high degree of variation
   and that many more tests are needed to achieve the same confidence
   that is gained within formal tests. A thorough scientific study on
   the quality of web-based audio rating has not yet been published.
   Thus, any statements on the validity of web-based rating are
   premature.

3.5. Call Length and Conversational Quality

   In the ETSI technical report ETR-250 [6], a model is presented
   that discusses various impairments caused in narrowband telephone
   systems. The ETSI model describes the combinatorial effect of all
   those impairments. The ETSI model later became the well-known E-
   Model described in ITU-T G.107. Both the ETSI model and the E-Model
   calculate an R factor that ranges from 0 (bad) to 100 (excellent
   conversational quality).

   Based on the R factor, the users' reaction to the voice
   transmission quality of a connection can be predicted. For example,
   Section 8.3 describes the effect that users terminate the call if
   the quality is bad. More precisely, the authors summarize it as
   users who "(i) terminate their calls unusually early, (ii) re-dial
   or even (iii) actually complain to the network operator".

   In the ETSI model, the percentage of users "terminating calls
   early", TME, is given as

     $TME=100\cdot erf\left(\frac{36-R}{16}\right)\%$

   with $erf(X)$ being the sigmoid shaped Gaussian error function and
   $R$ the R Factor of the E-Model (Figure 1). This relation is based
   on results from "AT&T Long toll" interviews as cited in [2].

   These findings have been confirmed by Holub et al. [12], who have
   studied the correlation between call length and narrowband speech
   quality. Birke et al. [1] have also studied the duration of phone
   calls, which varies with the time of day and the day of the week
   and may also be affected by pricing schemes.

             100 -+TME.                               +- 5
                  |..iii.                             |
        T         |    .ii                            |
        e         |      ii                        MOS|
        r         |       i.                     .iiii|
        m     80 -+       .i.                  .ii.   |
        i         |        .i                .ii.     +- 4
        n         |         i.              .i.       |      M
        a         |         .i            .ii.        |      O
        t         |          i.          .i.          |      S
        e     60 -+          .i         .i.           |      |
                  |           i.       ii.            |      C
        E         |           .i     .ii              +- 3   Q
        a         |            i.   .i.               |      E
        r     40 -+            .i  .i.                |
        l         |             i..i.                 |
        y         |             .ii.                  |
                  |            .il.                   |
        (         |           .i..i                   +- 2
        T     20 -+          .i.  i.                  |
        M         |        .ii.   .i.                 |
        E         |      .ii.      .i.                |
        )         |    .ii.         .ii.              |
                  |MOSlii.           .iiiiiiiiiiiiiTME|
               0 -+-----------------+-----------------+- 1
                  |                 |                 |
                  0                 50               100

                                R Factor

     Figure 1 - Relation between calls terminating early, the R Factor,
                 and the speech quality given in (MOS-CQE)

   Whereas bad quality is related to short calls, it remains unproven
   whether better quality (>4 MOS) results in longer phone calls.
   There are two factors which might have opposite effects on the call
   length. On the one hand, if the quality is superb, the talkers
   might be more willing to talk because of the pleasure of talking;
   on the other hand, they might fulfill their conversational tasks
   faster because of the great quality. Thus, depending on the
   context, good speech quality might result either in longer or
   shorter calls.

3.6. Field Studies

   Field studies can be conducted if usage data on calls are collected.
   Field studies are useful to monitor real user behavior and to
   collect data about the actual conversational context.


   Because of highly varying conditions, the variance of those
   measurements is high, and many tests have to be done to obtain
   significantly different measurement values. Also, the tests are not
   repeatable because the conditions change over time.

   For example, Skype has done quality tests in a deployed VoIP system
   in the field with its users as testers [47]. The subjective tests
   are done in the following manner.

   o Download of test vectors to VoIP clients. Typically, this can be
      done with an automated software update.

   o Delivery of changing VoIP configurations (such as the codecs
      used) so that different calls are subjected to different
      configurations. The selection of configurations can be done
      randomly, alternating in time, or based on other criteria.

   o Collecting feedback from the users. For example, the following
      parameters can be monitored or recorded:

        o The call length and other call specific parameters

        o A user's quality voting (e.g. MOS-ACR) after the call

        o Other feedback of the user (e.g. via support channels)

   The field tests have the benefit of being conducted under real
   conditions with the real users. However, they have some drawbacks.
   First, the experimental conditions cannot be controlled well.
   Second, the tests are only valid for the current situations and do
   not allow predictions for other use cases. Third, the statistical
   significance might be largely questionable if confidence intervals
   are overlapping.

   The costs for running the tests are low because the users are doing
   the tests for free. However, the operator might lose users after a
   user experienced a test case causing bad quality.

3.7. Objective Tests

   Objective tests, also called instrumental tests, try to predict
   human rating behavior with mathematical models and algorithms. They
   calculate quality ratings for a given set of audio items.
   Naturally, they do not rate as precisely as the human subjects whom
   they try to simulate. However, the results are repeatable and less
   costly than formal subjective testing campaigns. Instrumental
   methods have a limited precision, which means that their quality
   ratings do not perfectly match the results of formal listening-only
   tests. Typically, the agreement between formal results and
   instrumental calculations is quantified with a correlation
   coefficient. The resulting metric is given as R, ranging from 0 (no
   correlation) to 1 (perfect match).
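
   The correlation metric R mentioned above is commonly computed as
   the Pearson correlation coefficient between subjective scores and
   instrumental predictions. A minimal, self-contained sketch follows;
   the MOS values below are made up purely for illustration.

```python
import math

def pearson_r(subjective, objective):
    """Pearson correlation coefficient between formal (subjective)
    scores and instrumental (objective) predictions."""
    n = len(subjective)
    mean_s = sum(subjective) / n
    mean_o = sum(objective) / n
    cov = sum((s - mean_s) * (o - mean_o)
              for s, o in zip(subjective, objective))
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    var_o = sum((o - mean_o) ** 2 for o in objective)
    return cov / math.sqrt(var_s * var_o)

# Hypothetical MOS values from a listening test vs. the model's
# predictions for the same conditions (illustrative numbers only).
mos_subjective = [4.2, 3.8, 2.5, 1.9, 3.1]
mos_objective = [4.0, 3.6, 2.8, 2.1, 3.3]
print(round(pearson_r(mos_subjective, mos_objective), 3))
```

   A value close to 1 indicates that the instrumental method ranks and
   spaces the conditions much like the human listeners did.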

   Over the last years, several objective evaluation algorithms have
   been developed and standardized. We describe them briefly in the
   following.

3.7.1. ITU-R Recommendation BS.1387-1

   The ITU developed an algorithm that is called Perceptual Evaluation
   of Audio Quality (PEAQ). It was published in the document ITU-R
   BS.1387 called Method for objective measurements of perceived audio
   quality in 1998 [15]. PEAQ is intended to predict the quality rating
   of low-bit-rate coded audio signals. Two different versions of PEAQ
   are provided: a basic version with lower computational complexity
   and an advanced version with higher computational complexity.

   PEAQ calculates a quality grading called the "Objective Difference
   Grade" (ODG), ranging from 0 to -4. Typically, it shows a
   prediction quality of between R=0.85 and 0.97 when compared to
   subjective testing results. The ITU-T Study Group 12 assumes that
   PEAQ can detect audible differences between two implementations of
   the same codec [5].

3.7.2. ITU-T Recommendation P.862

   The ITU-T PESQ algorithm [27] is intended to judge distortions
   caused by narrow-band speech codecs and other kinds of channel and
   transmission errors. These include variable delays, filtering, and
   short localized distortions such as those caused by frame loss
   concealment. For a large number of conditions, the validity and
   precision of PESQ have been proven. For untested distortions, prior
   subjective tests must be conducted to verify whether PESQ judges
   these kinds of distortions precisely. Also, it is recommended to
   use PESQ for 3.1 kHz (narrow-band) handset telephony and narrow-
   band speech codecs only. For wide-band operation, a modified filter
   has to be applied prior to the tests.

   Furthermore, the ITU-T Recommendation P.862.1 [28] describes how to
   map PESQ's raw scores, which range from -0.5 to 4.5, to MOS-LQO
   values similar to those gathered from ACR ratings. On a large
   corpus of test samples, the MOS-LQO ratings show a correlation of
   R=0.879 with subjective results (compared to R=0.876 for the raw
   PESQ scores). The
   ITU-T Recommendation P.862.2 [29] modifies the PESQ algorithm
   slightly to support wideband operation. Finally, the ITU-T
   Recommendation P.862.3 [30] gives detailed hints and
   recommendations on how and when to use the PESQ algorithms.

3.7.3. ITU-T Draft P.OLQA

   The soon-to-be standardized algorithm P.OLQA [40] extends PESQ and
   will be able to rate narrow-band to super-wideband speech and the
   effect of time-varying speech playout. The latter distortions are
   common in modern VoIP systems, which stretch and shrink the speech
   playout during voice activity to adapt it to the delay process of
   the network.

4. Measuring Complexity

   Besides audio and speech quality, the complexity of a codec is of
   prime importance. Knowing the algorithmic efficiency is important
   because:

   o  the complexity has an impact on power consumption and system
      costs,

   o  hardware can be selected to fit pre-known complexity
      requirements, and

   o  different codec proposals can be compared if they show similar
      performance in other aspects.

   Before any complexity comparisons can be made, one has to agree on
   an objective, precise, reliable, and repeatable metric for
   measuring algorithmic efficiency. In the following, we list three
   different approaches.

4.1. ITU-T Approaches to Measuring Algorithmic Efficiency

   Over the last 17 years, the ITU-T Study Group 16 has measured the
   complexity of codecs using a library called ITU-T Basic Operators,
   described in ITU-T G.191 [19], which counts the kind and number of
   operations and the amount of memory used. The latest version of the
   standard supports both fixed-point operations of different widths
   and floating-point operations. Each operation can be counted
   automatically and weighted accordingly. The following source code
   is an [edited] excerpt from the source file baseop32.h:

   /* Prototypes for basic arithmetic operators */

   /* Short add,           1   */
   Word16 add (Word16 var1, Word16 var2);

   /* Short sub,           1   */
   Word16 sub (Word16 var1, Word16 var2);

   /* Short abs,           1   */
   Word16 abs_s (Word16 var1);

   /* Short shift left,    1   */
   Word16 shl (Word16 var1, Word16 var2);

   /* Short shift right,   1   */
   Word16 shr (Word16 var1, Word16 var2);

   ...

   /* Short division,       18  */
   Word16 div_s (Word16 var1, Word16 var2);

   /* Long norm,             1  */
   Word16 norm_l (Word32 L_var1);

   In the upcoming ITU-T G.GSAD standard, another approach has been
   used, as shown in the following code example. For each operation,
   WMOPS counting functions have been added, which count the number of
   operations. To measure the efficiency of an algorithm, the program
   is run and the operations are counted for a known input length.

   for (i=0; i<NUM_BAND; i++)
   {
   #ifdef WMOPS_FX
       move32();move32();
       move32();move32();
   #endif
       state_fx->band_enrg_long_fx[i] = 30;
       state_fx->band_enrg_fx[i] = 30;
       state_fx->band_enrg_bgd_fx[i] = 30;
       state_fx->min_band_enrg_fx[i] = 30;
   }

4.2. Software Profiling

   The previously described methods are well-established procedures
   for measuring computational complexity. Still, they have some
   drawbacks:

   o Existing algorithms must be modified manually to include
      instructions that count arithmetic operations. In complex codecs,
      this may take substantial time.

   o The CPU model is simple: it does not consider memory access
      (e.g. caches), parallel execution, or other kinds of
      optimization done in modern microprocessors and compilers. Thus,
      the number of instructions might not correlate with the actual
      execution time on modern CPUs.

   Thus, instead of counting instructions manually, run times of the
   codec can be measured on a real system. In software engineering,
   this is called profiling. The Wikipedia article on profiling [54]
   explains profiling as follows:

     "In software engineering, program profiling, software profiling or
     simply profiling, a form of dynamic program analysis (as opposed
     to static code analysis), is the investigation of a program's
     behavior using information gathered as the program executes. The
     usual purpose of this analysis is to determine which sections of a
     program to optimize - to increase its overall speed, decrease its
     memory requirement or sometimes both.

        o  A (code) profiler is a performance analysis tool that, most
          commonly, measures only the frequency and duration of
          function calls, but there are other specific types of
          profilers (e.g. memory profilers) in addition to more
          comprehensive profilers, capable of gathering extensive
          performance data

        o  An instruction set simulator which is also - by necessity - a
          profiler, can measure the totality of a program's behaviour
          from invocation to termination."

   Thus, a typical profiler such as GNU gprof can be used to measure
   and understand the complexity of a codec implementation. The
   results are realistic precisely because the codec is run on a
   modern computer. However, the execution times depend on the CPU
   architecture, the PC in general, the OS, and other programs running
   in parallel.

   To ensure repeatable results, the execution environment (i.e. the
   computer) must be standardized. Otherwise, the run-time results
   cannot be verified by other parties, as they may differ if obtained
   under slightly different conditions.

4.3. Cycle Accurate Simulation

   If reliable and repeatable results are needed, another similar
   approach can be chosen. Instead of run times, CPU clock cycles on a
   virtual reference system can be measured. Quoting Wikipedia again
   [52]:

     "A Cycle Accurate Simulator (CAS) is a computer program that
     simulates a microarchitecture cycle-accurate. In contrast
     an instruction set simulator simulates an Instruction Set
     Architecture usually faster but not cycle-accurate to a specific
     implementation of this architecture."

   With a cycle accurate simulator, the execution times are precise and
   repeatable for the system that is being studied. If two parties make
   measurements using different real computers, they still get the same
   results if they use the same CAS.

   A cycle accurate simulator is slower than the real CPU by a factor
   of about 100. Also, it might have a measurement error as compared to
   the simulated, real CPU because the CPU is typically not perfectly
   modeled.

   If an x86-64 architecture is to be simulated, the open-source cycle
   accurate simulator PTLsim can be considered [55]. PTLsim simulates
   a Pentium 4. On their website, the authors of PTLsim write:

     "PTLsim is a cycle accurate x86 microprocessor simulator and
     virtual machine for the x86 and x86-64 instruction sets. PTLsim
     models a modern superscalar out of order x86-64 compatible
     processor core at a configurable level of detail ranging from
     full-speed native execution on the host CPU all the way down
     to RTL level models of all key pipeline structures."

   Another cycle accurate simulator, called FaCSim, simulates the
   ARM9E-S processor core and ARM926EJ-S memory subsystem [36]. It is
   also available as open source. Texas Instruments also provides a
   CAS for its C64x+ digital signal processor [44].

   To have a metric that is independent of a particular architecture,
   the results of cycle accurate simulators could be combined.

4.4. Typical Run Time Environments

   The IIAC codec will run on various platforms with quite diverse
   properties. After discussions on the WG mailing list, a few typical
   run time environments have been identified.

   Three of the run time environments are end devices (aka phones).
   The first one is a PC, either stationary or portable, having a
   >2 GHz CPU, >2 GByte of RAM, and a hard disk for permanent storage.
   Typically, a Windows, MacOS or Linux operating system is running on
   a PC. The second one is a smartphone, for example with an ARM11 500
   MHz CPU, 192 MByte RAM and 256 MByte flash ROM. An example is the
   HTC Dream smartphone equipped with a Qualcomm MSM7201A chip.
   Various operating systems are found on those devices, such as
   Symbian, Android, and iOS. The last ones are high-end stationary
   VoIP phones with, for example, a 275-MHz MIPS32 CPU (with 400
   DMIPS) and a 125-MHz (250 MIPS) ZSP DSP with dual MAC. They have
   more than 1 MByte of RAM and flash ROM. An exemplary chip is the
   BCM1103 [3].

   Besides phones, VoIP gateways are frequently needed for
   conferencing or for transcoding to legacy VoIP or the PSTN. In this
   case, two different platforms have been identified. The first one
   is based on standard PC server platforms. It consists, for example,
   of an Intel six-core Xeon 54XX or 55XX, two 1 Gbit/s NICs, 12 GByte
   RAM, hard disks, and a Linux operating system. Such a server can
   serve from 400 to 10000 calls depending on the conference mode, the
   codecs used, and the ability to use pre-encoded audio [46]. On the
   other hand, high-density, highly optimized voice gateways use a
   special-purpose hardware platform, for example TNETV3020 chips
   consisting of six TI C64x+ DSPs with 5.5 MB internal RAM. If they
   run a Telogy conference engine, they might serve about 1300 AMR or
   3000 G.711 calls per chip [45].

5. Measuring Latency

   Latency is a measure of the time delay experienced in a system.
   Latency can be measured as one-way delay or as round-trip time. The
   latter is the one-way latency from the source to the destination
   plus the one-way latency from the destination back to the source.
   Latency can be measured at multiple positions, at the network layer
   or at higher layers [53].

   As we aim to increase the Quality of Experience, the mouth-to-ear
   delay is of importance because it directly correlates with
   perceptual quality [17]. More precisely, the acoustic round-trip
   time shall be the quantity to optimize when studying interactive
   and conversational application scenarios.

5.1. ITU-T Recommendation G.114

   The G.114 standard [45] gives guidelines on how to estimate one-way
   transmission delays. It describes how the delay introduced by the
   codec arises. Because most encoders process audio in frames, the
   duration of a frame (the "frame size") is the foremost contributor
   to the overall algorithmic delay. Citing [18]:

     "In addition, many coders also look into the succeeding frame to
     improve compression efficiency. The length of this advance look is
     known as the look-ahead time of the coder. The time required to
     process an input frame is assumed to be the same as the frame
     length since efficient use of processor resources will be
     accomplished when an encoder/decoder pair (or multiple
     encoder/decoder pairs operating in parallel on multiple input
     streams) fully uses the available processing power (evenly
     distributed in the time domain). Thus, the delay through an
     encoder/decoder pair is normally assumed to be:"

   $2*frameSize + lookAhead$

   In addition, if the link speeds are low, the serialization delay
   might contribute significantly to the codec delay.

   Also, if IP transmissions are used and multiple frames are
   concatenated in one IP packet, further delay is added. Then, "the
   minimum delay attributable to codec-related processing in IP-based
   systems with multiple frames per packet is:"

   $(N+1)*frameSize + lookAhead$

   "where N is the number of frames in each packet."

5.2. Discussion

   Extensive discussion on the WG mailing list led to the insight that
   the aforementioned ITU delay model overestimates the delay
   introduced by the codec. In the last decade, two developments have
   led to slightly different conditions.

   First, the processing power of CPUs has increased significantly
   (see Section 4.4). Nowadays, even stand-alone VoIP phones have CPUs
   with a speed of 300 MHz. They are capable of encoding and decoding
   faster than real time. Thus, the processing delay is no longer 100%
   of the frame duration but significantly lower; for example, it
   might be just 10% or less.

   Second, even if the CPUs are fully loaded, especially if other
   tasks such as a video conference or other calls need to be
   processed, advanced scheduling algorithms allow for timely encoding
   and decoding. For example, a staggered processing schedule can be
   used to reduce processing delays [45].

   Thus, the impact of processing delay is reduced significantly in
   most of the cases.

   Moreover, besides a look-ahead time, the decoder might also
   contribute to the algorithmic delay, e.g. if decoded and concealed
   periods are to be blended smoothly.

6. Measuring Bit and Frame Rates

   For decades, there has been a quest to achieve high quality while
   keeping the coding rate low. The coding rate, sometimes called the
   multimedia bit rate, is the bit rate that an encoder produces as
   its output stream. In the case of variable rate encoding, the
   coding bit rate varies over time. Thus, one has to describe the
   coding rate statistically; for example, minimal, mean, and maximal
   coding rates need to be measured.

   A second parameter is the frame rate as the encoder produces frames
   at a given rate.  Again, in case of discontinuous transmission modes
   (DTX), the frame rate can vary and a statistical description is
   required.

   Both coding and frame rate influence network related bit rates. For
   example, the physical layer gross bit rate is the total number of
   physically transferred bits per second over a communication link,
   including useful data as well as protocol overhead [51]. It depends
   on the access technology, the packet rate, and packet sizes. The
   physical layer net bit rate is measured in a similar way but
   excludes the physical layer protocol overhead. The network
   throughput is the maximal throughput of a communication link of an
   access network. Finally, the goodput or data transfer rate refers to
   the net bit rate delivered to an application excluding all protocol
   headers and data link layer retransmissions, etc. Typically, to
   avoid packet losses or queuing delay, the goodput shall be at least
   as large as the coding rate.

   The relation between goodput and the physical layer gross bit rate
   is not trivial.  First of all, the goodput is measured end-to-end.
   The end-to-end path can consist of multiple physical links, each
   having a different overhead. Second, the overhead of physical layers
   may vary with time and load, depending for example on link
   utilization and link quality. Third, packets may be tunneled
   through the network and additional headers (such as IPsec) might be
   added. Fourth, IP header compression might be applied (as in LTE
   networks), reducing the overhead. Overall, much information about
   the network connection must be collected to predict the relation
   between the physical layer gross bit rate and a given coding and
   frame rate. Applications, which have only a limited view of the
   network, can hardly know the precise relation.

   For example, the DCCP TFRC-SP transport protocol simply assumes a
   header size of 36 bytes on data packets (20 bytes for the IPv4
   header and 16 bytes for the DCCP-Data header with 48-bit sequence
   numbers) [7][8]. Similarly, [11] suggested a typical scenario in
   which one encoded frame is transmitted with the RTP, UDP, IPv4 and
   IEEE 802.3 protocols, so that each packet carries headers of 12, 8,
   20 and 18 bytes respectively. The gross bit rate is then calculated
   as

   $r_{gross}=r_{coding}+overhead \cdot framerate$

   where $r_{coding}$ is the coding rate of the encoding, $framerate$
   is the frame rate of the codec, $overhead$ is the number of bits
   for protocol headers in each packet (typically 58*8=464), and
   $r_{gross}$ is the rate used on the physical medium.

7. Codec Testing Procedures Used by Other SDOs

   To ensure quality, each newly standardized codec is rigorously
   tested. The ITU-T Study Groups 12 and 16 have developed mature
   procedures for testing codecs. The ITU-T Study Group 12 has
   described the testing procedures for narrow- and wide-band codecs
   in the ITU-T P.830 standard.

7.1. ITU-T Recommendation P.830

   The ITU-T P.830 recommendation describes methods and procedures for
   conducting subjective performance evaluations of digital speech
   codecs. It recommends for most applications the Absolute Category
   Rating (ACR) method using the Listening Quality scale. The process
   of judging the quality of a speech codec consists of five steps,
   which are described in the following.

   Step 1: Preparation of Source Speech Materials Including Recording
   of Talkers. When testing a narrow-band codec, the recommendation
   suggests using a bandwidth filter before applying sample items to a
   codec. This bandwidth filter is called the modified Intermediate
   Reference System (IRS) and limits the frequency band to the range
   between 300 and 3400 Hz. In addition, the recommendation states that
   "if a wideband system (100-7000 Hz) is to be used for audio-
   conferencing, then the sending end should conform to IEC Publication
   581.7."

   It also says that "speech material should consist of simple, short,
   meaningful sentences." The sentences shall be understandable to a
   broad audience and sample items should consist of two or three
   sentences, each of them having a duration of between 2 and 3
   seconds. Sample items should not contain noise or reverberations
   longer than 500 ms. The recommendation also makes suggestions on the
   loudness of the signal: "A typical nominal value for mean active
   speech level (measured according to Recommendation P.56) is
   -20 dBm0, corresponding to approximately -26 dBov"

   Step 2: Selection of Experimental Parameters to Exercise the
   Features of the Codec That Are of Interest. Various parameters shall
   be tested. Those include

   o Codec Conditions

        o  Speech input levels ("input levels of 14, 26 and 38 dB below
          the overload point of the codec")

        o  Listening levels ("levels should lie 10 dB to either side of
          the preferred listening level")

        o  Talkers

             . Different talkers ("a minimum of two male and two female
               talkers")

             . Multiple talkers ("multiple simultaneous voice input
               signals")

        o  Errors ("randomly distributed bit errors" or burst-errors)

        o  Bitrates ("The codec must be tested at all the bit rates")

        o  Transcodings ("Asynchronous tandeming", "Synchronous
          tandeming", and "Interoperability with other speech coding
          standards")

        o  Mismatch (sender and receiver operate in different modes)

        o  Environmental noise (sending) ("30 dB for room noise" and "10
          dB and 20 dB for vehicular noise")

        o  Network information signals ("signaling tones, conforming to
          Recommendation Q.35, should be tested subjectively, and the
          minimum should be proceed to dial tone,  called subscriber
          ringing tone, called subscriber engaged tone, equipment
          engaged tone, [and] number unobtainable tone.")

        o  Music ("to ensure that the music is of reasonable quality")

   o Reference conditions ("for making meaningful comparisons")

        o  Direct (no coding, only input and output filtering)

        o  Modulated Noise Reference Unit (MNRU)

        o  Signal-to-Noise Ratio (SNR) (for comparison purposes)

        o  Reference codecs

   Step 3: Design of the Experiment. The considerations described in
   B.3/P.80 apply here. Typically, it is not possible to test each
   combination of parameters. Thus, recommendation P.830 states that
   "it is recommended that a minimum set of experiments be conducted,
   which, although they would not cover every combination, would result
   in sufficient data to make sensible decisions. [...] Extreme caution
   should be used when comparing systems with widely differing
   degradations, e.g. digital codecs, frequency division multiplex
   systems, vocoders, etc., even within the same test."

   Step 4: Selection of a Test Procedure and Conduct of the Experiment.
   Here, the considerations as in B.4/P.80 apply. However, a modified
   IRS at the receiver shall be used (narrow band) or an IEC
   Publication 581.7 filter (wideband). Also, "Gaussian noise
   equivalent to -68 dBmp should be added at the input to the receiving
   system to reduce noise contrast effects at the onset of speech
   utterances."

   Step 5: Analysis of Results. Again, the considerations detailed in
   B.4.7/P.80 apply. The arithmetic mean (over subjects) is to be
   calculated for each condition at each listening level.

7.2. Testing procedure for the ITU-T G.719

   Recently, the ITU-T has standardized the audio and speech codec ITU-
   T G.719. The G.719 has similar properties as the anticipated IIAC,
   thus the optimization and characterization of the G.719 is of
   particular interest.

   In the following, we describe the "Quality Assessment Test Plan" in
   TD 322 and 323 [33][35]. The ITU-T Study Group 16 used ITU-R
   BS.1116 to test sample items. Audio sample items were sampled at 48
   kHz and mixed down to mono. Speech sample items contained one
   sentence with a duration of 4 s, mixed content had a duration of
   5-6 s, and music a duration of between 10 and 15 s. The beginnings
   and endings of the samples were smoothed. Also, a filter was
   applied to limit the nominal bandwidth of the input signal to the
   range of 20 to 20000 Hz. As mixed content, advertisements, film
   trailers and news (including a jingle) were selected. For music
   items, classical and modern styles of music were selected. Besides
   the codec under test, test stimuli degraded with LAME MP3 and G.722
   were added to the tests. Some test stimuli were modified to include
   reverberations or an interfering talker and office noise. Some
   tests studied the effect of a frame erasure rate of 3% with random
   loss patterns. All listening labs used different sample items, and
   attention was paid to not using the same material twice.

   Listening labs were required to provide the results of 24
   experienced listeners, excluding those listeners who did not pass a
   pre- and post-screening. The experienced listeners should "neither
   have a background in technical implementations of the equipment
   under test nor do they have detailed knowledge of the influence of
   these implementations on subjective quality".

   During the tests, "circum-aural headphones (open back, for example:
   STAX Signature SR-404 or Sennheiser HD-600) on both ears (diotic
   presentation)" were used. The listening levels were -26 dB relative
   to OVL.

   Some results of the listening tests are given in TD 341 R1 [34]. In
   those tests, they also compared the subjective ratings that were
   made following BS.1116 with the objective ratings of ITU-R BS.1387-
   1. The correlation between objective and subjective ratings was
   below R=0.9.

8. Transmission Channel

   Between speech encoder and decoder lies a transmission channel that
   effects the transmission. For cellular or wireless phones, the
   typical transmission channel is assumed to be equal to the wireless
   link(s). This typically means, that a circuit switch link is assumed
   (e.g., in GSM, UMTS, DECT). The bandwidth is typically constant in
   DECT and GSM or variable in a given range depending on the quality
   of the wireless transmission (UMTS). Bit errors do occur but they
   don't be equally distributed if unequal bit error correction is
   applied (UMTS).

   In the case of the IIAC codec, the transmission channel is the
   Internet. More precisely, it is the packet transmission over the
   Internet, plus the transport protocol (e.g. UDP, TCP, DCCP),
   potentially forward error correction, and dejittering buffers.

   Also, the transmission channel is reactive. It changes its
   properties depending on how much data is transmitted. For example,
   parallel TCP flows reduce their transmission bandwidth in the
   presence of an unresponsive UDP stream.

   Overall, one can say that the transmission channel "Internet" is
   difficult to understand. Thus, in this chapter, we try to shed light
   on the question of what types of transmission channels a codec has
   to cope with.

8.1. ITU-T G.1050: Network Model for Evaluating Multimedia Transmission
   Performance over IP (11/2007)

   The current ITU-T G.1050 standard [20] describes layer 3 packet
   transmission models that can be used to evaluate IP applications.
   The models are of a statistical nature. They consider network
   architectures, types of access links, QoS-controlled edge routing,
   MTU size, network faults, link failures, route flapping, reordered
   packets, packet loss, one-way delay, variable delays, and
   background traffic.

   G.1050 is a network model consisting of three parts, LAN a, LAN b,
   and an interconnection core. Both LANs can have different rates and
   occupancy and can be of different types. LAN and core are connected
   via access technologies, which might vary in data rate, occupancy
   and MTU size.

   The core is characterized by route flapping, link failures, one-way
   delay, jitter, packet loss and reordered packets. Route flaps are
   repeated changes of the transmission path caused by alternating
   routing tables. These routing updates cause incremental changes in
   the transmission delays. A link failure is a period of consecutive
   packet loss. Packet losses can be bursty, with a high loss rate
   during bursts and a lower loss rate otherwise. Delays are modeled
   via multiple different jitter models supporting delay spikes,
   random jitter and filtered random jitter.

   The standard recommends three profiles, named "Well-managed IP
   network", "Partially-managed IP network", and "Unmanaged IP Network,
   Internet", which differ in their connection qualities.


   Limitations of these models are the missing cross-correlation
   between packet delays and packet loss events, the lack of
   responsiveness to the tested application's flow, and the lack of
   link qualities that vary with time.

8.2. Draft G.1050 / TIA-921B

   Currently, an enhancement to ITU-T G.1050 (11/2007) is being
   developed (see, e.g., [13]). It does not use a statistical model
   but takes advantage of the NS-2 simulator. Thus, most of the above-
   mentioned limitations have been overcome.

   Despite that, even the new model does not yet give an answer to the
   question of which distributions of typical Internet connection
   qualities can be expected.

8.3. Delay and Throughput Distributions on the Global Internet

   In general, it is not precisely known how the qualities of end-to-
   end connections are distributed. It is also unclear whether the
   anticipated IIAC Codec will be used globally or whether its area of
   usage will be somehow restricted.

   Despite the fact that the codec has to be optimized for an unknown
   Internet, the following scientific publications give an estimate of
   how different Internet end-to-end paths might behave. One recent
   example is a study of the residential broadband Internet access
   traffic of a major European ISP [37].


         +------------------------------------------------------------+
   p 0.6-+                                                            |
   r     |  e eDonkey                                                 |
   o     |                           ee                               |
   b     |  H HTTP                   e e                              |
   a     |                          ee e                              |
   b     |                          e  e                              |
   i 0.4-+                          e   e                             |
   l     |                          e   e                             |
   i     |                         e    e                             |
   t     |                         e    e     HHHH                    |
   y     |                         e     e  HHHHHHHHH                 |
         |                        ee     e HH       HH                |
   d 0.2-+                        e      eHH         HH               |
   e     |                        e       H           HH              |
   n     |                       ee      He            HH             |
   s     |            ee         e      HH e            HH            |
   i     |           e  ee      e     HH    e            HHH          |
   t     |         ee     eeeeee HHHHHH      eeee          HHH        |
   y 0.0-+ eHeHeHeHHHHHHHHHHHHHHH                eeeeeeeeeeeeeHHHHHHH |
         +----+---------+---------+--------+---------+---------+------+
              |         |         |        |         |         |
             0.1       1.0       10       100      1000      10000

                              Throughput [kbps]

    Figure 2 Achieved throughput of flows measured for eDonkey and HTTP
                             applications [37]

   Figure 2 displays the throughput distribution of TCP connections for
   eDonkey peer-to-peer and HTTP applications. It only considers single
   flows with a length of more than 50 KByte. Typically, however, a web
   browser uses two to three TCP connections at the same time and an
   eDonkey client about 10. Still, the throughput of a single HTTP flow
   is about an order of magnitude higher than that of an eDonkey flow.
   In [37], the authors assume this is due to the fact that peer-to-
   peer connections fill the uplink and that HTTP is used on the faster
   downlink.

         +------------------------------------------------------------+
         |                                                            |
         |                    **                                      |
   p 0.8-+                    **                                      |
   r     |                    ***                                     |
   o     |                    * *                                     |
   b     |                   ** *                                     |
   a 0.6-+                   *  *                                     |
   b     |                   *  **                                    |
   i     |                   *   *                                    |
   l     |                   *   *                                    |
   i     |                   *   *                                    |
   t 0.4-+                  **   **                                   |
   y     |                  *     *                                   |
         |                  *      * ****                             |
   d     |                  *       *    *                            |
   e 0.2-+                  *            **                           |
   n     |                 **             **                          |
   s     |            **** *               ***                        |
   i     |          ***  ***                 ***                      |
   t     |        ***                          **************         |
   y 0.0-+*********                                  *****************|
         +-------+-----------------+----------------+-----------------+
                 |                 |                |                 |
                10                100             1000            10000

                                        RTT [ms]

                     Figure 3 TCP roundtrip times [36]

   Figure 3 displays TCP roundtrip times including both access and
   backbone networks. Both graphs can be seen as an indication that an
   application, even in modern Internet access networks, might be
   subjected to a wide variability of throughput ranging from a few
   kbit/s up to 10 Gbit/s and TCP round trip times from 5 ms up to
   several seconds.

   Although these results are only valid for TCP, similar results
   should be expected for RTP over UDP - with a small advantage
   because UDP flows are not always responsive.

   As a summary, a codec for the Internet should be able to work under
   these widely varying transmission conditions and should be tested
   against a wide distribution of expected throughputs.

8.4. Transmission Variability on the Internet

   Besides effects such as route flapping or link failures modeled in
   G.1050 [20], the Internet experiences sharp changes in bandwidth
   utilization on short time scales. For example, [49] and [38] showed
   that the variability of Internet traffic comes in the form of spike-
   like traffic increments. Similarly, [32] studied why the Internet is
   bursty on time scales between 100 and 1000 milliseconds.

   In the light of these results, one can assume that the IIAC's
   transmission conditions will vary in similar time scales. More
   precisely, it will be subjected to

   o  variability due to bursty traffic having a duration of between
      100 and 1000 milliseconds,

   o  interruptions due to temporary link failures every minute to
      every hour, lasting from 64 ms to several seconds [20], and

   o  route flap events every minute to every hour that add a delay of
      between 2 and 128 ms [20].
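
   As an illustration, the three impairment classes above can be drawn
   from a randomized timeline that drives a network emulator during
   codec tests. The following sketch is purely hypothetical; the event
   rate (one event every five minutes on average) and the function
   name are assumptions, only the duration ranges come from the text
   above.

```python
import random

# Hypothetical generator of a channel-impairment timeline combining the
# three effects listed above: traffic bursts (100-1000 ms), link
# failures (64 ms to several seconds), and route-flap delay steps
# (2-128 ms), drawn at randomized times over a simulated hour.
def impairment_timeline(duration_s=3600, seed=1):
    rng = random.Random(seed)
    events, t = [], 0.0
    while t < duration_s:
        t += rng.expovariate(1.0 / 300)          # mean gap: 5 minutes
        kind = rng.choice(["burst", "outage", "route_flap"])
        if kind == "burst":
            length = rng.uniform(0.1, 1.0)       # 100-1000 ms
        elif kind == "outage":
            length = rng.uniform(0.064, 4.0)     # 64 ms to several s
        else:
            length = rng.uniform(0.002, 0.128)   # 2-128 ms extra delay
        events.append((round(t, 3), kind, round(length, 3)))
    return events
```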

8.5. The Effects of Transport Protocols

   Realtime multimedia is not always transported over RTP and UDP.
   Sometimes it makes sense to use a different transport protocol or an
   additional rate adaptation. The reasons for that are manifold.

   o  If a scalable codec shall be supported, RTCP-based feedback
      information can be utilized to implement a rate control
      mechanism [41]. However, RTCP-based feedback suffers from the
      drawback that RTCP messages are allowed only every 5 s. Thus,
      implementing a fast-responding mechanism is not possible.

   o  In the presence of restrictive firewalls, VoIP can sometimes
      only be transmitted over TCP. In those cases, the transmission
      scheduling is not given by the codec but by TCP. TCP algorithms
      typically do not have a smooth sending rate but frequently send
      packets in bursts and change the number of packets sent every
      round trip time (Figure 4). More precisely, TCP causes the
      sending schedule to behave in the following way:

        .  During the Slow Start phase (for example at the beginning of
          a TCP connection) the transmission rate increases
          exponentially.

        .  If a TCP segment is not acknowledged after about four RTTs,
          the TCP sending rate starts at one packet per RTT again.

        .  During congestion avoidance, the sending rate increases
          steadily by one segment per RTT.

        .  If a congestion event is then detected, the sending rate is
          reduced by 50%.

   p 15-+-------------------------------------------------------------+
   a    |                                                             |
   c    |             **                  **              **          |
   k    |           ** *                ** *            ** *          |
   e    |         **   *              **   *          **   *          |
   t    |       **     *            **     *        **     *        **|
   s    |     **       *          **       *      **       *      **  |
      8-+   **         *        **         *    **         *    **    |
   p    |   *          *      **           *  **           *  **      |
   e    |   *          *      *            ***             ***        |
   r    |   *          *      *                                       |
      4-+  *           *     *                                        |
   R    |  *           *     *                                        |
   T  2-+ *            *    *                                         |
   T  1-+*             *   *                                          |
        +---------+---------+---------+---------+---------+---------+-+
        |         |         |         |         |         |         |
        0        10        20        30        40        50        60

                      time in round-trip times (RTT)

             Figure 4 Sending rate of a standard TCP over time
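
   The sawtooth of Figure 4 can be reproduced with a toy model of
   TCP's congestion window. This is a simplified sketch (it ignores
   fast recovery, delayed ACKs, and receiver windows); the function
   name and parameter values are illustrative assumptions.

```python
# Toy model of TCP's congestion window in packets per RTT, showing
# slow start (exponential growth), congestion avoidance (+1 per RTT),
# halving on a congestion event, and a restart at 1 after a timeout.
def tcp_cwnd_trace(rtts, ssthresh=8, loss_rtts=(), timeout_rtts=()):
    cwnd, trace = 1.0, []
    for t in range(rtts):
        trace.append(cwnd)
        if t in timeout_rtts:
            ssthresh, cwnd = max(cwnd / 2, 2), 1.0  # retransmit timeout
        elif t in loss_rtts:
            cwnd = max(cwnd / 2, 1)                 # congestion: halve
            ssthresh = cwnd
        elif cwnd < ssthresh:
            cwnd *= 2                               # slow start
        else:
            cwnd += 1                               # congestion avoidance
    return trace
```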

   o  The DCCP transport protocol supports multiple congestion control
      protocols and provides means to support TCP friendliness without
      retransmission. Thus, it is suitable for real-time multimedia
      transmissions. DCCP supports a TCP emulation, which shows a
      similar rate over time as TCP, and the TFRC congestion control,
      which changes its rate in a smoother way (Figure 5).
      Besides TFRC, which is intended for packets of maximal size (aka
      MTU), TFRC-SP is optimized for flows with variable packet sizes
      such as VoIP. With TFRC-SP, smaller packets can be transmitted
      at a faster pace than larger packets because they contribute
      less to the gross bandwidth consumption.
      The TFRC protocol might provide a lower bandwidth and a lower
      QoE than UDP or TCP unless proper optimizations are applied (see
      [48]). Also, it is suggested to limit the rate control to 100
      packets per second. This limit might be too low for an IIAC.


   p 15-+-------------------------------------------------------------+
   a    |                                                             |
   c    |             **                  **                  **      |
   k    |           **  **              **  **              **  **    |
   e    |         **      **          **      **          **      **  |
   t    |       **          **      **          **      **          **|
   s    |     **              **  **              **  **              |
      8-+   **                  **                  **                |
   p    |   *                                                         |
   e    |   *                                                         |
   r    |   *                                                         |
      4-+  *                                                          |
   R    |  *                                                          |
   T  2-+ *                                                           |
   T  1-+*                                                            |
        +---------+---------+---------+---------+---------+---------+-+
        |         |         |         |         |         |         |
        0        10        20        30        40        50        60

                     time in round-trip times (RTT)

                Figure 5 Sending rate of the TFRC protocol
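
   TFRC derives its smooth sending rate from an equation-based model
   of TCP throughput. The following sketch implements a simplified
   form of the TCP throughput equation used by TFRC (RFC 5348), with
   the retransmission timeout approximated as four round-trip times;
   the function name is an assumption.

```python
import math

# Simplified TFRC throughput equation (after RFC 5348): the allowed
# sending rate X in bytes/s as a function of segment size s (bytes),
# round-trip time rtt (s), loss event rate p, and packets per ACK b.
def tfrc_rate(s, rtt, p, b=1):
    t_rto = 4 * rtt                              # RTO approximation
    denom = rtt * math.sqrt(2 * b * p / 3) + \
        t_rto * 3 * math.sqrt(3 * b * p / 8) * p * (1 + 32 * p ** 2)
    return s / denom
```

   Because the rate depends on the smoothed loss event rate p rather
   than on individual loss events, it changes gradually, as shown in
   Figure 5.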

   In general, the transport protocol has a clear influence on the
   transmission conditions. Coding rates need to be adapted, both
   sharply and smoothly, to changed bandwidth estimates. Changes of
   the bandwidth estimate may occur every RTT. Also, in case of a TCP
   timeout, the transmission is halted and the decoding must be
   stalled.

8.6. The Effect of Jitter Buffers and FEC

   Both jitter buffers and FEC trade frame losses against delay. In
   the case of a jitter buffer, frames are delayed before playout.
   This helps with late-arriving frames that would otherwise be
   ignored and would have to be concealed. Jitter buffers are adaptive
   and adjust dynamically to the current delay and loss process on the
   Internet.

   Forward Error Correction helps to cope with isolated losses as
   redundant speech frames are transmitted in the following packets. In
   the presence of loss, FEC increases the delay because the receiver
   has to wait for the following packets. Both delay and packet losses
   are important contributors to the overall Quality of Experience [2].

   Since the delay process on the Internet often follows a gamma
   distribution, a statistical monitor of past delays helps to predict
   the size of future jitter. Then, if the playout schedule does not
   match the predicted delay process, playout can be accelerated or
   slowed down.
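
   An adaptive playout estimator along these lines can be sketched as
   follows. For brevity, the sketch uses an empirical percentile of
   recently observed delays instead of fitting a gamma distribution;
   the class name, window size, and default delay are illustrative
   assumptions.

```python
from collections import deque

# Hypothetical adaptive playout-delay estimator: it keeps a sliding
# window of observed one-way delays and schedules playout at a high
# percentile, trading a small residual late-loss rate against delay.
class PlayoutEstimator:
    def __init__(self, window=500, late_loss_target=0.01):
        self.delays = deque(maxlen=window)  # recent delays in ms
        self.q = 1.0 - late_loss_target     # e.g. the 99th percentile

    def observe(self, delay_ms):
        self.delays.append(delay_ms)

    def playout_delay(self):
        """Buffering delay; frames arriving later are concealed."""
        if not self.delays:
            return 60.0                     # conservative default
        s = sorted(self.delays)
        idx = min(int(self.q * len(s)), len(s) - 1)
        return s[idx]
```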

   However, due to the reasons described in Section 8.4, not all
   increases in transmission time may be predictable. This has a
   profound effect on the jitter buffer, as it cannot reliably predict
   whether a frame is lost or whether it is merely delayed.
   If a frame is scheduled for playout but has not been received, the
   jitter buffer has to consider two cases. First, the frame is lost
   and has to be concealed. This typically means that the audio signal
   needs to be extrapolated or interpolated to conceal the gap due to a
   lost frame. Second, the frame is delayed and shall be played out at
   a later point in time. Then, the resulting gap in playout must be
   concealed by extrapolating the previous audio signal.

   These issues have an effect on testing the concealment algorithm of
   the codec. The same concealment function must be tested both for
   time-gap concealment and for loss concealment.

8.7. Discussion

   Judging a codec's performance using a realistic model of a
   transmission channel is difficult. Good models of IP transmission
   channels are available. However, before a codec can be tested
   against those channels, further building blocks such as the
   transport protocol, the jitter buffer, and FEC should be known - at
   least roughly.

   Alternatively, a codec can be tested against packet loss patterns
   only, without considering any rate adaptation or playout
   rescheduling. But then again, the codec should additionally be
   tested for those impairments that occur due to the dynamics of the
   Internet. These include

   o slowing down and speeding up the playout in cases of moderate
      rescheduling of playout times,

   o stalling and resuming the playout in cases of temporary link
      outages,

   o moderately reducing and increasing bit and frame rates during
      contention periods,

   o sharply reducing (in case of congestion) and rapidly increasing
      (during connection establishment) bit and frame rates, and

   o time-gap and loss concealment.

9. Usage Scenarios

   Quality of Experience is the service quality perceived subjectively
   by end-users (refer to Section 2) and, as ITU-T document G.RQAM
   [21] states, "overall acceptability may be influenced by user
   expectations and context". Thus, in this section we describe the
   usage scenarios in which the IIAC codec will probably be used and
   the expectations users have in those communication contexts. We
   list seven main scenarios and describe their quality requirements.

9.1. Point-to-point Calls (VoIP)

   The classic scenario is that of ordinary phone usage, to which we
   refer in this document as Voice over IP (VoIP). Human speech is
   transmitted interactively between two Internet hosts. Typically,
   besides speech, some background noise is present, too.

   The quality of a telephone call is traditionally judged by
   subjective tests such as those described in [24]. The ACR scale
   used for MOS-LQS sometimes might not be very suitable for high-
   quality calls; in such cases, for example, the MUSHRA [16] rating
   can be applied.

   A telephone call is considered good if it has a maximal mouth-to-
   ear delay of 150 ms [17] and a speech quality of MOS-LQS 4 or
   above. However, interhuman communication is still possible if the
   mouth-to-ear delay is much larger.

   The effect of delay jitter might not be very noticeable in the case
   of speech. Thus, playout rescheduling can take place frequently.

   In many cases, phone calls are made between mobile devices such as
   mobile and cellular phones. In these cases, energy consumption is
   crucial, and both complexity and transmission rate may be reduced
   to save resources.

9.2. High Quality Interactive Audio Transmissions (AoIP)

   In this scenario we consider a telephone call having a very good
   audio quality at modest acoustic one-way latencies ranging from 50
   to 150 ms [17], so that music can be listened to over the telephone
   while two persons are talking interactively.

   While delay expectations might be similar to those of classic
   telephony, the audio quality must meet similar standards as those of
   consumer Hifi equipment like MP3 and CD players, iPods, etc.

   If music is played, playout rescheduling events may easily be heard
   as the rhythm changes. Only a few studies such as [10] have been
   made to examine the effect of time-varying delays on service
   quality. In general, it can be assumed that the requirements
   regarding the constancy of playout schedules are higher than in the
   case of speech because human beings notice rhythmic changes easily.
   Thus, in the presence of music, frequent playout rescheduling shall
   be avoided.

9.3. High Quality Teleconferencing

   Also, for today's teleconferencing and videoconferencing systems
   there is a strong and increasing demand for audio coding providing
   the full human auditory bandwidth of 20 Hz to 20 kHz. This rising
   demand for high quality audio is due to the following reasons:

   o Conferencing systems are increasingly used for more elaborate
      presentations, often including music and sound effects, which
      occupy a wider audio bandwidth than that of speech. For example,
      Web conferences such as WebEx, GoToMeeting, and Adobe Acrobat
      Connect are based on IP-based transmission.

   o The new "Telepresence" video conferencing systems, providing the
      user with High Definition video and audio quality, create the
      experience of being in the same room by introducing high quality
      media delivery (such as from Cisco).

   o The emerging Digital Living Rooms are to be interconnected and
      might require a constant, high-quality acoustic transmission.

   o Spatial audio teleconference solutions increase the quality
      because they take advantage of the cocktail-party effect. By
      taking advantage of 3D audio, participants can be identified by
      their location in a virtual acoustic environment and multiple
      talkers can be distinguished from each other. However, these
      systems require stereo audio, if the spatial audio is rendered
      for headphones.

9.4. Interconnecting to Legacy PSTN and VoIP (Convergence)

   This scenario includes the use case of using a VoIP-PSTN gateway to
   connect to legacy telephone systems. In those cases, the gateway
   would make an audio conversion from broadband Internet voice to the
   frugal 1930's 3.1 kHz audio bandwidth.

   The quality requirements in this scenario are low because legacy
   PSTN typically uses narrow-band voice. Also, in those cases one
   might expect that the codec negotiation decides on a common codec
   for both PSTN and VoIP in order to avoid transcoding.

   However, the complexity requirements might be stringent because
   central media gateways must scale to a high number of users. In
   this context, hardware costs are an important criterion and the
   codec has to operate efficiently.

9.5. Music streaming

   Music streaming typically does not require low delays. However, in
   special cases such as live events and in the presence of alternative
   transmission technologies, low-delay streaming may be demanded.

   Examples are important sports events, which are streamed both on
   terrestrial (analogue), low-delay broadcast networks and on IP-
   based distribution networks. Viewers using the latter become aware
   of events (such as when a footballer scores) later than their
   neighbors using terrestrial technology.

9.6. Ensemble Performances over a Network

   In some usage scenarios, users want to act simultaneously and not
   just interactively. For example, if persons sing in a chorus, if
   musicians jam, or if e-sportsmen play computer games together in a
   team, they need to communicate acoustically.

   In this scenario, the latency requirements are much stricter than
   for interactive usages. For example, if two musicians are placed
   more than 10 meters apart, they can hardly stay synchronized.
   Empirical studies [10] have shown that if ensembles play over
   networks, the optimal acoustic latency is at around 11.5 ms with a
   targeted range from 10 to 25 ms.
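
   The 10-meter figure can be checked with a back-of-the-envelope
   computation, assuming sound travels at roughly 343 m/s in air; the
   function name is an assumption.

```python
# One-way acoustic propagation delay in milliseconds, assuming a
# speed of sound of about 343 m/s in air at room temperature.
def acoustic_delay_ms(distance_m, speed_m_per_s=343.0):
    return 1000.0 * distance_m / speed_m_per_s
```

   For 10 meters this yields about 29 ms, which already exceeds the
   targeted range of 10 to 25 ms.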

   Also, the users demand very high audio quality, very low delay and
   very few events of playout rescheduling.

9.7. Push-to-talk like Services (PTT)

   In spite of the development of broadband access (xDSL), many users
   only have service access via PSTN modems or mobile links. Also, on
   these links the available bandwidth might be shared among multiple
   flows and is subject to congestion. Then, even low coding rates of
   about 8 kbps are too high.

   If transmission capacity is scarce, one can still degrade the
   quality of a telephone call to something like a push-to-talk (PTT)
   service having very high latencies. Technically, this scenario
   takes advantage of bandwidth gains due to discontinuous
   transmission (DTX) modes and very large packets containing multiple
   speech frames, causing a very low packetization overhead.

   The quality requirements of a push-to-talk like service have hardly
   been studied. The OMA lists as requirements of a push-to-talk over
   cellular service a transmission delay of 1.6 s and a MOS value of
   above 3.0 that typically should be kept [39]. However, as long as
   an understandable transmission of speech is possible, the delay can
   be even higher. For example, [39] allows a delay of typically up to
   4 s for the first talk-burst. Also, [39] describes a maximum
   duration of speaking. If a speaking participant reaches the time
   limit, the participant's right to speak shall be automatically
   revoked.

   If the quality of a telephone call is very low, then the degree of
   intelligibility can be chosen as performance metric instead of
   listening-only speech quality. For example, objective tests of
   intelligibility use automatic speech recognition (ASR) systems and
   measure the fraction of correctly recognized words.
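
   Such a word-accuracy metric can be sketched via the minimum edit
   distance between the reference transcript and the ASR output (word
   accuracy is one minus the word error rate); the function name is an
   assumption.

```python
# Hypothetical word-accuracy metric: 1 - WER, where the word error
# rate is the minimum edit distance between the reference transcript
# and the recognized word sequence, divided by the reference length.
def word_accuracy(reference, recognized):
    r, h = reference.split(), recognized.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return 1.0 - dp[len(r)][len(h)] / len(r)
```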

   In any case, the participant shall be informed about the quality of
   the connection, the presence of high delays, the half-duplex style
   of communication, and its (limited) right to speak. For example,
   this can be achieved by a simulated talker echo.

9.8. Discussion

   The requirements of the usage scenarios are summarized in the
   following table.

                |     Sound Quality  |      Latency       | Complexity
     Scenario   | low  | avg. | hifi | 10ms | 150ms| high | low  | high
   -------------+------+------+------+------+------+------+------+-----
   VoIP         |   X  |      |      |      |   X  |      |   X  |   X
   AoIP         |      |   X  |   X  |      |   X  |      |      |   X
   Conference   |      |   X  |      |      |   X  |      |      |   X
   Convergence  |   X  |      |      |      |   X  |      |   X  |   X
   Streaming    |      |   X  |   X  |      |      |   X  |      |   X
   Performances |      |      |   X  |   X  |      |      |      |   X
   Push-To-Talk |   X  |      |      |      |      |   X  |   X  |   X

       Figure 6 Different requirements for different usage scenarios

10. Recommendations for Testing the IIAC

   The IETF IIAC differs substantially from a classic narrowband or
   wideband codec. Thus, the previously applied codec testing
   procedures such as ITU-T P.830 cannot be entirely adopted. Instead,
   one must carefully check which of the procedures can be used
   without changes, which procedures need minor changes, and which
   procedures must be dropped or replaced.

   In Section 1 we listed five groups of stakeholders, which have
   different requirements and demands on how to test the quality of an
   IIAC. In the following, we recommend testing procedures for those
   stakeholders.

10.1. During Codec Development

   Codec development is an innovative process. In general, innovation
   and research benefit from openness and discussion between experts.
   Thus, formal restrictions on how to test the codec might hinder the
   codec development because innovation may also take place in testing
   procedures. Instead, many experts both in codec development and
   codec usage shall be able to participate. If this is the case, they
   contribute their expertise, identify weaknesses, and discuss
   potential codec enhancements. During innovation, openness in
   participation and discussion is very fruitful and leads to good
   results.

   Based on their ongoing experience, codec developers know best how
   to test their codecs. Typically, those tests include informal
   testing, semiformal testing, and expert interviews. They are
   intended to find weaknesses in the codec, to identify artifacts or
   distortions, and to achieve algorithmic progress.

10.2. Characterization Phase

   The characterization phase is intended to study the features, the
   quality tradeoff and the properties of a codec under
   standardization. It is intended to be an objective measure of the
   codec's quality to convince third parties of the quality properties
   of the standardized codec. In order to achieve this aim, a formal
   testing procedure has to be established.

   In general, we recommend basing the procedure of the
   characterization phase on procedures similar to those that were
   used for the G.719 standardization (Section 7.2 and especially
   [35]). In the following, we describe the suggested testing
   procedure for the characterization phase.

10.2.1. Methodology

   The testing of sound quality can be done using MUSHRA tests with
   eight samples and three anchors. One anchor is the known reference,
   the second one is a hidden reference, and the third one is the
   hidden anchor. It is suggested to use a bandwidth-filtered signal
   with a low-pass filter at 3.5 kHz. However, because a wide range of
   qualities is to be tested, ranging from Hifi down to toll quality,
   it is beneficial to add a further low-quality anchor such as a 3.5
   kHz bandwidth sample distorted by modulated noise (MNRU) [25], for
   example with an MNRU strength of Q=25 dB, which corresponds to a
   MOS value of 1.79 [4].

10.2.2. Material

   Reference samples should be 48 kHz sampled, stereo channel material.
   The nominal bandwidth of the reference samples shall be limited to
   the range of 20 to 20000 Hz. Three different kinds of contents shall
   be tested: speech, music and mixed content.

   Speech samples shall include different languages including English
   and tonal languages. The speech samples shall be recorded in a quiet
   environment without background noise or reverberations. The speech
   samples shall contain one meaningful sentence having a length of
   about 4 s.

   Music samples shall contain a wide variety of music styles including
   classical music, pop, jazz, and single instruments. The length of
   samples shall be between 10 and 15 s. A smoothing of 100 ms both at
   the beginning and at the end shall be conducted, if required.
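
   The 100 ms smoothing can be sketched as a linear fade-in and fade-
   out, assuming 48 kHz mono PCM given as a list of floats; the
   function name is an assumption.

```python
# Sketch of a 100 ms linear fade-in and fade-out applied to a sample,
# assuming mono PCM at 48 kHz given as a list of float amplitudes.
def apply_fades(samples, rate=48000, fade_ms=100):
    n = min(int(rate * fade_ms / 1000), len(samples) // 2)
    out = list(samples)
    for i in range(n):
        g = i / n
        out[i] *= g          # fade in at the beginning
        out[-1 - i] *= g     # fade out at the end
    return out
```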

   Mixed content may contain advertisements, film trailers, news with
   jingles, and other mixtures of speech, music, and noises. The
   length should be about 5-6 s.

10.2.3. Listening Laboratory

   Multiple independent laboratories shall conduct the listening
   tests. They are responsible for generating or selecting reference
   samples as well as for the pre- and post-screening of subjects. In
   the end, the results of about 24 experienced listeners shall be
   published (in addition to the samples).

   The tests must be conducted in a quiet listening environment at
   about NC25 (approximately 35 dBA). For example, an ISOBOOTH room
   can be used.

   It is recommended to use a high-quality D/A converter, such as a
   Benchmark DAC, Metric Halo ULN-2, or Apogee MiniDAC. High-quality
   headphone amplifiers and playback level calibration shall be used.
   Playback levels might be measured via Etymotic in-ear microphones.
   Also, high-quality headphones (e.g., AKG 240DF, Sennheiser HD600)
   are advisable.

10.2.4. Degradation Factors

   The IIAC is likely to be highly configurable. However, due to time
   limits, only a few parameter sets can be tested subjectively. Thus,
   we recommend conducting subjective studies with

   o different bit rates (from low to high, 5 tests)

   o different frame rates (from low to high, 2 tests)

   o different loss patterns (G.1050 profiles A, B, and C at a low
      rate with speech content and at a high rate with music content.
      The influence of jitter, delay, and link failures shall be
      ignored. In total, this would be 6 tests)

   o different sample contents

        o Speech, speech+reverberations, and speech+noise+reverberations
          at low and medium rates (3 tests).

        o The speech sample must be tested in different languages
          (English, Chinese, ...) and with male/female voices (6 tests)

        o Mixed content and music shall be tested at medium and high
          rates (about 10 tests).

   o A low-complexity mode, DTX, and the FEC mode shall be tested at
      low rates because they are typically used on constrained devices
      (3 tests)

   o Abrupt changes in bit and frame rates (reduction by half,
      exponential start, 2 tests)

   o Smooth changes of bit and frame rates (incrementing or
      decreasing the codec's gross rate by 1.5 kbyte every 100 ms, 2
      tests)

   o Stall and continue operations (20, 200, and 1000 ms, 3 tests)

   o Accelerated and slowed down playout (+- 10% for speech at low
      rates)

   o Reference codecs such as LAME MP3, G.719, and AMR, each at two
      coding rates (6 tests)

   These alone amount to 48 different tests that need to be conducted.

   In addition, objective tests shall be run for intermediate values
   using PEAQ (for music) and P.OLQA (for speech). The intermediate
   results shall be mapped onto the MUSHRA scale with a quadratic
   regression because PEAQ and P.OLQA use the ODG and MOS scales,
   respectively.
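
   The mapping can be sketched as follows. The calibration pairs below
   are illustrative placeholders, not measured anchor data, and the
   function names are ours:

```python
# Sketch: map objective scores (PEAQ ODG or P.OLQA MOS) onto the
# 0..100 MUSHRA scale via quadratic regression. The anchor pairs
# below are illustrative placeholders, not measured data.
import numpy as np

def fit_quadratic_map(objective, mushra):
    """Fit MUSHRA ~ a*x^2 + b*x + c and return the mapping function."""
    a, b, c = np.polyfit(objective, mushra, deg=2)
    def to_mushra(x):
        # Clamp to the legal MUSHRA range [0, 100].
        return float(np.clip(a * x * x + b * x + c, 0.0, 100.0))
    return to_mushra

# Hypothetical calibration: ODG (-4 = very annoying .. 0 = imperceptible)
# paired with MUSHRA ratings of the same stimuli.
odg    = [-3.8, -2.5, -1.2, -0.4, -0.1]
mushra = [12.0, 35.0, 62.0, 85.0, 96.0]

peaq_to_mushra = fit_quadratic_map(odg, mushra)
print(round(peaq_to_mushra(-2.0), 1))
```

   An analogous fit from the MOS scale (1..5) onto MUSHRA would simply
   replace the ODG anchors with MOS values.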

10.3. Application Developers

   Application developers can take advantage of the results of the
   qualification phase. They may use these results to develop a
   quality model, which describes the expected quality of the codec at
   a given parameter set (refer to [11] for an example).
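
   As an illustration of such a quality model, the following minimal
   sketch predicts an R-factor from one-way delay and packet loss in
   the spirit of the E-model [17]. The coefficients follow the
   commonly quoted simplified forms of G.107 and are an approximation,
   not the full Recommendation:

```python
# Simplified E-model sketch (after ITU-T G.107): predict an R-factor
# from one-way delay and packet loss for a given codec operating point.
# Treat the constants as approximations of the defaults, not as the
# full Recommendation.

def r_factor(delay_ms, loss_pct, ie=0.0, bpl=10.0):
    """Simplified rating: R = 93.2 - Id - Ie_eff."""
    # Delay impairment Id: grows slowly below the ~177 ms knee,
    # then more steeply above it.
    i_d = 0.024 * delay_ms
    if delay_ms > 177.3:
        i_d += 0.11 * (delay_ms - 177.3)
    # Effective equipment impairment for random loss with packet-loss
    # robustness factor Bpl.
    ie_eff = ie + (95.0 - ie) * loss_pct / (loss_pct + bpl)
    return 93.2 - i_d - ie_eff

# Example: 150 ms one-way delay, 1% loss, codec with Ie = 0.
print(round(r_factor(150.0, 1.0), 1))  # -> 81.0
```

   A quality model fitted from the qualification results would replace
   these generic constants with codec-specific ones.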

   In addition, they can test their system using the draft G.1050
   simulation model, which is especially useful for optimizing rate
   control, dejittering buffers, and concealment algorithms. Different
   systems may be tested with quality models, subjective listening
   tests, conversational tests, or objective measures such as P.OLQA.

   Also, field tests may be conducted to test the effect of a real
   network on the VoIP application.

10.4. Codec Implementers

   To test the conformance of a codec, codec implementers can use
   objective tools such as PEAQ or P.OLQA to check whether the newly
   implemented codec performs similarly to the reference
   implementation. These tests shall be done for many different
   parameter sets.
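
   Such a conformance check can be automated. The sketch below
   substitutes a plain signal-to-noise comparison for PEAQ/P.OLQA,
   since those tools are not freely scriptable, and the 40 dB
   threshold is an arbitrary placeholder:

```python
# Hedged sketch of a conformance loop: compare the output of an
# implementation under test against the reference decoder's output.
# A real harness would substitute an objective quality tool (PEAQ,
# P.OLQA) for the SNR metric used here.
import math

def snr_db(reference, candidate):
    """SNR of candidate against reference (equal-length float lists)."""
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, candidate))
    if noise == 0.0:
        return float("inf")
    return 10.0 * math.log10(signal / noise)

def conforms(reference, candidate, threshold_db=40.0):
    # Placeholder pass/fail criterion; a deployed test plan would
    # derive the threshold per parameter set.
    return snr_db(reference, candidate) >= threshold_db

ref = [math.sin(0.01 * n) for n in range(1000)]   # reference output
cand = [x + 1e-4 for x in ref]                    # implementation under test
print(conforms(ref, cand))  # -> True
```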

10.5. End Users

   End users may be included in the qualification tests. The
   intentions of these tests are two-fold. First, the awareness of end
   users shall be increased. Second, querying users may be a
   cost-effective way of conducting listening-only tests.

   However, before the rating results of end users can be considered
   for further usage, one needs to compare formal and web-based
   testing results to see to what extent they differ from each other.
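
   One simple way to make this comparison concrete is the Pearson
   correlation between lab and web ratings of shared stimuli. The
   score lists below are illustrative placeholders:

```python
# Sketch: quantify agreement between formal lab ratings and web-based
# ratings of the same stimuli with the Pearson correlation
# coefficient. The scores are invented for illustration.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

lab = [82, 61, 45, 90, 33]   # MUSHRA scores from a formal test
web = [78, 65, 40, 88, 30]   # same stimuli rated by web volunteers
print(round(pearson(lab, web), 3))
```

   A correlation near 1.0 would suggest web-based ratings track the
   formal results; systematic offsets would still need a separate
   bias check.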

11. Security Considerations

   The results of the quality tests shall be convincing. Thus, special
   care has to be taken to make the tests precise, accurate, repeatable
   and trustworthy.

   Some testing houses may have a conflict of interest between
   accurate quality ratings and the promotion of their own codecs.
   Thus, a high degree of openness shall be enforced that requires all
   of the testing material and results to be published. This way,
   others may verify the results of the testing houses. In addition,
   some stimuli shall be tested by all the testing houses to compare
   the quality of their ratings.

   Moreover, hidden anchors may help to identify subjects who rate
   the quality of samples less precisely.

12. IANA Considerations

   This document has no actions for IANA.

13. References

13.1. Normative References

13.2. Informative References

   [1]  R. Birke, M. Mellia, M. Petracca, D. Rossi, "Understanding
         VoIP from Backbone Measurements", IEEE INFOCOM 2007, 26th IEEE
         International Conference on Computer Communications, pp.2027-
         2035, May 2007.

   [2]  C. Boutremans, J.-Y. Le Boudec, "Adaptive joint playout buffer
         and FEC adjustment for Internet telephony", IEEE INFOCOM
         2003, Twenty-Second Annual Joint Conference of the IEEE
         Computer and Communications Societies, vol. 1, pp. 652-662,
         30 March - 3 April 2003.

   [3]  Broadcom, "BCM1103: GIGABIT IP PHONE CHIP", Jan. 2005,
         http://www.datasheetcatalog.org/datasheet2/3/07ozspx224dsarq6z
         u13i2ofyqyy.pdf

   [4]  N. Cote, V. Koehl, V. Gautier-Turbin, A. Raake, S. Moeller,
         "Reference Units for the Comparison of Speech Quality Test
         Results", Audio Engineering Society Convention 126, May 2009.

   [5]  Ericsson, "Analysis of PEAQ's applicability in predicting the
         quality difference between alternative implementations of the
         G.722.1FB coding algorithm", ITU-T SG12, received 2008-05-09,
         related to question(s): Q9/12, meeting 2008-05-22.

   [6]  ETSI TC-TM, "ETR 250: Transmission and Multiplexing (TM);
         Speech communication quality from mouth to ear for 3,1 kHz
         handset telephony across networks", ETSI Technical Report,
         July 1996.

   [7]  S. Floyd, E. Kohler, "Profile for Datagram Congestion Control
         Protocol (DCCP) Congestion ID 4: TCP-Friendly Rate Control for
         Small Packets (TFRC-SP)", RFC 5622, August 2009.

   [8]  S. Floyd, E. Kohler, "TCP Friendly Rate Control (TFRC): The
         Small-Packet (SP) Variant", RFC 4828, April 2007.

   [9]  J. Gruber, G. Williams, Transmission Performance of Evolving
         Telecommunications Networks, Artech House, 1992.

   [10] M. Gurevich, C. Chafe, G. Leslie, S. Tyan, "Simulation of
         Networked Ensemble Performance with Varying Time Delays:
         Characterization of Ensemble Accuracy", Proceedings of the
         2004 International Computer Music Conference, Miami, USA,
         2004.

   [11] C. Hoene, H. Karl, A. Wolisz, "A perceptual quality model
         intended for adaptive VoIP applications", International
         Journal of Communication Systems, Wiley, August 2005.

   [12] J. Holub, J.G. Beerends, R. Smid, "A dependence between
         average call duration and voice transmission quality:
         measurement and applications," Wireless Telecommunications
         Symposium, 2004, pp. 75- 81, May 2004.

   [13] ITU, "Incoming LS: Proposed G.1050/TIA-921B IP Network Model
         Simulation", ITU-T SG 12, Temporary Document 268-GEN, May 12,
         2010.

   [14] ITU, "ITU-R BS.1116-1: Methods for the subjective assessment
         of small impairments in audio systems including multichannel
         sound systems", Recommendation, October 1997.

   [15] ITU, "ITU-R BS.1387: Method for objective measurements of
         perceived audio quality", Recommendation, November 2001.

   [16] ITU, "ITU-R BS.1534-1: Method for the subjective assessment of
         intermediate quality levels of coding systems",
         Recommendation, January 2003.

   [17] ITU, "ITU-T G.107: The E-model: a computational model for use
         in transmission planning", Recommendation, April 2009.

   [18] ITU, "ITU-T G.114: One-way transmission time", Recommendation,
         May 2003.

   [19] ITU, "ITU-T G.191: Software tools for speech and audio coding
         standardization", Recommendation, March 2010.

   [20] ITU, "ITU-T G.1050: Network model for evaluating multimedia
         transmission performance over Internet Protocol",
         Recommendation, November 2007.

   [21] ITU, "ITU-T G.RQAM: Reference guide to QoE assessment
         methodologies", standard draft TD 310rev1, May 2010.

   [22] ITU, "ITU-T P.10/G.100: Vocabulary and effects of transmission
         parameters on customer opinion of transmission quality",
         Recommendation, July 2006.

   [23] ITU, "ITU-T P.800: Methods for objective and subjective
         assessment of quality", Recommendation, August 1996.

   [24] ITU, "ITU-T P.805: Subjective evaluation of conversational
         quality", Recommendation, April 2007.

   [25] ITU, "ITU-T P.810: Modulated noise reference unit (MNRU)",
         Recommendation, February 1996.

   [26] ITU, "ITU-T P.830: Subjective performance assessment of
         telephone-band and wideband digital codecs", Recommendation,
         February 1996.

   [27] ITU, "ITU-T P.862: Perceptual evaluation of speech quality
         (PESQ): An objective method for end-to-end speech quality
         assessment of narrow-band telephone networks and speech
         codecs", Recommendation, February 2001.

   [28] ITU, "ITU-T P.862.1: Mapping function for transforming P.862
         raw result scores to MOS-LQO", Recommendation, November 2003.

   [29] ITU, "ITU-T P.862.2: Wideband extension to Recommendation
         P.862 for the assessment of wideband telephone networks and
         speech codecs", Recommendation, November 2007.

   [30] ITU, "ITU-T P.862.3: Application guide for objective quality
         measurement based on Recommendations P.862, P.862.1 and
         P.862.2", Recommendation, November 2007.

   [31] ITU, "ITU-T P.880: Continuous evaluation of time-varying
         speech quality", Recommendation, May 2004.

   [32] H. Jiang, C. Dovrolis, "Why is the internet Traffic Bursty in
         Short Time Scales?" Sigmetrics'05, Banff, Alberta, Canada,
         June 2005.

   [33] C. Lamblin, R. Even, "Processing Test Plan for the ITU-T
         G.722.1 fullband extension optimization/characterization
         phase", ITU-T Study Group 16, Temporary Document TD 322 (WP
         3/16), 22 April - 2 May 2008.

   [34] C. Lamblin, R. Even, "G.722.1 fullband extension
         characterization phase test results: objective (ITU-R BS.1387-
         1) and subjective (ITU-R BS.1116) scores", ITU-T Study Group
         16, Temporary Document TD 341 R1 (WP 3/16), 22 April - 2 May
         2008.

   [35] C. Lamblin, R. Even, "G.722.1 fullband extension
         optimization/characterization Quality Assessment Test Plan",
         ITU-T Study Group 16, Temporary Document TD 323 (WP 3/16), 22
         April - 2 May 2008.

   [36] J. Lee, J. Kim, C. Jang, S. Kim, B. Egger, K. Kim, S. Han,
         "FaCSim: A Fast and Cycle-Accurate Architecture Simulator for
         Embedded Systems", in Proceedings of the International
         Conference on Languages, Compilers, and Tools for Embedded
         Systems (LCTES'08), Tucson, Arizona, USA, June 2008, software
         available at http://facsim.snu.ac.kr/.

   [37] G. Maier, A. Feldmann, V. Paxson, M. Allman, "On Dominant
         Characteristics of Residential Broadband Internet Traffic",
         IMC'09, November 4-6, 2009, Chicago, Illinois, USA.

   [38] T. Mori, S. Naito, R. Kawahara, S. Goto, "On the
         characteristics of internet traffic variability: Spikes and
         Elephants", SAINT'04, 2004.

   [39] Open Mobile Alliance, "Push to talk over Cellular
         Requirements", Approved Version 1.0, 09 Jun 2006, OMA-RD-PoC-
         V1_0-20060609-A.pdf

   [40] OPTICOM, SwissQual, TNO, "Announcement of OPTICOM, SwissQual
         and TNO to submit a joint P.OLQA model", ITU-T SG 12,
         Contribution 117, Received on 2010-05-07. Related to
         question(s): Q9/12.

   [41] D. Sisalem, A. Wolisz, "Towards TCP-friendly adaptive
         multimedia applications based on RTP", IEEE International
         Symposium on Computers and Communications, pp. 166-172, 1999.

   [42] S. Smirnoff, K. Pupkov, "SoundExpert, How it Works, Audio
         quality measurements in the digital age",
         http://soundexpert.org/, retrieved Nov. 2010.

   [43] L. Sun, "Speech Quality Prediction for Voice over Internet",
         PhD thesis, University of Plymouth, January 2004,
         http://www.tech.plymouth.ac.uk/spmc/people/lfsun/mos/.

   [44] Texas Instruments, "C64x+ CPU Cycle Accurate Simulator",
         October 2010,
         http://processors.wiki.ti.com/index.php/C64x%2B_CPU_Cycle_Accu
         rate_Simulator.

   [45] Texas Instruments, "TNETV3020: Carrier Infrastructure
         Platform, Telogy Software products integrated with TI's DSP-
         based high-density communications processor", 2008,
         http://focus.ti.com/lit/ml/spat174a/spat174a.pdf

   [46] TransNexus, "Asterisk V1.4.11 Performance", webpage, accessed
         Nov. 2010,
         http://www.transnexus.com/White%20Papers/asterisk_V1-4-
         11_performance.htm

   [47] K. Vos, K. Vandborg Sorensen, S. Skak Jensen, J. Spittka,
         "SILK", presentation at the 77th IETF meeting in the WG Codec,
         March 22, 2010, Anaheim, USA.
         http://tools.ietf.org/agenda/77/slides/codec-3.pdf

   [48] H. Vlad Balan, L. Eggert, S. Niccolini, M. Brunner, "An
         Experimental Evaluation of Voice Quality Over the Datagram
         Congestion Control Protocol," IEEE INFOCOM 2007. 26th IEEE
         International Conference on Computer Communications. pp. 2009-
         2017, 6-12 May 2007.

   [49] J. Wallerich, A. Feldmann, "Capturing the Variability of
         Internet Flows Across Time", Proceedings INFOCOM 2006. 25th
         IEEE International Conference on Computer Communications, 23-
         29 April 2006.

   [50] M. Westerlund, "How to Write an RTP Payload Format", work in
         progress, draft-ietf-avt-rtp-howto-06, Internet-draft,
         March 2, 2009.

   [51] Wikipedia contributors, "Bit rate", Wikipedia, The Free
         Encyclopedia, 10 October 2010, 20:00 UTC,
         http://en.wikipedia.org/w/index.php?title=Bit_rate&oldid=38993
         1944

   [52] Wikipedia contributors, "Cycle accurate simulator", Wikipedia,
         The Free Encyclopedia, 4 September 2010, 14:27 UTC,
         http://en.wikipedia.org/w/index.php?title=Cycle_accurate_simul
         ator&oldid=382876676





   [53] Wikipedia contributors, "Latency (engineering)", Wikipedia,
         The Free Encyclopedia, 15 October 2010, 23:54 UTC,
         http://en.wikipedia.org/w/index.php?title=Latency_(engineering
         )&oldid=390971153

   [54] Wikipedia contributors, "Profiling (computer programming)",
         Wikipedia, The Free Encyclopedia, 15 August 2010, 03:57 UTC,
         http://en.wikipedia.org/w/index.php?title=Profiling_(computer_
         programming)&oldid=378987422.

   [55] M. T. Yourst, "PTLsim: A cycle accurate full system x86-64
         microarchitectural simulator", in ISPASS '07, 2007, software
         available at http://www.ptlsim.org/.

14. Acknowledgments

   This document is based on many discussions with experts in the
   field of codec design, quality of experience, and quality
   management. My special thanks go to Michael Knappe, Sebastian
   Moeller, Raymond Chen, Jack Douglass, Paul Coverdale, Jean-Marc
   Valin, Koen Vos, Bilke Ullrich, and all active participants of the
   Codec WG mailing list. I would also like to express my appreciation
   to the members of the ITU-T study groups 12 and 16, with whom I had
   many fruitful discussions.

Authors' Addresses

   Christian Hoene
   Universitaet Tuebingen
   WSI-ICS
   Sand 13
   72076 Tuebingen
   Germany

   Phone: +49 7071 2970532
   Email: hoene@uni-tuebingen.de