Network Working Group T. Terriberry Internet-Draft Mozilla Corporation Intended status: Informational July 6, 2015 Expires: January 7, 2016 Overlapped Block Motion Compensation for NETVC draft-terriberry-netvc-obmc-00 Abstract This document proposes a scheme for overlapped block motion compensation that could be incorporated into a next-generation video codec. The scheme described is that currently used by Xiph's Daala project, which supports variable block sizes without introducing any discontinuities at block boundaries. This makes the scheme suitable for use with lapped transforms or other techniques where encoding such discontinuities is expensive. Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on January 7, 2016. Copyright Notice Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of Terriberry Expires January 7, 2016 [Page 1]

Internet-Draft Coding Tools July 2015 the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 2. Adaptive Subdivision OBMC . . . . . . . . . . . . . . . . . . 3 3. Implementation and Motion Estimation . . . . . . . . . . . . 7 3.1. Initial Estimates . . . . . . . . . . . . . . . . . . . . 10 3.2. Adaptive Subdivision . . . . . . . . . . . . . . . . . . 10 3.3. Iterative Refinement . . . . . . . . . . . . . . . . . . 12 3.3.1. Rate and Distortion Changes . . . . . . . . . . . . . 12 3.3.2. Complexity Reduction . . . . . . . . . . . . . . . . 13 3.3.3. Subpel Refinement . . . . . . . . . . . . . . . . . . 14 4. References . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.1. Informative References . . . . . . . . . . . . . . . . . 15 4.2. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 16 1. Introduction Most current video codecs still use straightforward Block Matching Algorithms (BMA) to perform motion compensation, despite their simplicity. These algorithms simply copy a block of pixels from a reference frame, possibly after applying a sub-pixel (subpel) filter to allow for increased motion resolution. When the motion vectors (MVs) of two adjacent blocks differ, a discontinuity is likely to be created along the block edge. These discontinuities are expensive to correct with transform stages that do not themselves have discontinuities along block edges, such as lapped transforms [I-D.egge-videocodec-tdlt]. Even with a more traditional block-based DCT as the transform stage, the creation of these discontinuities requires that some residual be coded to correct them (and to activate loop filtering along these edges) and requires that the size of a transform block used to code that residual be no larger than the size of a prediction block (or they will suffer the same efficiency problem as lapped transforms in correcting them) Overlapped Block Motion Compensation (OBMC) avoids discontinuities on the block edges by copying slightly larger blocks of pixels, and blending them together with those of neighboring blocks, in an overlapping fashion. Under the assumption that pixels in the reference frames are highly spatially correlated, this blending compensates for motion uncertainty at the pixels farthest from the estimated MVs. This improves the accuracy of the prediction near block edges, making the expected error more uniform across a block, and improving coding performance over a similar BMA scheme (with Terriberry Expires January 7, 2016 [Page 2]

Internet-Draft Coding Tools July 2015 fixed-size blocks) by 0.4 dB [KO95] to 1.0 dB [KK97], depending on the search strategy used. Non-overlapped BMA schemes that support varying the block size give much better compression than fixed-size schemes [Lee95]. Although previous formats such as Dirac use OBMC, it has always been with a (small) fixed blending window size. The size of a block might vary from, say, 4x4 to 16x16 pixels, with each block given a single MV, but the overlap with neighboring blocks remains fixed, limited by the smallest supported block size to, say, 2 pixels on either side of an edge (the exact sizes in Dirac are configurable, but do not vary within a frame). This is essentially equivalent to performing prediction for the entire frame at the smallest block size (4x4) with an efficient scheme for specifying that the same MV be used for many of the blocks. We propose a subdivision scheme for OBMC that supports adaptive blending window sizes, allowing much larger blending windows in blocks that are not subdivided, which previous research has suggested should improve prediction performance compared to fixed-size windows [ZAS98]. Our scheme uses simple window functions that can be computed on the fly, rather than stored in large tables, allowing it to scale to very large block sizes. It admits a dynamic programming algorithm to optimize the subdivision levels with a reasonable complexity. 2. Adaptive Subdivision OBMC Traditional BMA schemes and previous OBMC schemes have a one-to-one correspondence between blocks and MVs, with each block having a single MV. That MV is most reliable in the center of the block, where the prediction error is generally the smallest [ZSNKI02]. Instead, we use a grid of MVs at the corners of blocks. With a fixed-size grid, away from the edges of a frame, this difference is mostly academic: equivalent to shifting block boundaries by half the size of a block in each direction. However, with variable-sized blocks, the distinction becomes more relevant: there is no longer a one-to-one correspondence between blocks and MVs. Under the scheme where MVs correspond to the center of a block, splitting a block removes the old MV at the center of the old block and produces new MVs at the centers of the new blocks. Under the scheme where MVs belong to the corners, splitting a block retains the MVs at the existing corners (corresponding to the same motion as before), but may add new MVs at the new block corners. We use square blocks with an origin in the upper-left corner and x and y coordinates that vary between 0 and 1 within a block. The Terriberry Expires January 7, 2016 [Page 3]

Internet-Draft Coding Tools July 2015 vertices and edges of a block are indexed in a clockwise manner, as illustrated in Figure 1. mv[0] mv[1] \ x ---> / \ / 0---------------1 y | 0 | | | | | | v |3 1| mv[3] | mv[2] | \ | \ | \| 2 \| 3---------------2 Figure 1: Vertex and Edge Indices for a Block In a block with MVs at all four corners, we use normal bilinear weights to blend the predictions from each MV. The bilinear weights for each vertex, w[k], at a pixel location (x, y) are defined as w[0] = (1 - x)*(1 - y) w[1] = x*(1 - y) w[2] = x*y w[3] = (1 - x)*y Let "I" be the reference image, and for simplicity denote the predictor I[x + mv[k].x, y + mv[k].y] for the pixel at location (x, y) with motion vector mv[k] as simply I(mv[k]). In a regular grid, with unique motion vectors defined at all four corners of a block, we predict the interior of the block using I(mv[0])*w[0] + I(mv[1])*w[1] + I(mv[2])*w[2] + I(mv[3])*w[3] In order to extend OBMC to handle variable block sizes while maintaining continuity along the edges, we start by imposing the restriction that the size of two adjacent blocks differ by at most a factor of two, which greatly simplifies the problem. To do this, we borrow a data structure from the related graphics problem of surface simplification, the semi-regular 4-8 mesh [VG00]. This is normally used to represent subdivisions in a triangle mesh, but we use it for variable-sized quadrilaterals. Terriberry Expires January 7, 2016 [Page 4]

Internet-Draft Coding Tools July 2015 H 0---------------0 0---------------0 0-------2-------0 | | | : _/ | | H | | | | : __/ | | H | | | | :/ | | H | | | |.......1.......| =2=======1=======2= | | | __/: | | H | | | | _/ : | | H | | | | / : | | H | 0---------------0 0---------------0 0-------2-------0 H Level 0 Level 1 Level 2 H H 0-------2-------0 0---4---2---4---0 | :/ : \_: | | H : H | |...3...:...3...| =4===3===4===3===4= | / : : :\_ | | H : H | 2...:...1...:...2 2...4...1...4...2 | \_: : : / | | H : H | |...3...:...3...| =4===3===4===3===4= | :\_ : / : | | H : H | 0-------2-------0 0---4---2---4---0 H H Level 3 Level 4 The first four subdivision levels in a 4-8 mesh. Numbers indicate vertices with transmitted MVs. Diagonal lines (on odd levels) and double lines (on even levels) connect each new vertex to its two parents at the previous level (in some cases, this parent may lie in an adjacent block). Dotted lines indicate additional block boundaries. Figure 2: Subdivision Levels in a 4-8 Mesh Subdivision in a 4-8 mesh proceeds in two phases, as illustrated in Figure 2. In the first phase, a new vertex is added to the center of a quadrilateral. This subdivides the quadrilateral into four "quadrants", but does not add any aditional vertices to the edges. Such edges are called "unsplit". In the second phase, each of the quadrilateral's edges may be split and connected to the center vertex, forming four new quadrilaterals. One useful property of this two-phase subdivision is that the number of vertices in the mesh merely doubles during each phase, instead of quadrupling as it would under normal quadtree subdivision. This provides more fine-grained control over the number of MVs transmitted. Terriberry Expires January 7, 2016 [Page 5]

Internet-Draft Coding Tools July 2015 To ensure that the size of two adjacent blocks differs by no more than a factor of two, we assign every vertex two parents in the mesh, which are the two adjacent vertices from the immediately preceding subdivision level. A vertex may not be added to the mesh until both of its parents are present. That is, a level 2 vertex may not be added to an edge until the blocks on either side have had a level 1 vertex added, and a level 3 vertex may not be added to the center of a block until both of the level 2 vertices have been added to its corners, and so on. Therefore, we need only specify how to handle a block that has undergone phase one subdivision, but still has one or more unsplit edges, as illustrated in Figure 3. Such a block is divided into quadrants, and each is interpolated separately using a modified version of the previous bilinear weights. c mv[0] mv[1] 0------------------------------1 mv[0] | mv[0] # mv[1] : mv[0] mv[1] | |###############: | |###############: | |###############: | |###############: | |###############: | | mv[3] # mv[4] : mv[4] mv[5] | |...............4--------------5 | mv[0] mv[4] | | | | | | | | | | | | | | | | | mv[3] | mv[3] mv[6] | | 3---------------6--------------2 Figure 3: Interpolation Setup for Unsplit Edges The same two MVs are used along the unsplit edge(s) as before, but we shift some of the weight used for blending from the middle of the edge to the exterior corner. More precisely, the weights w[k] are replaced by modified weights s[k]. For example, if c is the index of the vertex in the exterior corner, (+) denotes addition modulo four, and c (+) 1 is the index of the corner bisecting the unsplit edge (the top edge in the figure), then s[c] = w[c] + 0.5*w[c (+) 1] s[c (+) 1] = 0.5*w[c (+) 1] Terriberry Expires January 7, 2016 [Page 6]

Internet-Draft Coding Tools July 2015 The remaining weights are unchanged. A similar modification is used if it is c (+) 3 that lies on the unsplit edge. The modifications are cumulative. That is, if both c (+) 1 and c (+) 3 lie on unsplit edges (as in the hashed region in Figure 3), s[c] = w[c] + 0.5*w[c (+) 1] + 0.5*w[c (+) 3] s[c (+) 1] = 0.5*w[c (+) 1] s[c (+) 3] = 0.5*w[c (+) 3] s[c (+) 2] = w[c (+) 2] This defintiion of the blending weights clearly matches an adjacent block along an unsplit edge, regardless of whether or not that block has been split. Careful examination will verify that it also matches other quadrants along the interior edges. Each weight can be evaluated with finite differences at the cost of one add per pixel, plus setup overhead. The blending can be done with three multiplies per pixel by taking advantage of the fact that the weights sum to one, just as with regular bilinear interpolation. The mesh itself may require more vertices than an unconstrained mesh to achieve a given level of subdivision in a local area, but requires fewer bits to encode the subdivision itself, simply because there are fewer admissable meshes. As long as a (0, 0) MV residual can be efficiently encoded, the worst-case rate of the 4-8 mesh should be close to that of a similar, unconstrained mesh. This process can be extended to handle blocks that differ by more than one level of subdivision, so long as the edge between them remains entirely unsplit. For example, to handle block sizes that differ by a factor of four, instead of shifting half the blending weight from one vertex to the other, one simply needs to shift 1/4, 1/2, or 3/4 of the weight, depending on the location of the block along the unsplit edge. However, the 4-8 mesh is no longer suitable for describing which vertices can appear in the mesh, and some modifications ot the adaptive subdivision algorithm in Section 3.2 are required. We have not yet implemented these extensions. 3. Implementation and Motion Estimation The algorithms in Section 2 have been implemented in the Daala video codec [Daala-website]. We use them to produce a complete "motion compensated reference frame", which is then lapped and transformed (in both the encoder and decoder) [I-D.egge-videocodec-tdlt] to make it available as a frequency-domain predictor for the transform stage [I-D.valin-netvc-pvq]. The full source code, including all of the OBMC work described in this draft is available in the project git repository at [1]. Terriberry Expires January 7, 2016 [Page 7]

Internet-Draft Coding Tools July 2015 Luma blocks are square with sizes ranging from 32x32 to 4x4. The corners of the MV blocks are aligned with the corners of the transform blocks. An earlier design had the MV blocks offset from the transform blocks, so that MVs remained in the center of the transform blocks at the coarsest level, with an extra ring of implicit (0, 0) MVs around the frame (to keep the minimum number of transmitted MVs the same as with BMA). However, we found that there was essentially no performance difference between the two approaches (see commit 461310929fc5). Some things are simpler with the current approach (all of the special cases for the implicit (0, 0) MVs go away), and some things are more complicated, but most of the complications are confined to the computation of MV predictors. The encoder performs rate-distortion (R-D) optimization during motion estimation to balance the prediction error (D) attainable against the number of bits required to achieve it (R), e.g., minimizing J = D + lambda*R The value of lambda is obtained directly from the target quantizer setting. We use the Sum of Absolute Differences (SAD) in the luma plane as our distortion metric for the first three stage of the search, and use the Sum of Absolute Transformed Difference (SATD) during a final subpel refinement stage (with an appropriate adjustment to lambda). We approximate the rate of small MV components (with a magnitude less than 3 after subtracting a predictor) using statistics from the previous frame, plus one sign bit for each non-zero component. Larger MV components have an additional non-adaptive rate cost added that increases logarithmically with the MV magnitude. The rate term for each vertex also includes a bit for each flag that indicates the presence of a child (2 per vertex on average). The real motion information is adaptively arithmetic encoded [I-D.terriberry-netvc-codingtools], but these approximations avoid having to update the rate term for every MV every time a single MV is changed. We use a median-of-four predictor for almost every MV, as illustrated in Figure 4. The middle two values of each MV component are averaged, rounding towards even values. There are two exceptions. If an MV required for prediction lies outside the frame, a (0, 0) MV is substituted in its place. If an MV required for prediction lies inside a 32x32 block that comes after the current one in raster order, then that MV is ignored and we use the median of the three remaining MVs. This occurs when predicting MVs on even levels that lie on the right or bottom edges of a 32x32 block. MVs on the top Terriberry Expires January 7, 2016 [Page 8]

Internet-Draft Coding Tools July 2015 and left edges of the frame are considered to belong to the 32x32 block below or to the right, respectively (that is, the corresponding MV in that block is not ignored). | | | | | | ----O---------------O--------------O |\__ H __/| | \_ H _/ | | \_ H _/ | | \_ H _/ | | \__ H __/ | | \ H / | | _|VL | ----O==============>X--------------+ Level 0 H O-------+-------O +-------O-------+ | \_ | _/ | | H | | \__ | __/ | | H | | \|/ | | H | +-------X-------+ =O=======X=======O= | __/|\__ | | H | | _/ | \_ | | H | | / | \ | | H | O-------+-------O +-------O-------+ H Odd Levels Even Levels Key: X - MV being predicted O - MV used for prediction. Except at level 0, these are all ancestors of the MV being predicted, and thus are required to be present. + - MV grid point not used for prediction (might not be coded) Figure 4: The Predictors Used for MVs at Each Level The current bitstream encodes MVs level-by-level for the entire frame. It is expected at some point that this will be migrated to code all the MVs for a single 32x32 block at a time. This is the reason for excluding predictors outside of the current 32x32 block. The number of combinations of subdivision levels and MVs available make finding a globally optimal set of parameters impractical. The problem of finding optimal subdivision levels alone is known to be NP-hard [AS94]. The estimation procedure outlined below attempts to Terriberry Expires January 7, 2016 [Page 9]

Internet-Draft Coding Tools July 2015 balance speed with compression performance, though it could certainly be improved with future research. 3.1. Initial Estimates First we produce an initial MV estimate at each point in the fully- subdivided grid using BMA. We compute several MV candidates from spatial and temporal neighbors and assuming constant speed or acceleration. The candidates are grouped into sets by reliability and the search terminates early if the best candidate from a set has a SAD below a dynamically chosen threshold. Otherwise, a local gradient search is performed using a square pattern around the best candidate vector. The thresholds ensure extra computation time is spent only on blocks whose predictor can be reasonably expected to improve. Although we look solely at SAD to determine whether to continue the search, the candidates themselves are ranked using the full R-D cost metric, J. Level 0 searches using (non-overlapped) 32x32 blocks centered on the corresponding grid points, while the next two levels use 16x16 blocks, the next two levels 8x8, and so on. MVs are estimated from the coarsest levels to the finest, to allow for the accurate computation of MV predictors used in the rate estimates. As the mesh is subdivided, existing grid points do not have their MVs re- estimated with smaller block sizes, even though the area that those MVs would influence in a grid subdivided to that level is reduced. All MVs are estimated only up to whole-pel accuracy at this stage. 3.2. Adaptive Subdivision The second stage of motion estimation fixes the mesh subdivision. During this stage, the SAD for each block is computed using full OBMC instead of BMA. The MVs produced in the previous stage are held fixed in this one. Only the mesh subdivision level changes. The extra subdivision required to add a vertex to the 4-8 mesh is similar to the implicit subdivision used by Zhang et al. in their variable block size OBMC scheme [ZAS98]. The difference is that we optimize over and encode such subdivision explicitly. We use a global R-D optimization strategy with general mesh decimations, as proposed by Balmeli [Bal01]. This is a greedy approach that starts with a full mesh and successively decimates vertices. Restricting decimation candidates to the leaves of the mesh can frequently produce sequences where decimating a MV (reducing rate) causes distortion to go _down_, clearly indicating that the previous rate allocation was not optimal. General mesh decimations, on the other hand, allow any MV to be removed at a given step, not just the leaves. If a non-leaf is decimated, all of its children are Terriberry Expires January 7, 2016 [Page 10]

Internet-Draft Coding Tools July 2015 decimated as well. This helps smooth out non-monotonicities in the distortion measure during the decimation process, especially at low rates The following notation is used to describe the algorithm. The current mesh is denoted by M, and M_v is the "merging domain" of v in M: the set of vertices in M that must be removed to remove v. This includes v and all of its undecimated children. Additionally, the variation dU(M_v) contains the pairs (dD(M_v), dR(M_v)): the change in distortion (SAD) and rate (bits) caused by removing M_v from M. We also refer to the change in SAD in a single block b caused by removing a single vertex v as dD_b(v). Finall, A_v is the set of ancestors of v in M. Some minor additions to Balmelli's original algorithm are made to handle the fact that distortion is measured over squares, not triangles. The steps of the algorithm are: 1. For all v, compute dU(M_v). 2. Do (a) Let v* be the value of v in M for which -dD(M_v)/dR(M_v) is the smallest. (b) If -dD(M_v*)/dR(M_v*) > lambda, stop. (c) For all w in M_v*, sorted by depth from deepest to shallowest: i. For all a in A_w, subtract dU(M_w) from dU(M_a). ii. Remove w from the mesh. iii. If w was on an even level, then for each adjacent block b with a w' in M such that w' lies on the same level as w: A. Let d be change in dD_b(w') before and after decimating w. B. For all w in {w'} U A_w' \ A_w, add d to dD(M_a). These steps ensure that dU(M_v) contains the up-to-date changes in the global rate and distortion after each merging domain is decimated. This update process properly accounts for overlapping merging domains due to an inclusion-exclusion principle. See Balmelli for details [Bal01]. Step 2(c)iii handles the case of decimating one corner of a block, w, when the opposite corner, w', remains. This changes dD_b(w'), the cost of decimating the opposite Terriberry Expires January 7, 2016 [Page 11]

Internet-Draft Coding Tools July 2015 corner, and that change must be propagated to each merging domain to which w' belongs. No change needs to be made to the common ancestors of w and w' however: once dD(M_w') is updated, the normal update process that will be executed when w' is decimated is sufficient. The addition of these extra steps does not affect the computational complexity of the algorithm which is Theta(n log n), where n is the size of the initial mesh. The distortion measurements needed to initialize and update dU(M_v) can be computed once, in advance, by computing the SAD value of each block in all sizes and with all possible combinations of unsplit edges. All told, each pixel in the image is used in exactly 13 SAD computations (one for the largest block size, with no unsplit edges, and for for each additional block size). Also, since the mesh only undergoes six levels of subdivision, there are only a small number of unique merging domains and ancestor sets. These can be computed offline and stored in tables to simplify the decimation process. To compute the set difference A_w' \ A_w, we note that w and w' share a single common parent, p. The common ancestors of w and w' are now formed by the set {p} U A_p, meaning one can add d to the nodes in A_w' and then subtract it from the nodes in {p} U A_p to effect the set difference in Step 2(c)iiiB. Alternatively, one could simply use a larger set of lookup tables. 3.3. Iterative Refinement The next stage uses the iterated dynamic programming (DP) proposed by Chen and Willson to refine the MVs, accounting for their interdependencies [CW00]. In this scheme, a single row (resp. column) of MVs is optimized at a time using a Viterbi trellis [For73], while the rest remain fixed. If there is no direct block edge between two consecutive MVs in a row (resp. column) then the trellis stops, and a new one is started. This continues until the entire row (resp. column) has been examined. The process is then repeated until the total change in Lagrangian cost, J, falls below a given threshold. 3.3.1. Rate and Distortion Changes We use the change in rate and distortion to compute the cost of each path in the trellis. A single MV can influence the distortion of as many as 12 neighboring blocks. Only the ones to the left (resp. above) are added to the current cost of each path. When the following MV is chosen, an additional 2 to 8 blocks may be added. If necessary, the blocks to the right (resp. below) are added after the last MV in the trellis. Terriberry Expires January 7, 2016 [Page 12]

Internet-Draft Coding Tools July 2015 Unfortunately, the rate of a MV depends on the values of the MVs used to predict it. Chen and Willson assume MVs use 1-D differential coding, as in MPEG-1. With our prediction scheme, several (not necessarily consecutive) MVs on the DP path may be used to predict a given MV, and the corresponding change in rate is not known until a MV has been chosen for all of them. If we were to consider all possible combinations of candidates for the predictors, the number of trellis edges would increase by several orders of magnitude. This seems excessively wasteful, since as long as the changes to the MVs are small, the median operation ensures only one or two of them are likely to have any influence on the predicted value in the first place. Instead, we immediately compute the rate change in each predicted vector---excluding those that themselves lie further along the DP path, since we do not yet know what MV will be encoded. We do this assuming all MVs not already considered by the DP remain fixed, and add the change to the cost of the current path. If changing a subsequent MV on the path causes the rate of one of these predicted MVs to change again, the new rate change is used from then on. Because we essentially discard a large number of trellis states of limited utility, we might theoretically discard the path that does not change any MVs, even though its true cost is lower than the ones we keep. Thus, as a safety precaution, we check the final cost of the best path, and do not apply it if it is greater than zero. This does occur in practice, but very rarely. Other, simpler alternatives to this approach are also possible. For example, we tried only considering rate changes for MVs on the actual DP path, which is much like Chen and Willson's approach. However, on frames with complex motion, we have seen dramatic improvements in visual quality and motion field smoothness by properly accounting for all rate changes. This is because a level 0 MV, for example, may be used to predict up to 24 other MVs, only 8 of which lie on a given DP path. In a dense mesh, the rate changes off the path may dominate the ones on it. 3.3.2. Complexity Reduction Chen and Willson showed that using a logarithmic search instead of an exhaustive one for the DP resulted in an average PSNR loss of only 0.05 dB and an average MV bitrate increase of 55 bits per frame. We take an even more aggressive approach, and replace the logarithmic search with a diamond search. Because the complexity of a given DP chain increases quadratically in the number of MV candidates at each node, reducing the candidate count can give a substantial performance boost. Terriberry Expires January 7, 2016 [Page 13]

Internet-Draft Coding Tools July 2015 The logarithmic search uses a 9-candidate square pattern in each stage. The displacements used in the pattern shrink by a factor of two in each phase. Chen and Willson used a 3-phase search to achieve an effective search radius of +/- 7x7 pixels. In our framework, this requires examining 3x9**2 = 243 trellis edges per MV for one full iteration. However, the large displacement patterns only alter a small number of MVs that have usually been grossly mis-estimated. They are even likely to cause further mis-estimation in areas with repeated structure or lack of texture, which makes further refinement less effective. One alternative is to simply discard them, and only perform the square pattern search with one-pixel displacements. The chance of being trapped in a local minimum is increased, but three times as many iterations can be performed in the same amount of time. On the other hand, a small diamond search pattern has only 5 candidates, making it even more attractive. This allows more than nine times as many iterations as the full logarithmic search in the same amount of time. We support using the diamond search pattern, the square search pattern, and the square search pattern with logarithmic step sizes with various complexity settings, but by default run with only the diamond pattern with a single step size. The logarithmic square pattern search saves between 0.6% and 1.2% rate on metrics, but adds 50% to the total encode time. The computational complexity of this iterative refinement is still relatively high. In a single iteration, each of the four edges of a block are traversed exactly once by a DP path, during which its SAD is evaluted 25 times, for a total of 100 SAD calculations per block. This is nearly as many as full search BMA with a +/- 6x6 window, and computing our blended predictors already has higher complexity. Thus it is not suitable for a real-time implementation, but it can easily be disabled, or even lighter-weight versions designed. 3.3.3. Subpel Refinement The same samll diamond-pattern search can be used to refine the motion vectors to subpel precision. A square pattern is also supported at the highest complexity level, and saves an additional 0.5% to 0.7% on bitrate, but is half the speed of the default settings. Our implementation supports half-, quarter-, or eigth-pel resolution MVs. First, the DP process is iterated with the small diamond and half-pel displacements until the change in Lagrangian cost, J, for an iteration falls below a given threshold. Finer resolutions are only used if they provide an overall R-D benefit, which is tested on a frame-by-frame basis. First, iteration is done with quarter-pel displacements, followed, if successful, by Terriberry Expires January 7, 2016 [Page 14]

Internet-Draft Coding Tools July 2015 eigth-pel. If the decrease in SAD from the finer resolution MVs cannot balance the (approximately) 2 bit per MV increase in bitrate, then the coarser vectors are used instead. Subpel interpolation is performed using a separable 6-tap polyphase filter bank. Only eight filters are currently used, one for each subpel offset at eigth-pel resolution. If chroma is decimated (for 4:2:0 video) and eigth-pel MVs are used, then the MV is divided by two and rounded to the nearest even value to select an appropriate subpel filter. 4. References 4.1. Informative References [I-D.egge-videocodec-tdlt] Egge, N. and T. Terriberry, "Time Domain Lapped Transforms for Video Coding", draft-egge-videocodec-tdlt-01 (work in progress), March 2015. [I-D.terriberry-netvc-codingtools] Terriberry, T., "Coding Tools for a Next Generation Video Codec", draft-terriberry-netvc-codingtools-00 (work in progress), June 2015. [I-D.valin-netvc-pvq] Valin, J., "Pyramid Vector Quantization for Video Coding", draft-valin-netvc-pvq-00 (work in progress), June 2015. [AS94] Agarwal, P. and S. Suri, "Surface Approximation and Geometric Partitions", Proc. of the 5th ACM-SIAM Symposium on Discrete Algorithms (SODA'94) pp. 24--33, January 1994. [Bal01] Balmelli, L., "Rate-Distortion Optimal Mesh Simplification for Communications", PhD thesis, Ecole Polytechnique Federale de Lausanne, Switzerland, 2001. [CW00] Chen, M. and A. Willson, "Motion-Vector Optimization of Control Grid Interpolation and Overlapped Block Motion Compensation Using Iterated Dynamic Programming", IEEE Transactions on Image Processing 9(7):1145--1157, July 2000. [For73] Forney, G., "The Viterbi Algorithm", Proceedings of the IEEE 61(3):268--278, March 1973. Terriberry Expires January 7, 2016 [Page 15]

Internet-Draft Coding Tools July 2015 [KO95] Katto, J. and M. Ohta, "An Analytical Framework for Overlapped Motion Compensation", Proc. of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP'95) vol. 4, pp. 2189--2192, May 1995. [KK97] Kuo, T. and C. Kuo, "A Hybrid BMC/OBMC Motion Compensation Scheme", Proc. of the International Conference on Image Processing (ICIP'97) vol. 2, pp. 795--798, October 1997. [Lee95] Lee, J., "Optimal Quadtree for Variable Block Size Motion Estimation", Proc. of the IEEE International Conference on Image Processing vol. 3, pp. 480--483, October 1995. [VG00] Velho, L. and J. Gomes, "Variable Resolution 4-k Meshes: Concepts and Applications", Computer Graphics Forum 19(4):195--212, December 2000. [ZAS98] Zhang, J., Ahamd, M., and M. Swamy, "New Windowing Techinques for Variable-Size Block Motion Compensation", IEE Proceedings---Vision, Image, and Signal Processing 145(6):399--407, December 1998. [ZSNKI02] Zhen, W., Shishikui, Y., Naemure, M., Kanatsugu, Y., and S. Itoh, "Analysis of Space-Dependent Characteristics of Motion-Compensated Frame Differences Based on a Statistical Motion Distribution Model", IEEE Transactions on Image Processing 11(4):377--386, April 2002. [Daala-website] "Daala website", Xiph.Org Foundation , <https://xiph.org/ daala/>. 4.2. URIs [1] https://git.xiph.org/daala.git Author's Address Timothy B. Terriberry Mozilla Corporation 331 E. Evelyn Avenue Mountain View, CA 94041 USA Phone: +1 650 903-0800 Email: tterribe@xiph.org Terriberry Expires January 7, 2016 [Page 16]