RFC 6716: Definition of the Opus Audio Codec (Translation, Part 1)

1. Introduction

The Opus codec is a real-time interactive audio codec designed to meet the requirements described in [REQUIREMENTS]. It is composed of a layer based on Linear Prediction (LP) [LPC] and a layer based on the Modified Discrete Cosine Transform (MDCT) [MDCT]. The main idea behind using two layers is as follows: in speech, linear prediction techniques (such as Code-Excited Linear Prediction, or CELP) code low frequencies more efficiently than transform (e.g., MDCT) domain techniques, while the situation is reversed for music and higher speech frequencies. Thus, a codec with both layers available can operate over a wider range than either one alone and can achieve better quality by combining them than by using either one individually.

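The Opus codec is a real-time interactive audio codec designed to meet the requirements described in the [REQUIREMENTS] document. It consists of two technology layers: one based on Linear Prediction (LP) [LPC], the other based on the Modified Discrete Cosine Transform (MDCT) [MDCT].

The main idea behind this two-layer structure is as follows: for speech signals, linear prediction techniques (such as Code-Excited Linear Prediction, CELP) code the low frequencies more efficiently than transform-domain techniques (such as the MDCT), whereas for music and the higher frequencies of speech the situation is exactly reversed.

A codec that has both layers available can therefore operate over a wider range than one that uses either layer alone, and by combining the two it can achieve better audio quality than either layer used individually.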

The primary normative part of this specification is provided by the source code in Appendix A. Only the decoder portion of this software is normative, though a significant amount of code is shared by both the encoder and decoder. Section 6 provides a decoder conformance test. The decoder contains a great deal of integer and fixed-point arithmetic that needs to be performed exactly, including all rounding considerations, so any useful specification requires domain-specific symbolic language to adequately define these operations. Additionally, any conflict between the symbolic representation and the included reference implementation must be resolved. For the practical reasons of compatibility and testability, it would be advantageous to give the reference implementation priority in any disagreement. The C language is also one of the most widely understood, human-readable symbolic representations for machine behavior. For these reasons, this RFC uses the reference implementation as the sole symbolic representation of the codec.

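The primary normative part of this specification is the source code in Appendix A. Only the decoder portion of that software is normative, although the encoder and decoder share a large amount of code. Section 6 provides a decoder conformance test.

The decoder contains a large amount of integer and fixed-point arithmetic that must be performed exactly (including all rounding considerations). Any useful specification therefore needs a domain-specific symbolic language to define these operations adequately.

In addition, any conflict between the symbolic representation and the included reference implementation must be resolved. For practical reasons of compatibility and testability, it is advantageous to give the reference implementation priority in any disagreement. The C language is also one of the most widely understood, human-readable symbolic representations of machine behavior. For these reasons, this RFC uses the reference implementation as the sole symbolic representation of the codec.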

While the symbolic representation is unambiguous and complete, it is not always the easiest way to understand the codec's operation. For this reason, this document also describes significant parts of the codec in prose and takes the opportunity to explain the rationale behind many of the more surprising elements of the design. These descriptions are intended to be accurate and informative, but the limitations of common English sometimes result in ambiguity, so it is expected that the reader will always read them alongside the symbolic representation. Numerous references to the implementation are provided for this purpose. The descriptions sometimes differ from the reference in ordering or through mathematical simplification wherever such deviation makes an explanation easier to understand. For example, the right shift and left shift operations in the reference implementation are often described using division and multiplication in the text. In general, the text is focused on the "what" and "why" while the symbolic representation most clearly provides the "how".

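Although the symbolic representation (that is, the code) is unambiguous and complete, it is not always the easiest way to understand how the codec works. For this reason, this document also describes the important parts of the codec in prose, and takes the opportunity to explain the rationale behind many of the more surprising elements of the design.

These descriptions aim to be accurate and informative, but the limitations of ordinary English sometimes produce ambiguity, so readers are expected to read them alongside the symbolic representation at all times. Numerous references to the implementation are provided for this purpose.

In some cases, where it makes an explanation easier to understand, the prose differs from the reference implementation in ordering or through mathematical simplification. For example, the right shift and left shift operations in the reference implementation are often described in the text as division and multiplication.

In short, the text focuses on the "what" and the "why", while the symbolic representation most clearly provides the "how".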

1.1.1. min(x,y)

The smallest of two values x and y.

1.1.2. max(x,y)

The largest of two values x and y.

1.1.3. clamp(lo,x,hi)

                 clamp(lo,x,hi) = max(lo,min(x,hi))

With this definition, if lo > hi, then lo is returned.

1.1.4. sign(x)

The sign of x, i.e.,

                                ( -1,  x < 0
                      sign(x) = <  0,  x == 0
                                (  1,  x > 0

1.1.5. abs(x)

The absolute value of x, i.e.,

                         abs(x) = sign(x)*x

1.1.6. floor(f)

The largest integer z such that z <= f.

1.1.7. ceil(f)

The smallest integer z such that z >= f.

1.1.8. round(f)

The integer z nearest to f, with ties rounded towards negative infinity, i.e.,

                       round(f) = ceil(f - 0.5)

1.1.9. log2(f)

The base-two logarithm of f.

1.1.10. ilog(n)

The minimum number of bits required to store a positive integer n in binary, or 0 for a non-positive integer n.

                          ( 0,                 n <= 0
                ilog(n) = <
                          ( floor(log2(n))+1,  n > 0

Examples:

o ilog(-1) = 0

o ilog(0) = 0

o ilog(1) = 1

o ilog(2) = 2

o ilog(3) = 2

o ilog(4) = 3

o ilog(7) = 3
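The notation defined in Section 1.1 maps directly onto a few short functions. The following is a minimal Python sketch of those definitions, not the reference implementation (which is written in C). The name `rfc_round` is chosen here only to avoid shadowing Python's built-in `round`, which rounds ties to the nearest even integer rather than toward negative infinity.

```python
import math


def clamp(lo, x, hi):
    """clamp(lo,x,hi) = max(lo,min(x,hi)); yields lo whenever lo > hi."""
    return max(lo, min(x, hi))


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rfc_round(f):
    """Nearest integer to f, ties rounded toward negative infinity."""
    return math.ceil(f - 0.5)


def ilog(n):
    """Bits needed to store a positive integer n; 0 for non-positive n."""
    # For n > 0, int.bit_length() equals floor(log2(n)) + 1.
    return n.bit_length() if n > 0 else 0
```

With these definitions, `ilog` reproduces the examples listed above, and `rfc_round(0.5)` gives 0 where a round-half-up rule would give 1.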

2. Opus Codec Overview

The Opus codec scales from 6 kbit/s narrowband mono speech to 510 kbit/s fullband stereo music, with algorithmic delays ranging from 5 ms to 65.2 ms. At any given time, either the LP layer, the MDCT layer, or both, may be active. It can seamlessly switch between all of its various operating modes, giving it a great deal of flexibility to adapt to varying content and network conditions without renegotiating the current session. The codec allows input and output of various audio bandwidths, defined as follows:

   +----------------------+-----------------+-------------------------+
   | Abbreviation         | Audio Bandwidth | Sample Rate (Effective) |
   +----------------------+-----------------+-------------------------+
   | NB (narrowband)      |           4 kHz |                   8 kHz |
   |                      |                 |                         |
   | MB (medium-band)     |           6 kHz |                  12 kHz |
   |                      |                 |                         |
   | WB (wideband)        |           8 kHz |                  16 kHz |
   |                      |                 |                         |
   | SWB (super-wideband) |          12 kHz |                  24 kHz |
   |                      |                 |                         |
   | FB (fullband)        |      20 kHz (*) |                  48 kHz |
   +----------------------+-----------------+-------------------------+
                                  Table 1

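Opus scales from 6 kbit/s narrowband mono speech up to 510 kbit/s fullband stereo music, with algorithmic delays ranging from 5 ms to 65.2 ms.

At any given moment, the LP layer, the MDCT layer, or both may be active. The codec can switch seamlessly between its operating modes, which gives it great flexibility to adapt to changing content and network conditions without renegotiating the current session. It supports input and output at several audio bandwidths.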

(*) Although the sampling theorem allows a bandwidth as large as half the sampling rate, Opus never codes audio above 20 kHz, as that is the generally accepted upper limit of human hearing.

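Although the sampling theorem allows a bandwidth of up to half the sampling rate, Opus never codes audio above 20 kHz, because that is generally accepted as the upper limit of human hearing.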

Opus defines super-wideband (SWB) with an effective sample rate of 24 kHz, unlike some other audio coding standards that use 32 kHz. This was chosen for a number of reasons. The band layout in the MDCT layer naturally allows skipping coefficients for frequencies over 12 kHz, but does not allow cleanly dropping just those frequencies over 16 kHz. A sample rate of 24 kHz also makes resampling in the MDCT layer easier, as 24 evenly divides 48, and when 24 kHz is sufficient, it can save computation in other processing, such as Acoustic Echo Cancellation (AEC). Experimental changes to the band layout to allow a 16 kHz cutoff (32 kHz effective sample rate) showed potential quality degradations at other sample rates, and, at typical bitrates, the number of bits saved by using such a cutoff instead of coding in fullband (FB) mode is very small. Therefore, if an application wishes to process a signal sampled at 32 kHz, it should just use FB.

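Opus defines super-wideband (SWB) as an effective sample rate of 24 kHz, unlike some other audio coding standards that use 32 kHz. Several considerations led to this choice.

The band layout in the MDCT layer naturally supports skipping the coefficients for frequencies above 12 kHz, but it cannot cleanly drop only the frequencies above 16 kHz. A 24 kHz sample rate also makes resampling in the MDCT layer easier, because 24 divides 48 evenly. And when 24 kHz is sufficient, it saves computation in other processing, such as Acoustic Echo Cancellation (AEC).

Experimental changes to the band layout to support a 16 kHz cutoff (a 32 kHz effective sample rate) showed potential quality degradation at other sample rates. Moreover, at typical bitrates, such a cutoff saves very few bits compared with coding in fullband (FB) mode.

Therefore, an application that wants to process a signal sampled at 32 kHz should simply use FB mode.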

The LP layer is based on the SILK codec [SILK]. It supports NB, MB, or WB audio and frame sizes from 10 ms to 60 ms, and requires an additional 5 ms look-ahead for noise shaping estimation. A small additional delay (up to 1.5 ms) may be required for sampling rate conversion. Like Vorbis [VORBIS-WEBSITE] and many other modern codecs, SILK is inherently designed for variable bitrate (VBR) coding, though the encoder can also produce constant bitrate (CBR) streams. The version of SILK used in Opus is substantially modified from, and not compatible with, the stand-alone SILK codec previously deployed by Skype. This document does not serve to define that format, but those interested in the original SILK codec should see [SILK] instead.

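The LP layer is based on the SILK codec [SILK]. It supports narrowband (NB), medium-band (MB), or wideband (WB) audio with frame sizes from 10 ms to 60 ms, and it needs an additional 5 ms of look-ahead for noise shaping estimation. Sampling rate conversion may require a small additional delay (up to 1.5 ms).

Like Vorbis [VORBIS-WEBSITE] and many other modern codecs, SILK is inherently designed for variable bitrate (VBR) coding, although its encoder can also produce constant bitrate (CBR) streams.

The version of SILK used in Opus has been substantially modified from the stand-alone SILK codec previously deployed by Skype, and the two are not compatible. This document does not define that stand-alone format; readers interested in the original SILK codec should consult [SILK] instead.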

The MDCT layer is based on the Constrained-Energy Lapped Transform (CELT) codec [CELT]. It supports NB, WB, SWB, or FB audio and frame sizes from 2.5 ms to 20 ms, and requires an additional 2.5 ms look-ahead due to the overlapping MDCT windows. The CELT codec is inherently designed for CBR coding, but unlike many CBR codecs, it is not limited to a set of predetermined rates. It internally allocates bits to exactly fill any given target budget, and an encoder can produce a VBR stream by varying the target on a per-frame basis. The MDCT layer is not used for speech when the audio bandwidth is WB or less, as it is not useful there. On the other hand, non-speech signals are not always adequately coded using linear prediction. Therefore, the MDCT layer should be used for music signals.

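The MDCT layer is based on the Constrained-Energy Lapped Transform (CELT) codec [CELT]. It supports narrowband (NB), wideband (WB), super-wideband (SWB), or fullband (FB) audio with frame sizes from 2.5 ms to 20 ms, and because of the overlapping MDCT windows it needs an additional 2.5 ms of look-ahead.

The CELT codec is inherently designed for constant bitrate (CBR) coding, but unlike many CBR codecs it is not restricted to a set of predetermined rates. It allocates bits internally so as to fill any given target budget exactly, and an encoder can produce a variable bitrate (VBR) stream by changing the target from frame to frame.

When the audio bandwidth is wideband (WB) or lower, the MDCT layer is not used for speech, because it is of little use there. On the other hand, linear prediction does not always code non-speech signals adequately. The MDCT layer should therefore be used for music signals.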

A "Hybrid" mode allows the use of both layers simultaneously with a frame size of 10 or 20 ms and an SWB or FB audio bandwidth. The LP layer codes the low frequencies by resampling the signal down to WB. The MDCT layer follows, coding the high frequency portion of the signal. The cutoff between the two lies at 8 kHz, the maximum WB audio bandwidth. In the MDCT layer, all bands below 8 kHz are discarded, so there is no coding redundancy between the two layers.

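The "Hybrid" mode allows both layers to be used at the same time, with a frame size of 10 or 20 ms and a super-wideband (SWB) or fullband (FB) audio bandwidth.

The LP layer codes the low frequencies by resampling the signal down to wideband (WB). The MDCT layer then codes the high-frequency portion of the signal. The boundary between the two lies at 8 kHz, the maximum audio bandwidth of WB. In the MDCT layer, all bands below 8 kHz are discarded, so there is no coding redundancy between the two layers.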

The sample rate (in contrast to the actual audio bandwidth) can be chosen independently on the encoder and decoder side, e.g., a fullband signal can be decoded as wideband, or vice versa. This approach ensures a sender and receiver can always interoperate, regardless of the capabilities of their actual audio hardware. Internally, the LP layer always operates at a sample rate of twice the audio bandwidth, up to a maximum of 16 kHz, which it continues to use for SWB and FB. The decoder simply resamples its output to support different sample rates. The MDCT layer always operates internally at a sample rate of 48 kHz. Since all the supported sample rates evenly divide this rate, and since the decoder may easily zero out the high frequency portion of the spectrum in the frequency domain, it can simply decimate the MDCT layer output to achieve the other supported sample rates very cheaply.

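The sample rate (as opposed to the actual audio bandwidth) can be chosen independently on the encoder side and the decoder side. For example, a fullband signal can be decoded as wideband, and vice versa. This ensures that a sender and a receiver can always interoperate, whatever the capabilities of their respective audio hardware.

Internally, the LP layer always runs at a sample rate of twice the audio bandwidth, up to a maximum of 16 kHz, and it keeps using that rate for SWB and FB. The decoder simply resamples its output to support other sample rates. The MDCT layer always runs internally at 48 kHz. Because every supported sample rate divides this rate evenly, and because the decoder can easily zero out the high-frequency part of the spectrum in the frequency domain, it only needs to decimate the MDCT layer output to obtain the other supported sample rates at very low cost.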

After conversion to the common, desired output sample rate, the decoder simply adds the output from the two layers together. To compensate for the different look-ahead required by each layer, the CELT encoder input is delayed by an additional 2.7 ms. This ensures that low frequencies and high frequencies arrive at the same time. This extra delay may be reduced by an encoder by using less look-ahead for noise shaping or using a simpler resampler in the LP layer, but this will reduce quality. However, the base 2.5 ms look-ahead in the CELT layer cannot be reduced in the encoder because it is needed for the MDCT overlap, whose size is fixed by the decoder.

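Once the outputs have been converted to the common, desired output sample rate, the decoder simply adds the outputs of the two layers together. To compensate for the different look-ahead each layer needs, the CELT encoder input is delayed by an additional 2.7 ms. This ensures that the low and high frequencies arrive at the same time.

An encoder can shorten this extra delay by using less look-ahead for noise shaping, or by using a simpler resampler in the LP layer, but doing so reduces quality. The base 2.5 ms look-ahead in the CELT layer, however, cannot be reduced on the encoder side, because it is required for the MDCT overlap, whose size is fixed by the decoder.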

Both layers use the same entropy coder, avoiding any waste from "padding bits" between them. The hybrid approach makes it easy to support both CBR and VBR coding. Although the LP layer is VBR, the bit allocation of the MDCT layer can produce a final stream that is CBR by using all the bits left unused by the LP layer.

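Both layers use the same entropy coder, which avoids any waste from "padding bits" between them. The hybrid approach makes it easy to support both constant bitrate (CBR) and variable bitrate (VBR) coding. Although the LP layer is VBR, the bit allocation of the MDCT layer can use all the bits the LP layer leaves unused, so that the final stream is CBR.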

2.1. Control Parameters

The Opus codec includes a number of control parameters that can be changed dynamically during regular operation of the codec, without interrupting the audio stream from the encoder to the decoder. These parameters only affect the encoder since any impact they have on the bitstream is signaled in-band such that a decoder can decode any Opus stream without any out-of-band signaling. Any Opus implementation can add or modify these control parameters without affecting interoperability. The most important encoder control parameters in the reference encoder are listed below.

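The Opus codec includes a set of control parameters that can be changed dynamically during normal operation without interrupting the audio stream from encoder to decoder. These parameters affect only the encoder, because any effect they have on the bitstream is signaled in-band, so a decoder can decode any Opus stream without out-of-band signaling. Any Opus implementation may add or modify these control parameters without affecting interoperability. The most important encoder control parameters in the reference encoder are listed below.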

2.1.1. Bitrate

Opus supports all bitrates from 6 kbit/s to 510 kbit/s. All other parameters being equal, higher bitrate results in higher quality. For a frame size of 20 ms, these are the bitrate "sweet spots" for Opus in various configurations:

o 8-12 kbit/s for NB speech,

o 16-20 kbit/s for WB speech,

o 28-40 kbit/s for FB speech,

o 48-64 kbit/s for FB mono music, and

o 64-128 kbit/s for FB stereo music.

Opus supports every bitrate from 6 kbit/s to 510 kbit/s. With all other parameters equal, a higher bitrate gives higher quality. For a 20 ms frame size, the bitrate "sweet spots" for Opus in various configurations are:

  • Narrowband (NB) speech: 8-12 kbit/s
  • Wideband (WB) speech: 16-20 kbit/s
  • Fullband (FB) speech: 28-40 kbit/s
  • Fullband (FB) mono music: 48-64 kbit/s
  • Fullband (FB) stereo music: 64-128 kbit/s

2.1.2. Number of Channels (Mono/Stereo)

Opus can transmit either mono or stereo frames within a single stream. When decoding a mono frame in a stereo decoder, the left and right channels are identical, and when decoding a stereo frame in a mono decoder, the mono output is the average of the left and right channels. In some cases, it is desirable to encode a stereo input stream in mono (e.g., because the bitrate is too low to encode stereo with sufficient quality). The number of channels encoded can be selected in real-time, but by default the reference encoder attempts to make the best decision possible given the current bitrate.

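Opus can carry either mono or stereo frames within a single stream. When a stereo decoder decodes a mono frame, the left and right channels are identical; when a mono decoder decodes a stereo frame, the mono output is the average of the left and right channels. In some cases it is preferable to encode a stereo input stream as mono (for example, because the bitrate is too low to encode stereo with sufficient quality). The number of channels encoded can be selected in real time, but by default the reference encoder tries to make the best decision for the current bitrate.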
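The two channel-conversion rules can be shown on plain sample lists. This is only a sketch of the input/output relationship described above: it uses floating-point samples and hypothetical function names, and it says nothing about how the reference decoder performs the conversion internally or how it rounds in fixed-point builds.

```python
def mono_to_stereo(mono_samples):
    """Stereo output for a mono frame: left and right are identical."""
    return [(s, s) for s in mono_samples]


def stereo_to_mono(stereo_samples):
    """Mono output for a stereo frame: average of left and right."""
    return [(left + right) / 2 for left, right in stereo_samples]
```

Applying `stereo_to_mono` to the result of `mono_to_stereo` returns the original samples, since averaging two identical channels changes nothing.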

2.1.3. Audio Bandwidth

The audio bandwidths supported by Opus are listed in Table 1. Just like for the number of channels, any decoder can decode audio that is encoded at any bandwidth. For example, any Opus decoder operating at 8 kHz can decode an FB Opus frame, and any Opus decoder operating at 48 kHz can decode an NB frame. Similarly, the reference encoder can take a 48 kHz input signal and encode it as NB. The higher the audio bandwidth, the higher the required bitrate to achieve acceptable quality. The audio bandwidth can be explicitly specified in real-time, but, by default, the reference encoder attempts to make the best bandwidth decision possible given the current bitrate.

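The audio bandwidths Opus supports are listed in Table 1. As with the number of channels, any decoder can decode audio encoded at any bandwidth. For example, any Opus decoder running at a sample rate of 8 kHz can decode a fullband (FB) Opus frame, and any decoder running at 48 kHz can decode a narrowband (NB) frame. Likewise, the reference encoder can accept a 48 kHz input signal and encode it as NB. The higher the audio bandwidth, the higher the bitrate needed for acceptable quality. The audio bandwidth can be set explicitly in real time, but by default the reference encoder tries to make the best bandwidth decision for the current bitrate.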

2.1.4. Frame Duration

Opus can encode frames of 2.5, 5, 10, 20, 40, or 60 ms. It can also combine multiple frames into packets of up to 120 ms. For real-time applications, sending fewer packets per second reduces the bitrate, since it reduces the overhead from IP, UDP, and RTP headers. However, it increases latency and sensitivity to packet losses, as losing one packet constitutes a loss of a bigger chunk of audio. Increasing the frame duration also slightly improves coding efficiency, but the gain becomes small for frame sizes above 20 ms. For this reason, 20 ms frames are a good choice for most applications.

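Opus can encode frames of 2.5, 5, 10, 20, 40, or 60 ms. It can also combine multiple frames into packets of up to 120 ms. For real-time applications, sending fewer packets per second lowers the bitrate, because it reduces the overhead of the IP, UDP, and RTP headers. It also increases latency and sensitivity to packet loss, however, since losing one packet means losing a longer stretch of audio. A longer frame duration also improves coding efficiency slightly, but the gain becomes negligible for frame durations above 20 ms. For this reason, 20 ms frames are a good choice for most applications.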
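A rough calculation makes the header-overhead trade-off concrete. The header sizes below are assumptions that do not come from this document: a 20-byte IPv4 header without options, an 8-byte UDP header, and a 12-byte RTP header without CSRC entries or extensions. The function name is hypothetical.

```python
# Assumed minimum header sizes (not specified by this document).
IPV4_HEADER_BYTES = 20
UDP_HEADER_BYTES = 8
RTP_HEADER_BYTES = 12
HEADER_BYTES = IPV4_HEADER_BYTES + UDP_HEADER_BYTES + RTP_HEADER_BYTES


def header_overhead_bps(packet_duration_ms, header_bytes=HEADER_BYTES):
    """Bits per second spent on headers for a given packet duration."""
    packets_per_second = 1000 / packet_duration_ms
    return header_bytes * 8 * packets_per_second
```

Under these assumptions, 20 ms packets cost 16 kbit/s in headers, which is comparable to the whole WB speech sweet spot, while 2.5 ms packets cost 128 kbit/s.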

2.1.5. Complexity

There are various aspects of the Opus encoding process where trade-offs can be made between CPU complexity and quality/bitrate. In the reference encoder, the complexity is selected using an integer from 0 to 10, where 0 is the lowest complexity and 10 is the highest. Examples of computations for which such trade-offs may occur are:

o The order of the pitch analysis whitening filter [WHITENING],

o The order of the short-term noise shaping filter,

o The number of states in delayed decision quantization of the residual signal, and

o The use of certain bitstream features such as variable time-frequency resolution and the pitch post-filter.

In several parts of the Opus encoding process, CPU complexity can be traded off against quality/bitrate. In the reference encoder, the complexity is selected with an integer from 0 to 10, where 0 is the lowest complexity and 10 the highest. Examples of computations where such trade-offs can occur are:

  • The order of the pitch analysis whitening filter [WHITENING]
  • The order of the short-term noise shaping filter
  • The number of states in delayed decision quantization of the residual signal, and
  • The use of certain bitstream features, such as variable time-frequency resolution and the pitch post-filter

2.1.6. Packet Loss Resilience

Audio codecs often exploit inter-frame correlations to reduce the bitrate at a cost in error propagation: after losing one packet, several packets need to be received before the decoder is able to accurately reconstruct the speech signal. The extent to which Opus exploits inter-frame dependencies can be adjusted on the fly to choose a trade-off between bitrate and amount of error propagation.

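Audio codecs often exploit inter-frame correlations to lower the bitrate, at the cost of error propagation: after one packet is lost, the decoder has to receive several packets before it can reconstruct the speech signal accurately. The extent to which Opus exploits inter-frame dependencies can be adjusted on the fly, to trade off bitrate against the amount of error propagation.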

2.1.7. Forward Error Correction (FEC)

Another mechanism providing robustness against packet loss is the in-band Forward Error Correction (FEC). Packets that are determined to contain perceptually important speech information, such as onsets or transients, are encoded again at a lower bitrate and this re-encoded information is added to a subsequent packet.

Another mechanism that provides robustness against packet loss is in-band Forward Error Correction (FEC). When a packet is judged to contain perceptually important speech information (such as onsets or transients), it is encoded again at a lower bitrate, and this re-encoded information is added to a subsequent packet.

2.1.8. Constant/Variable Bitrate

Opus is more efficient when operating with variable bitrate (VBR), which is the default. When low-latency transmission is required over a relatively slow connection, then constrained VBR can also be used. This uses VBR in a way that simulates a "bit reservoir" and is equivalent to what MP3 (MPEG 1, Layer 3) and AAC (Advanced Audio Coding) call CBR (i.e., not true CBR due to the bit reservoir). In some (rare) applications, constant bitrate (CBR) is required. There are two main reasons to operate in CBR mode:

o When the transport only supports a fixed size for each compressed frame, or

o When encryption is used for an audio stream that is either highly constrained (e.g., yes/no, recorded prompts) or highly sensitive [SRTP-VBR].

Bitrate may still be allowed to vary, even with sensitive data, as long as the variation is not driven by the input signal (for example, to match changing network conditions). To achieve this, an application should still run Opus in CBR mode, but change the target rate before each packet.

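Opus is more efficient when it runs in variable bitrate (VBR) mode, which is also its default. When low-latency transmission is needed over a relatively slow connection, constrained VBR can be used as well. That mode uses VBR in a way that simulates a "bit reservoir", and its effect is equivalent to what MP3 (MPEG 1, Layer 3) and AAC (Advanced Audio Coding) call CBR (that is, not true CBR, because of the bit reservoir).

In some (rare) applications, constant bitrate (CBR) is required. There are two main reasons to run in CBR mode:

  • When the transport supports only a fixed size for each compressed frame, or
  • When an audio stream is encrypted and that stream is either highly constrained (for example, yes/no answers or recorded prompts) or highly sensitive [SRTP-VBR].

Even with sensitive data, the bitrate may still vary as long as the variation is not driven by the input signal (for example, when it follows changing network conditions). To achieve this, the application should still run Opus in CBR mode, but change the target rate before each packet.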

2.1.9. Discontinuous Transmission (DTX)

Discontinuous Transmission (DTX) reduces the bitrate during silence or background noise. When DTX is enabled, only one frame is encoded every 400 milliseconds.

Discontinuous Transmission (DTX) lowers the bitrate during silence or background noise. When DTX is enabled, only one frame is encoded every 400 ms.

3. Internal Framing

The Opus encoder produces "packets", which are each a contiguous set of bytes meant to be transmitted as a single unit. The packets described here do not include such things as IP, UDP, or RTP headers, which are normally found in a transport-layer packet. A single packet may contain multiple audio frames, so long as they share a common set of parameters, including the operating mode, audio bandwidth, frame size, and channel count (mono vs. stereo). This section describes the possible combinations of these parameters and the internal framing used to pack multiple frames into a single packet. This framing is not self-delimiting. Instead, it assumes that a lower layer (such as UDP or RTP [RFC3550] or Ogg [RFC3533] or Matroska [MATROSKA-WEBSITE]) will communicate the length, in bytes, of the packet, and it uses this information to reduce the framing overhead in the packet itself. A decoder implementation MUST support the framing described in this section. An alternative, self-delimiting variant of the framing is described in Appendix B. Support for that variant is OPTIONAL.

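The Opus encoder produces "packets", each of which is a contiguous set of bytes meant to be transmitted as a single unit. The packets described here do not include IP, UDP, or RTP headers and the like, which are normally found in a transport-layer packet. A single packet may contain multiple audio frames, provided those frames share a common set of parameters, including the operating mode, audio bandwidth, frame size, and channel count (mono vs. stereo).

This section describes the possible combinations of these parameters and the internal framing used to pack multiple frames into a single packet. This framing is not self-delimiting. Instead, it assumes that a lower layer (such as UDP, RTP [RFC3550], Ogg [RFC3533], or Matroska [MATROSKA-WEBSITE]) conveys the length of the packet in bytes, and it uses that information to reduce the framing overhead in the packet itself.

A decoder implementation MUST support the framing described in this section. An alternative, self-delimiting variant is described in Appendix B; support for that variant is OPTIONAL.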

All bit diagrams in this document number the bits so that bit 0 is the most significant bit of the first byte, and bit 7 is the least significant. Bit 8 is thus the most significant bit of the second byte, etc. Well-formed Opus packets obey certain requirements, marked [R1] through [R7] below. These are summarized in Section 3.4 along with appropriate means of handling malformed packets.

All bit diagrams in this document number the bits as follows: bit 0 is the most significant bit of the first byte, and bit 7 is its least significant bit. Bit 8 is therefore the most significant bit of the second byte, and so on.
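This MSB-first numbering is the opposite of the common programming habit of calling the least significant bit "bit 0". A small sketch of the mapping, with a hypothetical helper name, may help when reading the diagrams:

```python
def get_bit(data, n):
    """Return bit n of data, where bit 0 is the MSB of the first byte."""
    byte_index, offset = divmod(n, 8)
    # Diagram bit 0 of a byte corresponds to a right shift by 7.
    return (data[byte_index] >> (7 - offset)) & 1
```

For instance, in the two-byte sequence 0x80 0x01, diagram bits 0 and 15 are set and all others are clear.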

Well-formed Opus packets obey certain requirements, marked [R1] through [R7] below. Section 3.4 summarizes these requirements and gives appropriate ways of handling malformed packets.

3.1. The TOC Byte

A well-formed Opus packet MUST contain at least one byte [R1]. This byte forms a table-of-contents (TOC) header that signals which of the various modes and configurations a given packet uses. It is composed of a configuration number, "config", a stereo flag, "s", and a frame count code, "c", arranged as illustrated in Figure 1. A description of each of these fields follows.

                          0 1 2 3 4 5 6 7
                         +-+-+-+-+-+-+-+-+
                         | config  |s| c |
                         +-+-+-+-+-+-+-+-+

                      Figure 1: The TOC Byte
                      

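A well-formed Opus packet MUST contain at least one byte [R1]. This byte forms a table-of-contents (TOC) header that signals which of the various modes and configurations the packet uses.

It consists of a configuration number ("config"), a stereo flag ("s"), and a frame count code ("c"), arranged as shown in Figure 1. Each of these fields is described below.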

The top five bits of the TOC byte, labeled "config", encode one of 32 possible configurations of operating mode, audio bandwidth, and frame size. As described, the LP (SILK) layer and MDCT (CELT) layer can be combined in three possible operating modes:

  1. A SILK-only mode for use in low bitrate connections with an audio bandwidth of WB or less,

  2. A Hybrid (SILK+CELT) mode for SWB or FB speech at medium bitrates, and

  3. A CELT-only mode for very low delay speech transmission as well as music transmission (NB to FB).

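The top five bits of the TOC byte, labeled "config", encode one of 32 possible configurations of operating mode, audio bandwidth, and frame size. As described earlier, the LP (SILK) layer and the MDCT (CELT) layer can be combined into three possible operating modes:

  1. SILK-only mode: for low bitrate connections with an audio bandwidth of wideband (WB) or less.

  2. Hybrid (SILK+CELT) mode: for super-wideband (SWB) or fullband (FB) speech at medium bitrates.

  3. CELT-only mode: for very low delay speech transmission as well as music transmission (from narrowband (NB) to fullband (FB)).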

The 32 possible configurations each identify which one of these operating modes the packet uses, as well as the audio bandwidth and the frame size. Table 2 lists the parameters for each configuration.

   +-----------------------+-----------+-----------+-------------------+
   | Configuration         | Mode      | Bandwidth | Frame Sizes       |
   | Number(s)             |           |           |                   |
   +-----------------------+-----------+-----------+-------------------+
   | 0...3                 | SILK-only | NB        | 10, 20, 40, 60 ms |
   |                       |           |           |                   |
   | 4...7                 | SILK-only | MB        | 10, 20, 40, 60 ms |
   |                       |           |           |                   |
   | 8...11                | SILK-only | WB        | 10, 20, 40, 60 ms |
   |                       |           |           |                   |
   | 12...13               | Hybrid    | SWB       | 10, 20 ms         |
   |                       |           |           |                   |
   | 14...15               | Hybrid    | FB        | 10, 20 ms         |
   |                       |           |           |                   |
   | 16...19               | CELT-only | NB        | 2.5, 5, 10, 20 ms |
   |                       |           |           |                   |
   | 20...23               | CELT-only | WB        | 2.5, 5, 10, 20 ms |
   |                       |           |           |                   |
   | 24...27               | CELT-only | SWB       | 2.5, 5, 10, 20 ms |
   |                       |           |           |                   |
   | 28...31               | CELT-only | FB        | 2.5, 5, 10, 20 ms |
   +-----------------------+-----------+-----------+-------------------+

                Table 2: TOC Byte Configuration Parameters

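Each of the 32 possible configurations identifies the operating mode the packet uses, along with its audio bandwidth and frame size. Table 2 lists the parameters for each configuration.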

The configuration numbers in each range (e.g., 0...3 for NB SILK-only) correspond to the various choices of frame size, in the same order. For example, configuration 0 has a 10 ms frame size and configuration 3 has a 60 ms frame size.

One additional bit, labeled "s", signals mono vs. stereo, with 0 indicating mono and 1 indicating stereo.

The remaining two bits of the TOC byte, labeled "c", code the number of frames per packet (codes 0 to 3) as follows:

o 0: 1 frame in the packet

o 1: 2 frames in the packet, each with equal compressed size

o 2: 2 frames in the packet, with different compressed sizes

o 3: an arbitrary number of frames in the packet

This document refers to a packet as a code 0 packet, code 1 packet, etc., based on the value of "c".

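The configuration numbers within each range (for example, 0...3 for NB SILK-only) correspond to the frame sizes in the same order. For example, configuration 0 has a 10 ms frame size and configuration 3 has a 60 ms frame size.

One more bit, labeled "s", signals mono vs. stereo: 0 indicates mono and 1 indicates stereo.

The remaining two bits of the TOC byte, labeled "c", code the number of frames per packet (codes 0 to 3) as follows:

  • 0: 1 frame in the packet
  • 1: 2 frames in the packet, each with the same compressed size
  • 2: 2 frames in the packet, with different compressed sizes
  • 3: an arbitrary number of frames in the packet

Based on the value of "c", this document refers to a packet as a code 0 packet, a code 1 packet, and so on.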

3.2. Frame Packing

This section describes how frames are packed according to each possible value of "c" in the TOC byte.

This section describes how frames are packed for each possible value of "c" in the TOC byte.

3.2.1. Frame Length Coding

When a packet contains multiple VBR frames (i.e., code 2 or 3), the compressed length of one or more of these frames is indicated with a one- or two-byte sequence, with the meaning of the first byte as follows:

o 0: No frame (Discontinuous Transmission (DTX) or lost packet)

o 1...251: Length of the frame in bytes

o 252...255: A second byte is needed. The total length is (second_byte*4)+first_byte

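When a packet contains multiple VBR frames (that is, code 2 or 3), the compressed length of one or more of those frames is indicated with a one-byte or two-byte sequence, where the first byte has the following meaning:

  • 0: No frame (Discontinuous Transmission (DTX) or packet loss)
  • 1...251: Length of the frame, in bytes
  • 252...255: A second byte is needed. The total length is (second_byte * 4) + first_byte.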

The special length 0 indicates that no frame is available, either because it was dropped during transmission by some intermediary or because the encoder chose not to transmit it. Any Opus frame in any mode MAY have a length of 0.

The maximum representable length is 255*4+255=1275 bytes. For 20 ms frames, this represents a bitrate of 510 kbit/s, which is approximately the highest useful rate for lossily compressed fullband stereo music. Beyond this point, lossless codecs are more appropriate. It is also roughly the maximum useful rate of the MDCT layer as, shortly thereafter, quality no longer improves with additional bits due to limitations on the codebook sizes.

No length is transmitted for the last frame in a VBR packet, or for any of the frames in a CBR packet, as it can be inferred from the total size of the packet and the size of all other data in the packet. However, the length of any individual frame MUST NOT exceed 1275 bytes [R2] to allow for repacketization by gateways, conference bridges, or other software.

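The special length 0 means that no frame is available, either because an intermediary dropped it in transit or because the encoder chose not to transmit it. Any Opus frame in any mode MAY have a length of 0.

The maximum representable length is 255*4+255=1275 bytes. For 20 ms frames, this corresponds to a bitrate of 510 kbit/s, roughly the highest useful rate for lossily compressed fullband stereo music. Beyond that point, lossless codecs are more appropriate. It is also roughly the maximum useful rate of the MDCT layer because, shortly thereafter, limits on the codebook sizes mean that quality no longer improves with additional bits.

No length needs to be transmitted for the last frame in a VBR packet, or for any frame in a CBR packet, because it can be inferred from the total size of the packet and the size of all other data in it. However, the length of any individual frame MUST NOT exceed 1275 bytes [R2], so that gateways, conference bridges, or other software can repacketize.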
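The arithmetic in this subsection can be checked with a short Python sketch. The names `encode_frame_length` and `bitrate_kbps` are hypothetical. For lengths of 252 or more, the first byte has to lie in 252...255 and leave a remainder divisible by 4, so the split used here is the only one that satisfies the formula.

```python
MAX_FRAME_BYTES = 255 * 4 + 255  # 1275, the largest representable length


def encode_frame_length(length: int) -> bytes:
    """Encode a frame length as the one- or two-byte sequence."""
    if not 0 <= length <= MAX_FRAME_BYTES:
        raise ValueError("frame length must be in 0...1275")
    if length < 252:
        return bytes([length])
    first = 252 + (length & 3)       # 252...255, same remainder mod 4 as length
    second = (length - first) // 4   # exact division by construction
    return bytes([first, second])


def bitrate_kbps(frame_bytes: int, frame_ms: float) -> float:
    """Bits per millisecond, which is the same number as kbit/s."""
    return frame_bytes * 8 / frame_ms
```

With these definitions, `encode_frame_length(1275)` gives the bytes 255, 255, and `bitrate_kbps(1275, 20)` evaluates to 510.0, matching the figure quoted in the text.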

3.2.2. Code 0: One Frame in the Packet

For code 0 packets, the TOC byte is immediately followed by N-1 bytes of compressed data for a single frame (where N is the size of the packet), as illustrated in Figure 2.

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|0|0|                                               |
     +-+-+-+-+-+-+-+-+                                               |
     |                    Compressed frame 1 (N-1 bytes)...          :
     :                                                               |
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 2: A Code 0 Packet

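For code 0 packets, the TOC byte is immediately followed by the N-1 bytes of compressed data for a single frame (where N is the size of the packet), as shown in Figure 2.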

3.2.3. Code 1: Two Frames in the Packet, Each with Equal Compressed Size

For code 1 packets, the TOC byte is immediately followed by the (N-1)/2 bytes of compressed data for the first frame, followed by (N-1)/2 bytes of compressed data for the second frame, as illustrated in Figure 3. The number of payload bytes available for compressed data, N-1, MUST be even for all code 1 packets [R3].

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|0|1|                                               |
     +-+-+-+-+-+-+-+-+                                               :
     |             Compressed frame 1 ((N-1)/2 bytes)...             |
     :                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               :
     |             Compressed frame 2 ((N-1)/2 bytes)...             |
     :                                               +-+-+-+-+-+-+-+-+
     |                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 3: A Code 1 Packet

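For code 1 packets, the TOC byte is immediately followed by (N-1)/2 bytes of compressed data for the first frame and then (N-1)/2 bytes of compressed data for the second frame, as shown in Figure 3. For all code 1 packets, the number of payload bytes available for compressed data, N-1, MUST be even [R3].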
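The code 0 and code 1 layouts are simple enough to sketch together. The Python below is an illustrative splitter, not the reference decoder: the function name is hypothetical, and it reports every violation as a `ValueError`.

```python
from typing import List

MAX_FRAME_BYTES = 1275  # per-frame limit [R2]


def split_code0_or_code1(packet: bytes) -> List[bytes]:
    """Return the compressed frames of a code 0 or code 1 packet."""
    if len(packet) < 1:
        raise ValueError("a packet is at least one byte [R1]")
    code = packet[0] & 0x03       # the "c" bits of the TOC byte
    payload = packet[1:]          # the N-1 bytes after the TOC byte
    if code == 0:
        frames = [payload]
    elif code == 1:
        if len(payload) % 2 != 0:
            raise ValueError("code 1 payload length N-1 must be even [R3]")
        half = len(payload) // 2
        frames = [payload[:half], payload[half:]]
    else:
        raise ValueError("this sketch only handles codes 0 and 1")
    if any(len(frame) > MAX_FRAME_BYTES for frame in frames):
        raise ValueError("frame longer than 1275 bytes [R2]")
    return frames
```

A one-byte code 0 packet yields a single empty frame, which is consistent with the rule that any frame MAY have a length of 0.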

3.2.4. Code 2: Two Frames in the Packet, with Different Compressed Sizes

For code 2 packets, the TOC byte is followed by a one- or two-byte sequence indicating the length of the first frame (marked N1 in Figure 4), followed by N1 bytes of compressed data for the first frame. The remaining N-N1-2 or N-N1-3 bytes are the compressed data for the second frame. This is illustrated in Figure 4. A code 2 packet MUST contain enough bytes to represent a valid length. For example, a 1-byte code 2 packet is always invalid, and a 2-byte code 2 packet whose second byte is in the range 252...255 is also invalid.

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|1|0| N1 (1-2 bytes):                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               :
     |               Compressed frame 1 (N1 bytes)...                |
     :                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
     |                     Compressed frame 2...                     :
     :                                                               |
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 4: A Code 2 Packet

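For code 2 packets, the TOC byte is followed by a one- or two-byte sequence giving the length of the first frame (marked N1 in Figure 4), and then by the N1 bytes of compressed data for that frame. The remaining N-N1-2 or N-N1-3 bytes are the compressed data for the second frame. Figure 4 shows this layout. A code 2 packet MUST contain enough bytes to represent a valid length. For example, a 1-byte code 2 packet is always invalid, and a 2-byte code 2 packet whose second byte is in the range 252...255 is also invalid.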

The length of the first frame, N1, MUST also be no larger than the size of the payload remaining after decoding that length for all code 2 packets [R4]. This makes, for example, a 2-byte code 2 packet with a second byte in the range 1...251 invalid as well (the only valid 2-byte code 2 packet is one where the length of both frames is zero).

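For all code 2 packets, the length of the first frame, N1, MUST also be no larger than the size of the payload remaining after that length is decoded [R4]. This also makes, for example, a 2-byte code 2 packet whose second byte is in the range 1...251 invalid (the only valid 2-byte code 2 packet is one in which both frames have length zero).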
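The code 2 rules, including the 2-byte edge cases mentioned above, can be illustrated with the following Python sketch. The function name and the use of `ValueError` for malformed input are assumptions of this example.

```python
from typing import Tuple

MAX_FRAME_BYTES = 1275  # per-frame limit [R2]


def split_code2(packet: bytes) -> Tuple[bytes, bytes]:
    """Return (frame 1, frame 2) of a code 2 packet."""
    if len(packet) < 2 or packet[0] & 0x03 != 2:
        raise ValueError("need a code 2 packet with at least a length byte")
    first = packet[1]
    if first < 252:
        n1, pos = first, 2                    # one-byte length
    else:
        if len(packet) < 3:
            raise ValueError("second length byte is missing [R4]")
        n1, pos = packet[2] * 4 + first, 3    # two-byte length
    if n1 > len(packet) - pos:
        raise ValueError("N1 exceeds the remaining payload [R4]")
    frame1 = packet[pos:pos + n1]
    frame2 = packet[pos + n1:]                # N-N1-2 or N-N1-3 bytes
    if len(frame2) > MAX_FRAME_BYTES:
        raise ValueError("frame 2 longer than 1275 bytes [R2]")
    return frame1, frame2
```

The only 2-byte packet this accepts is one whose length byte is 0, in which case both frames come back empty.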

3.2.5. Code 3: A Signaled Number of Frames in the Packet

Code 3 packets signal the number of frames, as well as additional padding, called "Opus padding" to indicate that this padding is added at the Opus layer rather than at the transport layer. Code 3 packets MUST have at least 2 bytes [R6,R7]. The TOC byte is followed by a byte encoding the number of frames in the packet in bits 2 to 7 (marked "M" in Figure 5), with bit 1 indicating whether or not Opus padding is inserted (marked "p" in Figure 5), and bit 0 indicating VBR (marked "v" in Figure 5). M MUST NOT be zero, and the audio duration contained within a packet MUST NOT exceed 120 ms [R5]. This limits the maximum frame count for any frame size to 48 (for 2.5 ms frames), with lower limits for longer frame sizes. Figure 5 illustrates the layout of the frame count byte.

                              0
                              0 1 2 3 4 5 6 7
                             +-+-+-+-+-+-+-+-+
                             |v|p|     M     |
                             +-+-+-+-+-+-+-+-+

                      Figure 5: The frame count byte

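Code 3 packets signal the number of frames along with additional padding, called "Opus padding" to indicate that it is added at the Opus layer rather than at the transport layer. Code 3 packets MUST have at least 2 bytes [R6,R7]. The TOC byte is followed by a byte that encodes the number of frames in the packet in bits 2 to 7 (marked "M" in Figure 5), with bit 1 indicating whether Opus padding is inserted (marked "p" in Figure 5) and bit 0 indicating VBR (marked "v" in Figure 5). M MUST NOT be zero, and the audio duration contained within a packet MUST NOT exceed 120 ms [R5]. This limits the maximum frame count to 48 (for 2.5 ms frames), with lower limits for longer frames. Figure 5 shows the layout of the frame count byte.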
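Because the figures number bits from the most significant end, "bit 0" in Figure 5 is the top bit of the byte. The Python sketch below makes that mapping explicit and applies the 120 ms limit. The function names are hypothetical, and the frame duration is passed in by the caller in microseconds (an assumption made to keep 2.5 ms frames in integer arithmetic).

```python
from typing import Tuple

MAX_PACKET_US = 120_000  # 120 ms of audio per packet [R5], in microseconds


def max_frames(frame_us: int) -> int:
    """Largest frame count M allowed for a given frame duration."""
    return MAX_PACKET_US // frame_us


def parse_frame_count_byte(value: int, frame_us: int) -> Tuple[bool, bool, int]:
    """Return (vbr, has_padding, M) from the byte after the TOC."""
    vbr = bool(value & 0x80)          # "v": bit 0 in Figure 5 (most significant)
    has_padding = bool(value & 0x40)  # "p": bit 1 in Figure 5
    m = value & 0x3F                  # "M": bits 2 to 7 in Figure 5
    if m == 0:
        raise ValueError("M must not be zero [R5]")
    if m * frame_us > MAX_PACKET_US:
        raise ValueError("packet would exceed 120 ms of audio [R5]")
    return vbr, has_padding, m
```

For example, `max_frames(2500)` is 48 and `max_frames(20000)` is 6, so a frame count byte with M=7 is rejected for 20 ms frames.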

When Opus padding is used, the number of bytes of padding is encoded in the bytes following the frame count byte. Values from 0...254 indicate that 0...254 bytes of padding are included, in addition to the byte(s) used to indicate the size of the padding. If the value is 255, then the size of the additional padding is 254 bytes, plus the padding value encoded in the next byte. There MUST be at least one more byte in the packet in this case [R6,R7]. The additional padding bytes appear at the end of the packet and MUST be set to zero by the encoder to avoid creating a covert channel. The decoder MUST accept any value for the padding bytes, however.

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|1|1|0|p|     M     |  Padding length (Optional)    :
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame 1 (R/M bytes)...               :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame 2 (R/M bytes)...               :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :                              ...                              :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame M (R/M bytes)...               :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     :                  Opus Padding (Optional)...                   |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                       Figure 6: A CBR Code 3 Packet

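When Opus padding is used, the number of padding bytes is encoded in the bytes that follow the frame count byte. Values of 0...254 indicate that 0...254 bytes of padding are included, in addition to the byte(s) used to indicate the size of the padding. If the value is 255, the additional padding is 254 bytes plus the padding value encoded in the next byte. In that case, there MUST be at least one more byte in the packet [R6,R7]. The additional padding bytes appear at the end of the packet, and the encoder MUST set them to zero to avoid creating a covert channel. The decoder, however, MUST accept any value for the padding bytes.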

Although this encoding provides multiple ways to indicate a given number of padding bytes, each uses a different number of bytes to indicate the padding size and thus will increase the total packet size by a different amount. For example, to add 255 bytes to a packet, set the padding bit, p, to 1, insert a single byte after the frame count byte with a value of 254, and append 254 padding bytes with the value zero to the end of the packet. To add 256 bytes to a packet, set the padding bit to 1, insert two bytes after the frame count byte with the values 255 and 0, respectively, and append 254 padding bytes with the value zero to the end of the packet. By using the value 255 multiple times, it is possible to create a packet of any specific, desired size. Let P be the number of header bytes used to indicate the padding size plus the number of padding bytes themselves (i.e., P is the total number of bytes added to the packet). Then, P MUST be no more than N-2 [R6,R7].

In the CBR case, let R=N-2-P be the number of bytes remaining in the packet after subtracting the (optional) padding. The compressed length of each frame is then R/M bytes, so R MUST be a non-negative integer multiple of M [R6], as shown in Figure 6.

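Although this encoding offers several ways to indicate a given number of padding bytes, each way uses a different number of bytes to indicate the padding size and therefore grows the total packet size by a different amount. For example, to add 255 bytes to a packet, set the padding bit, p, to 1, insert a single byte with the value 254 after the frame count byte, and append 254 zero-valued padding bytes to the end of the packet. To add 256 bytes to a packet, set the padding bit to 1, insert two bytes after the frame count byte (with the values 255 and 0, respectively), and append 254 zero-valued padding bytes to the end of the packet. By using the value 255 multiple times, a packet of any specific, desired size can be created.

Let P be the number of header bytes used to indicate the padding size plus the number of padding bytes themselves (that is, P is the total number of bytes added to the packet). Then, P MUST be no more than N-2 [R6,R7].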
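The two worked cases above (255 and 256 added bytes) follow a pattern that the Python sketch below generalizes. Both function names are hypothetical. The encoder assumes a run of 255-valued bytes followed by one final byte in 0...254, in which case a total of P added bytes equals 255*k + 1 + v for k leading 255s and a final value v.

```python
from typing import Tuple


def decode_padding_length(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (number of trailing padding bytes, position after the length bytes)."""
    padding = 0
    while True:
        if pos >= len(data):
            raise ValueError("padding length byte is missing [R6,R7]")
        value = data[pos]
        pos += 1
        if value == 255:
            padding += 254   # 255 means 254 bytes plus whatever the next byte says
        else:
            padding += value
            return padding, pos


def encode_padding(total_added: int) -> Tuple[bytes, int]:
    """Return (padding length bytes, trailing zero bytes) that add P bytes in total."""
    if total_added < 1:
        raise ValueError("padding adds at least the one length byte")
    k, v = divmod(total_added - 1, 255)
    return bytes([255] * k + [v]), 254 * k + v
```

Here `encode_padding(255)` returns the single length byte 254 with 254 trailing bytes, and `encode_padding(256)` returns the length bytes 255, 0 with 254 trailing bytes, matching the two examples in the text.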

In the VBR case, the (optional) padding length is followed by M-1 frame lengths (indicated by "N1" to "N[M-1]" in Figure 7), each encoded in a one- or two-byte sequence as described above. The packet MUST contain enough data for the M-1 lengths after removing the (optional) padding, and the sum of these lengths MUST be no larger than the number of bytes remaining in the packet after decoding them [R7]. The compressed data for all M frames follows, each frame consisting of the indicated number of bytes, with the final frame consuming any remaining bytes before the final padding, as illustrated in Figure 7. The number of header bytes (TOC byte, frame count byte, padding length bytes, and frame length bytes), plus the signaled length of the first M-1 frames themselves, plus the signaled length of the padding MUST be no larger than N, the total size of the packet.

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|1|1|1|p|     M     | Padding length (Optional)     :
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     : N1 (1-2 bytes): N2 (1-2 bytes):     ...       :     N[M-1]    |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame 1 (N1 bytes)...                :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame 2 (N2 bytes)...                :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :                              ...                              :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :                     Compressed frame M...                     :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     :                  Opus Padding (Optional)...                   |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                       Figure 7: A VBR Code 3 Packet

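In the VBR case, the (optional) padding length is followed by M-1 frame lengths (shown as "N1" to "N[M-1]" in Figure 7), each encoded as a one- or two-byte sequence as described above. After the (optional) padding is removed, the packet MUST contain enough data for these M-1 lengths, and their sum MUST be no larger than the number of bytes remaining in the packet after they are decoded [R7]. The compressed data for all M frames follows, each frame consisting of the indicated number of bytes, with the final frame taking all remaining bytes before the final padding, as shown in Figure 7. The number of header bytes (TOC byte, frame count byte, padding length bytes, and frame length bytes), plus the signaled length of the first M-1 frames themselves, plus the signaled length of the padding, MUST be no larger than N, the total size of the packet.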
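Putting the code 3 pieces together, the Python sketch below splits a CBR or VBR code 3 packet into frames. It is an illustration under stated assumptions rather than the reference decoder: the function name is hypothetical, all violations become `ValueError`, and the 120 ms duration check of [R5] is left out because it needs the frame duration implied by the configuration number.

```python
from typing import List

MAX_FRAME_BYTES = 1275  # per-frame limit [R2]


def split_code3(packet: bytes) -> List[bytes]:
    """Return the M compressed frames of a code 3 packet."""
    n = len(packet)
    if n < 2 or packet[0] & 0x03 != 3:
        raise ValueError("need a code 3 packet of at least 2 bytes")
    vbr = bool(packet[1] & 0x80)
    has_padding = bool(packet[1] & 0x40)
    m = packet[1] & 0x3F
    if m == 0:
        raise ValueError("M must not be zero")
    pos, padding = 2, 0
    while has_padding:                      # read the padding length bytes
        if pos >= n:
            raise ValueError("truncated padding length")
        value = packet[pos]
        pos += 1
        padding += 254 if value == 255 else value
        has_padding = value == 255          # 255 means another byte follows
    end = n - padding                       # index of the first padding byte
    if end < pos:
        raise ValueError("padding does not fit: P > N-2")
    if vbr:
        lengths = []
        for _ in range(m - 1):              # M-1 explicit frame lengths
            if pos >= end:
                raise ValueError("truncated frame length")
            length = packet[pos]
            pos += 1
            if length >= 252:
                if pos >= end:
                    raise ValueError("truncated frame length")
                length += 4 * packet[pos]
                pos += 1
            lengths.append(length)
        last = end - pos - sum(lengths)     # final frame takes what is left
        if last < 0:
            raise ValueError("frame lengths exceed the packet [R7]")
        lengths.append(last)
    else:
        r = end - pos                       # R = N-2-P
        if r % m != 0:
            raise ValueError("R is not a multiple of M [R6]")
        lengths = [r // m] * m
    if max(lengths) > MAX_FRAME_BYTES:
        raise ValueError("frame longer than 1275 bytes [R2]")
    frames = []
    for length in lengths:
        frames.append(packet[pos:pos + length])
        pos += length
    return frames
```

Trailing padding bytes are skipped without inspecting their values, in line with the requirement that a decoder accept any value there.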

3.3. Examples

Simplest case, one NB mono 20 ms SILK frame:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    1    |0|0|0|               compressed data...              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                 Figure 8

Two FB mono 5 ms CELT frames of the same compressed size:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   29    |0|0|1|               compressed data...              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                 Figure 9

Two FB mono 20 ms Hybrid frames of different compressed size:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   15    |0|1|1|1|0|     2     |      N1       |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                       compressed data...                      :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                 Figure 10

Four FB stereo 20 ms CELT frames of the same compressed size:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   31    |1|1|1|0|0|     4     |      compressed data...       :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                 Figure 11
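To connect the four example figures to concrete byte values, the following Python sketch packs the header fields. The helper names are hypothetical, and the configuration numbers are simply those drawn in the figures; the mapping from configuration number to mode, bandwidth, and frame size is not reproduced here.

```python
def toc_byte(config: int, stereo: bool, code: int) -> int:
    """Pack config (5 bits), s (1 bit), and c (2 bits) into a TOC byte."""
    if not (0 <= config <= 31 and 0 <= code <= 3):
        raise ValueError("config is 0..31 and code is 0..3")
    return (config << 3) | (int(stereo) << 2) | code


def frame_count_byte(m: int, vbr: bool = False, padding: bool = False) -> int:
    """Pack v, p, and M into the byte that follows the TOC in code 3 packets."""
    if not 1 <= m <= 63:
        raise ValueError("M must fit in 6 bits and must not be zero")
    return (int(vbr) << 7) | (int(padding) << 6) | m


# Header bytes of the four examples, in figure order.
figure_8 = bytes([toc_byte(1, False, 0)])
figure_9 = bytes([toc_byte(29, False, 1)])
figure_10 = bytes([toc_byte(15, False, 3), frame_count_byte(2, vbr=True)])
figure_11 = bytes([toc_byte(31, True, 3), frame_count_byte(4)])
```

Under these assumptions the headers come out as 0x08, 0xE9, 0x7B 0x82, and 0xFF 0x04, respectively.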

3.4. Receiving Malformed Packets

A receiver MUST NOT process packets that violate any of the rules above as normal Opus packets. They are reserved for future applications, such as in-band headers (containing metadata, etc.). Packets that violate these constraints may cause implementations of this specification to treat them as malformed and discard them.

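A receiver MUST NOT process any packet that violates the rules above as a normal Opus packet. Such packets are reserved for future applications, such as in-band headers (containing metadata, etc.). Packets that violate these constraints may cause implementations of this specification to treat them as malformed and discard them.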

These constraints are summarized here for reference:

[R1] Packets are at least one byte.

[R2] No implicit frame length is larger than 1275 bytes.

[R3] Code 1 packets have an odd total length, N, so that (N-1)/2 is an integer.

[R4] Code 2 packets have enough bytes after the TOC for a valid frame length, and that length is no larger than the number of bytes remaining in the packet.

[R5] Code 3 packets contain at least one frame, but no more than 120 ms of audio total.

[R6] The length of a CBR code 3 packet, N, is at least two bytes, the number of bytes added to indicate the padding size plus the trailing padding bytes themselves, P, is no more than N-2, and the frame count, M, satisfies the constraint that (N-2-P) is a non-negative integer multiple of M.

[R7] VBR code 3 packets are large enough to contain all the header bytes (TOC byte, frame count byte, any padding length bytes, and any frame length bytes), plus the length of the first M-1 frames, plus any trailing padding bytes.

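The constraints are summarized here for reference:

[R1] A packet is at least one byte.

[R2] No implicit frame length is larger than 1275 bytes.

[R3] The total length, N, of a code 1 packet is odd, so that (N-1)/2 is an integer.

[R4] A code 2 packet has enough bytes after the TOC byte to represent a valid frame length, and that length is no larger than the number of bytes remaining in the packet.

[R5] A code 3 packet contains at least one frame, but no more than 120 ms of audio in total.

[R6] The length, N, of a CBR code 3 packet is at least two bytes; the number of bytes used to indicate the padding size plus the trailing padding bytes themselves, P, is no more than N-2; and the frame count, M, satisfies the constraint that (N-2-P) is a non-negative integer multiple of M.

[R7] A VBR code 3 packet is large enough to contain all the header bytes (TOC byte, frame count byte, any padding length bytes, and any frame length bytes), plus the length of the first M-1 frames, plus any trailing padding bytes.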