FIR example

Principle

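The HiFi 5 ISA improves support for 16-bit and 32-bit complex-valued operations, including 32x32, 32x16, and 16x16 MAC operations that accumulate at different precisions, such as 16, 32, 17.47, and 64 bits.
The ISA can sustain a throughput of two 32-bit complex-valued MAC operations per cycle.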

Instructions

The compiler inserts SIMD load instructions: ae_l32x2.ip
The compiler inserts 2-way SIMD, 32-bit x 32-bit multiply instructions: ae_mulp32x2

Using the HiFi 5 DSP's intrinsics and C data types:

SIMD 32x2 integer data type ae_int32x2
SIMD 32x4 (4-way) integer data type ae_int32x4
data type ae_int64
load 128 bits of vector_a and 128 bits of vector_b
AE_L32X2X2_IP(a0, a1, Pvec_a, 16);
AE_L32X2X2_IP(b0, b1, Pvec_b, 16);
two 32x32 Dual MACs with 64-bit accumulator
AE_MULAAD32_HH_LL(acc0, a0, b0);
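
To make the "dual MAC with 64-bit accumulator" idea concrete, here is a minimal portable C sketch. It is only a model: `vec32x2`, `dual_mac_hh_ll`, and `dot_product_x2` are hypothetical names, and the assumed behaviour (high lane times high lane plus low lane times low lane, added to a 64-bit accumulator) is inferred from the intrinsic's name and the comment above, not taken from the ISA reference.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ae_int32x2: two 32-bit lanes, H (high) and L (low). */
typedef struct {
    int32_t h;
    int32_t l;
} vec32x2;

/* Assumed behaviour of a dual MAC such as AE_MULAAD32_HH_LL:
 * acc += a.H * b.H + a.L * b.L, with both products widened to 64 bits. */
int64_t dual_mac_hh_ll(int64_t acc, vec32x2 a, vec32x2 b)
{
    return acc + (int64_t)a.h * b.h + (int64_t)a.l * b.l;
}

/* Dot product over arrays of 2-lane vectors; n_vec counts vec32x2 elements. */
int64_t dot_product_x2(const vec32x2 *a, const vec32x2 *b, int n_vec)
{
    assert(n_vec >= 0);
    int64_t acc = 0;
    for (int i = 0; i < n_vec; ++i) {
        acc = dual_mac_hh_ll(acc, a[i], b[i]);
    }
    return acc;
}
```

Each call to the model retires two multiplies, which is why a 2-way SIMD loop needs half as many iterations as a scalar one.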

32x32-Bit Precision FIR Example

Goal: achieve eight 32x32-bit MACs/cycle on HiFi 5
File: hifi5_FIR.c
Input: two input arrays for the FIR (input/state and coefficients)
Output: the computed FIR result, stored in the output array

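Note: these example functions do not include the filter state update. For simplicity, the input/state array is assumed to be in reverse time order.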

Original FIR function

Function: std_FIR()

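Question 1: How many cycles does std_FIR take to execute? What is the main overhead causing the inefficiency?
Answer: The profile shows 23660 cycles.
We did not use any compiler switches to tell the compiler to use HiFi 5 DSP-specific instructions or to vectorize the loop. Each iteration of the algorithm's inner loop has to read two 32-bit values (32 bits of data and a 32-bit coefficient) and pass them to a 32x32 MAC. The inner loop executes N times.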
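
The scalar loop described above can be sketched in plain C as follows. The real source of std_FIR is not shown in these notes, so `fir_scalar` and its parameter names are hypothetical; the sketch only assumes what the text states: two 32-bit loads and one 32x32 MAC per inner iteration, and an input/state array already in reverse time order so both arrays are walked with increasing indices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar baseline (not the actual std_FIR source).
 * x holds n_out + n_taps - 1 samples in reverse time order,
 * h holds n_taps coefficients, y receives n_out 64-bit results. */
void fir_scalar(const int32_t *x, const int32_t *h, int64_t *y,
                size_t n_out, size_t n_taps)
{
    assert(n_taps > 0);
    for (size_t i = 0; i < n_out; ++i) {
        int64_t acc = 0;
        /* Inner loop: one data load, one coefficient load, one 32x32 MAC. */
        for (size_t j = 0; j < n_taps; ++j) {
            acc += (int64_t)x[i + j] * h[j];
        }
        y[i] = acc;
    }
}
```

With one MAC per pair of loads, the loop can never retire more than one MAC per iteration, which is the overhead the later optimizations attack.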

Optimization 1

Function: hifi5_FIR0()
Optimization: use HiFi 5 data types and single-MAC intrinsic functions, but without vectorization.

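Question 2: How many cycles does hifi5_FIR0 take to execute? How many MACs/cycle are achieved?
Answer: The profile shows 6091 cycles.
The pipeline view shows AE_MULA32_HH taking 4032 execute cycles + 1136 misc.
The compiler-generated loop kernel contains a single {two ae_load, one ae_mul} instruction, which causes up to ~1000 cycles of stalls. Scheduling two such instructions in the loop kernel instead would reduce the stalls.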

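Optimization 2

Function: hifi5_FIR1()
Optimization: use 2-way SIMD vector data types to increase the vectorization of both the inner and outer loops by a factor of 2.
Instruction: AE_MULAFD32X2RA_FIR_H, a quad 32x32-bit MAC with two 64-bit accumulators.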

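Question 3: How many cycles does hifi5_FIR1 take to execute? Why does the performance fall short of eight 32x32-bit MACs/cycle?
Answer: The profile shows 2045 cycles.
The loop kernel performs six 64-bit loads and two FIR (quad) MAC operations. With this FLIX encoding, the loads must be scheduled as three dual loads across three instructions.
The MAC throughput is therefore 8/3 = 2.6667 MACs/cycle (two quad MACs, i.e. eight MACs, over three instructions). On the HiFi 5 DSP architecture, which has two load/store units, the FLIX encoding supports a dual load and a quad MAC in each instruction, so the loop code must be optimized to increase the number of FIR MAC operations.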

To reach eight 32x32-bit MACs/cycle, FLIX should issue AE_MULAFD32X2RA_FIR_H in two slots.
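
The 2x blocking of the inner and outer loops in hifi5_FIR1 can be pictured with a portable C model. This is a structural sketch only: `quad_mac_fir` and `fir_block2x2` are hypothetical names, and the exact operand layout, fractional scaling, and rounding of AE_MULAFD32X2RA_FIR_H are not modelled. The only thing assumed is what the text states, four 32x32 products feeding two 64-bit accumulators.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one quad MAC with two accumulators:
 * two adjacent outputs share the middle sample x1. */
void quad_mac_fir(int64_t *acc0, int64_t *acc1,
                  int32_t x0, int32_t x1, int32_t x2,
                  int32_t h0, int32_t h1)
{
    *acc0 += (int64_t)x0 * h0 + (int64_t)x1 * h1;
    *acc1 += (int64_t)x1 * h0 + (int64_t)x2 * h1;
}

/* Two outputs per outer iteration, two taps per inner iteration.
 * x holds n_out + n_taps - 1 samples in reverse time order. */
void fir_block2x2(const int32_t *x, const int32_t *h, int64_t *y,
                  size_t n_out, size_t n_taps)
{
    assert(n_out % 2 == 0 && n_taps % 2 == 0);
    for (size_t i = 0; i < n_out; i += 2) {
        int64_t acc0 = 0;
        int64_t acc1 = 0;
        for (size_t j = 0; j < n_taps; j += 2) {
            quad_mac_fir(&acc0, &acc1,
                         x[i + j], x[i + j + 1], x[i + j + 2],
                         h[j], h[j + 1]);
        }
        y[i] = acc0;
        y[i + 1] = acc1;
    }
}
```

In this model each inner iteration performs 4 MACs, so reaching 8 MACs/cycle requires two such operations to be issued together, as noted above.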

Optimization 3

Function: hifi5_FIR2()
Optimization: perform sixteen 32x32-bit MACs per iteration, producing 4 partial outputs, each of which is the accumulation of 4 MACs.

Instruction: AE_MULAFD32X2RA_FIR_H, a quad 32x32-bit MAC with two 64-bit accumulators.

Question 4: How many cycles does hifi5_FIR2 take to execute? Did you achieve eight 32x32-bit MACs/cycle?
Answer: The profile shows 785 cycles.
We use two quad MACs in one FLIX format, which means we finally achieve eight 32x32-bit MACs/cycle. In total we produce 4 partial outputs, each being the accumulation of 4 MAC results.
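
The "16 MACs per iteration, 4 partial outputs of 4 MACs each" structure can be sketched in portable C as below. `fir_block4x4` is a hypothetical name and the code is a plain-C model of the blocking, not the hifi5_FIR2 source. On HiFi 5 the 16 products would be carried by quad-MAC intrinsics rather than scalar multiplies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: 4 outputs per outer iteration, 4 taps per inner
 * iteration, i.e. 16 MACs per inner iteration.
 * x holds n_out + n_taps - 1 samples in reverse time order. */
void fir_block4x4(const int32_t *x, const int32_t *h, int64_t *y,
                  size_t n_out, size_t n_taps)
{
    assert(n_out % 4 == 0 && n_taps % 4 == 0);
    for (size_t i = 0; i < n_out; i += 4) {
        int64_t acc[4] = {0, 0, 0, 0};
        for (size_t j = 0; j < n_taps; j += 4) {
            /* Each of the 4 partial outputs accumulates 4 products. */
            for (size_t o = 0; o < 4; ++o) {
                for (size_t t = 0; t < 4; ++t) {
                    acc[o] += (int64_t)x[i + o + j + t] * h[j + t];
                }
            }
        }
        for (size_t o = 0; o < 4; ++o) {
            y[i + o] = acc[o];
        }
    }
}
```

The 16 products of one inner iteration touch only 7 consecutive samples and 4 coefficients, so the number of MACs per loaded value rises compared with the 2x2 blocking, which is what lets the loads keep up with two quad MACs per instruction.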

DSP Study Notes (3) | Audio DSP Example
