Batch-invert in Curve25519-dalek本篇主要介绍Curve25519-dalek库中的批量求

本篇主要介绍Curve25519-dalek库中的批量求逆运算，并会给出批量求逆和普通求逆运算的效率对比。

Introduction

批量求逆的思想可以应用到任何一种群结构上，比如Curve25519-dalek库中的scalar求逆、底层有限域上元素的求逆等。在这里我们将以Curve25519-dalek库中的scalar求逆运算为例来介绍该算法的核心思想。

Curve25519-dalek库中的scalar定义如下：

pub struct Scalar {
/// `bytes` is a little-endian byte encoding of an integer representing a scalar modulo the
    /// group order.
    ///
    /// # Invariant
    ///
    /// The integer representing this scalar must be bounded above by \\(2\^{255}\\), or
    /// equivalently the high bit of `bytes[31]` must be zero.
    ///
    /// This ensures that there is room for a carry bit when computing a NAF representation.
    //
    // XXX This is pub(crate) so we can write literal constants.  If const fns were stable, we could
    //     make the Scalar constructors const fns and use those instead.
    pub(crate) bytes: [u8; 32],
}

该scalar是由32个字节数组表示的，由于是针对curve25519曲线定义的，因此scalar的取值范围是[0,l)，其中l是相应椭圆曲线群的阶: $l = 2^{252} + 27742317777372353535851937790883648493$ . 并且所有的标量运算都是模l的条件下进行的。

在这里可以将所有scalar（0除外）组成的集合作为一个乘法群，群中运算为模l的乘法运算。并且满足封闭性，结合律。单位元为1，每个元素 $x$ 都存在相应逆 $x^{-1}$ ，也就是关于1的乘法逆: $x*x^{-1}=1$ $mod$ $l$ 。由于l是一个素数，根据费马小定理，可以知道 $x^{l-1}=1 mod l$ ，因此 $x*x^{l-2}=1 mod l$ ，也就是说 $x^{-1}=x^{l-2} mod l$ 。因此存在有效计算群上元素乘法逆的算法，记为invert。

在这个前提条件下，我们研究如何进行批量求逆运算，即对于一个scalar的数组，如果高效计算该数组中所有元素的逆元。一个最基本的算法就是利用invert算法对数组中的scalar挨个求逆，但前面我们看到invert算法需要对元素进行l-2次的指数运算，而l-2本身是一个比较大的数，所以运算起来比较费时。因此，考虑如何提高批量求逆的效率。

不失一般性的，我们假设需要求逆的数组长度为4： $[a,b,c,d]$ ，最终要求输出 $[a^{-1},b^{-1},c^{-1},d^{-1}]$ 。下面将以此为例来介绍curve25519-dalek库中的batch-invert算法。

 pub fn batch_invert(inputs: &mut [Scalar]) -> Scalar {
        // This code is essentially identical to the FieldElement
        // implementation, and is documented there.  Unfortunately,
        // it's not easy to write it generically, since here we want
        // to use `UnpackedScalar`s internally, and `Scalar`s
        // externally, but there's no corresponding distinction for
        // field elements.

        use zeroize::Zeroizing;

        let n = inputs.len();
        let one: UnpackedScalar = Scalar::one().unpack().to_montgomery();

        // Place scratch storage in a Zeroizing wrapper to wipe it when
        // we pass out of scope.
        let scratch_vec = vec![one; n];
        let mut scratch = Zeroizing::new(scratch_vec);

        // Keep an accumulator of all of the previous products
        let mut acc = Scalar::one().unpack().to_montgomery();

        // Pass through the input vector, recording the previous
        // products in the scratch space
   			//----------1----------
        for (input, scratch) in inputs.iter_mut().zip(scratch.iter_mut()) {
            *scratch = acc;

            // Avoid unnecessary Montgomery multiplication in second pass by
            // keeping inputs in Montgomery form
            let tmp = input.unpack().to_montgomery();
            *input = tmp.pack();
            acc = UnpackedScalar::montgomery_mul(&acc, &tmp);
        }
       //-----------2------------
        // acc is nonzero iff all inputs are nonzero
        debug_assert!(acc.pack() != Scalar::zero());

        // Compute the inverse of all products
   			//----------3--------------
        acc = acc.montgomery_invert().from_montgomery();

        // We need to return the product of all inverses later
        let ret = acc.pack();

        // Pass through the vector backwards to compute the inverses
        // in place
   			//---------------4---------------
        for (input, scratch) in inputs.iter_mut().rev().zip(scratch.iter().rev()) {
            let tmp = UnpackedScalar::montgomery_mul(&acc, &input.unpack());
            *input = UnpackedScalar::montgomery_mul(&acc, &scratch).pack();
            acc = tmp;
        }
				
        ret
    }

首先，生成了与scalar数组长度相同的数组scratch以及赋值为1的acc变量。

随后在---1---和---2---的过程中，将acc保存在scratch相应位置上，并将每一个元素与acc累积相乘来更新acc。以输入 $[a,b,c,d]$ 为例，经过这一步，得到的scratch被更新为 $[1,a,ab,abc]$ ，并且 $acc=abcd$ 。

在第---3---步之后，计算了acc的乘法逆，在计算过程中先将acc转成montgomery形式，然后在该形式下求逆，最终再转成原来的scalar形式。也就是，这步执行完后， $acc=(abcd)^{-1}$ 。

第四步则对inputs和scratch进行逆向迭代： $[d,c,b,a]$ & $[abc,ab,a,1]$ ，首先用acc与scratch中的当前元素做乘积，然后将acc与inputs中的当前元素相乘来更新acc的值。下面我们将每一步的迭代结果显示如下：

i=0 ：
- Inputs: $[d,c,b,a]$ -----> $inputs[0]=acc*scratch[0]=(abcd)^{-1}*abc=d^{-1}$ ---> inputs: $[d^{-1},c,b,a]$
- Scratch: $[abc,ab,a,1]$
- $acc=(abcd)^{-1}$ ----> $acc=acc*inputs[0]$ -----> $acc=(abc)^{-1}$
i=1 ：
- Inputs: $[d^{-1},c,b,a]$ -----> $inputs[1]=acc*scratch[1]=(abc)^{-1}*ab=c^{-1}$ ---> inputs: $[d^{-1},c^{-1},b,a]$
- Scratch: $[abc,ab,a,1]$
- $acc=(abc)^{-1}$ ----> $acc=acc*inputs[1]$ -----> $acc=(ab)^{-1}$
i=2 ：
- Inputs: $[d^{-1},c^{-1},b,a]$ -----> $inputs[2]=acc*scratch[2]=(ab)^{-1}*a=b^{-1}$ ---> inputs: $[d^{-1},c^{-1},b^{-1},a]$
- Scratch: $[abc,ab,a,1]$
- $acc=(ab)^{-1}$ ----> $acc=acc*inputs[2]$ -----> $acc=(a)^{-1}$
i=3：
- Inputs: $[d^{-1},c^{-1},b^{-1},a]$ -----> $inputs[3]=acc*scratch[3]=(a)^{-1}*1=a^{-1}$ ---> inputs: $[d^{-1},c^{-1},b^{-1},a^{-1}]$
- Scratch: $[abc,ab,a,1]$
- $acc=(a)^{-1}$ ----> $acc=acc*inputs[3]$ -----> $acc=1$

可以看到，该算法总共只进行了一次求逆运算和3n（n为需要求逆的scalar数组长度）乘法运算。其中注释中标注"---1----"的位置需要n次乘法，标注"---2---"的位置需要2n次乘法。而之前对数组元素单独求逆则需要n次求逆运算。

批量求逆运算可以用在批量解码Ed25519点时的求逆运算：由于Ed25519椭圆曲线群上点的256比特编码只包含了255比特的y坐标和1比特x的正负性，因此在由y坐标计算x坐标时会涉及到求逆运算。这是如果批量地对点进行解码，就可以使用该算法来提高效率。另外，由于ECDSA签名机制在验证时需要求逆运算，也可以利用该算法在批量验证签名时加速执行效率。

Effiency

利用curve25519-dalek库中自带的benchmark工具，对比这两种方法对于批量求逆的效率如下：

#scalar	BatchInvert	Total/us	Average/us
1	False	11.234	11.234
1	True	11.865	11.865
2	True	12.269	6.1345
4	True	12.676	3.169
8	True	13.445	1.6806
16	True	15.484	0.96775

可以看到，BatchInvert在计算的scalar长度大于1时相比单个invert方法有较大的优势，并且随着n=#scalar的增加，效率优势越来越大。这是因为1次求逆运算的成本被均摊到n个scalar上面，最终随着n的增大，对于批量求逆运算，单个scalar求逆的计算复杂度会逐渐逼近3次乘法运算。

*本文由CoinEx Chain开发团队撰写。CoinEx Chain是全球首条基于Tendermint共识协议和Cosmos SDK开发的DEX专用公链，借助IBC来实现DEX公链、智能合约链、隐私链三条链合一的方式去解决可扩展性（Scalability）、去中心化（Decentralization）、安全性（security）区块链不可能三角的问题，能够高性能的支持数字资产的交易以及基于智能合约的Defi应用。