Original paper: arxiv.org/pdf/1610.05…
FEDERATED LEARNING: STRATEGIES FOR IMPROVING COMMUNICATION EFFICIENCY
ABSTRACT
Federated Learning is a machine learning setting where the goal is to train a high-quality centralized model while training data remains distributed over a large number of clients each with unreliable and relatively slow network connections. We consider learning algorithms for this setting where on each round, each client independently computes an update to the current model based on its local data, and communicates this update to a central server, where the client-side updates are aggregated to compute a new global model. The typical clients in this setting are mobile phones, and communication efficiency is of the utmost importance. In this paper, we propose two ways to reduce the uplink communication costs: structured updates, where we directly learn an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, where we learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling before sending it to the server. Experiments on both convolutional and recurrent networks show that the proposed methods can reduce the communication cost by two orders of magnitude.
Federated Learning is a machine-learning setting for training a high-quality centralized model from data distributed over many clients, but it has to cope with unreliable and relatively slow network connections. On each round, every client independently computes an update to the current model based on its local data and sends it to a central server, which aggregates the client updates into a new global model. The typical clients are mobile phones, so communication efficiency matters a great deal. The paper proposes two ways to reduce the uplink cost:
- Structured updates: learn the update directly in a restricted space parametrized by a smaller number of variables (e.g. low-rank or a random mask).
- Sketched updates: learn a full model update, then compress it (via quantization, random rotations, and subsampling) before sending it to the server.
INTRODUCTION
As datasets grow larger and models more complex, training machine learning models increasingly requires distributing the optimization of model parameters over multiple machines. Existing machine learning algorithms are designed for highly controlled environments (such as data centers) where the data is distributed among machines in a balanced and i.i.d. fashion, and high-throughput networks are available.
As datasets grow larger and models more complex, training increasingly requires distributing the optimization across many machines. Existing machine-learning algorithms were designed for highly controlled environments (such as data centers), where data is balanced and i.i.d. across machines and high-throughput networks are available.
Recently, Federated Learning (and related decentralized approaches) (McMahan & Ramage, 2017; Konečný et al., 2016; McMahan et al., 2017; Shokri & Shmatikov, 2015) have been proposed as an alternative setting: a shared global model is trained under the coordination of a central server, from a federation of participating devices. The participating devices (clients) are typically large in number and have slow or unstable internet connections. A principal motivating example for Federated Learning arises when the training data comes from users' interaction with mobile applications. Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. The training data is kept locally on users' mobile devices, and the devices are used as nodes performing computation on their local data in order to update a global model. This goes beyond the use of local models that make predictions on mobile devices, by bringing model training to the device as well. The above framework differs from conventional distributed machine learning (Reddi et al., 2016; Ma et al., 2017; Shamir et al., 2014; Zhang & Lin, 2015; Dean et al., 2012; Chilimbi et al., 2014) due to the very large number of clients, highly unbalanced and non-i.i.d. data available on each client, and relatively poor network connections. In this work, our focus is on the last constraint, since these unreliable and asymmetric connections pose a particular challenge to practical Federated Learning.
More background: regulations and privacy concerns keep vendors from simply uploading user data to the cloud for training, so the training has to happen on the device itself; this is the need Federated Learning answers. Federated Learning runs locally on users' mobile devices: each device computes on its own local data and acts as a node that contributes to updating a global model, while the central server pushes the improved model back down to the nodes. This is also what separates Federated Learning from conventional distributed learning:
- a very large number of clients, which differ widely from one another
- highly unbalanced amounts of data across clients
- data on each client that is not i.i.d.
- unreliable, slow network connections
This paper focuses on the last constraint, the network: unreliable and asymmetric connections are the main practical challenge for Federated Learning.
For simplicity, we consider synchronized algorithms for Federated Learning where a typical round consists of the following steps: 1. A subset of existing clients is selected, each of which downloads the current model. 2. Each client in the subset computes an updated model based on their local data. 3. The model updates are sent from the selected clients to the server. 4. The server aggregates these models (typically by averaging) to construct an improved global model.
For simplicity, the authors consider synchronized algorithms in which a typical round of Federated Learning consists of:
- A subset of the existing clients is selected; each of them downloads the current model.
- Each selected client computes an updated model based on its local data.
- The model updates are uploaded to the server.
- The server aggregates the updates (typically by averaging) into an improved global model.
A naive implementation of the above framework requires that each client sends a full model (or a full model update) back to the server in each round. For large models, this step is likely to be the bottleneck of Federated Learning due to multiple factors. One factor is the asymmetric property of internet connection speeds: the uplink is typically much slower than downlink. The US average broadband speed was 55.0Mbps download vs. 18.9Mbps upload, with some internet service providers being significantly more asymmetric, e.g., Xfinity at 125Mbps down vs. 15Mbps up (speedtest.net, 2016). Additionally, existing model compression schemes such as Han et al. (2015) can reduce the bandwidth necessary to download the current model, and cryptographic protocols put in place to ensure no individual client's update can be inspected before averaging with hundreds or thousands of other updates (Bonawitz et al., 2017) further increase the amount of bits that need to be uploaded.
The simplest implementation of the framework above requires each client to upload a full model (or a full model update) every round. For large models this step becomes the bottleneck for several reasons. Internet connections are asymmetric: upload is typically much slower than download (US average broadband: 55.0 Mbps down vs. 18.9 Mbps up, and some ISPs are far more skewed). In addition, while existing model compression schemes can reduce the bandwidth needed to download the current model, the cryptographic protocols that prevent any individual client's update from being inspected before it is averaged with many others further increase the number of bits that must be uploaded.
It is therefore important to investigate methods which can reduce the uplink communication cost. In this paper, we study two general approaches:
- Structured updates, where we directly learn an update from a restricted space that can be parametrized using a smaller number of variables.
- Sketched updates, where we learn a full model update, then compress it before sending to the server. These approaches, explained in detail in Sections 2 and 3, can be combined, e.g., first learning a structured update and sketching it; we do not experiment with this combination in this work though.
This paper reduces the uplink cost with two approaches:
- Structured updates: the update is learned directly in a restricted space described by fewer parameters, and only those parameters are uploaded.
- Sketched updates: a full model update is learned, then compressed before upload.
Sections 2 and 3 describe the two approaches in detail. They can be combined (e.g. learn a structured update and then sketch it), but the authors did not experiment with that combination.
In the following, we formally describe the problem. The goal of Federated Learning is to learn a model with parameters embodied in a real matrix $W \in \mathbb{R}^{d_1 \times d_2}$ from data stored across a large number of clients. (For the sake of simplicity, we discuss only the case of a single matrix, since everything carries over to the setting with multiple matrices, for instance corresponding to individual layers in a deep neural network.) We first provide a communication-naive version of Federated Learning. In round $t \geq 0$, the server distributes the current model $W_t$ to a subset $S_t$ of $n_t$ clients. These clients independently update the model based on their local data. Let the updated local models be $W_t^1, W_t^2, \ldots, W_t^{n_t}$, so the update of client $i$ can be written as $H_t^i := W_t^i - W_t$, for $i \in S_t$. These updates could be a single gradient computed on the client, but typically will be the result of a more complex calculation, for example, multiple steps of stochastic gradient descent (SGD) taken on the client's local dataset. In any case, each selected client then sends the update back to the server, where the global update is computed by aggregating all the client-side updates (a weighted sum might be used in place of the plain average, depending on the implementation):

$$W_{t+1} = W_t + \eta_t H_t, \qquad H_t := \frac{1}{n_t} \sum_{i \in S_t} H_t^i.$$

The server chooses the learning rate $\eta_t$; for simplicity, we choose $\eta_t = 1$. In Section 4, we describe Federated Learning for neural networks, where we use a separate 2D matrix $W$ to represent the parameters of each layer. We suppose that $W$ gets right-multiplied, i.e., $d_1$ and $d_2$ represent the output and input dimensions respectively. Note that the parameters of a fully connected layer are naturally represented as 2D matrices. However, the kernel of a convolutional layer is a 4D tensor of shape #input × width × height × #output. In such a case, $W$ is reshaped from the kernel to the shape (#input × width × height) × #output.
This formalizes Federated Learning: the server distributes the model $W_t$, each selected client runs local training (e.g. several SGD steps) and produces an update $H_t^i$, uploads it, and the server averages the updates into the next model $W_{t+1} = W_t + \eta_t H_t$; $\eta_t$ is a server-side learning rate (set to 1 here).
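To make the protocol concrete, here is a minimal NumPy sketch of one communication-naive round (this is not the paper's code; `local_train` is a placeholder for whatever local computation each client runs, e.g. a few SGD steps):

```python
import numpy as np

def naive_federated_round(W_t, client_datasets, local_train, eta_t=1.0):
    """One communication-naive round: every selected client uploads its full update H_t^i."""
    updates = []
    for data in client_datasets:          # the selected subset S_t of clients
        W_i = local_train(W_t, data)      # client i computes its updated local model W_t^i
        updates.append(W_i - W_t)         # H_t^i := W_t^i - W_t (a full d1 x d2 matrix goes uplink)
    H_t = np.mean(updates, axis=0)        # server aggregates: H_t = (1/n_t) * sum_i H_t^i
    return W_t + eta_t * H_t              # W_{t+1} = W_t + eta_t * H_t
```

The two families of techniques below reduce the size of what each client sends in place of the full $H_t^i$.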
Outline and summary. The goal of increasing communication efficiency of Federated Learning is to reduce the cost of sending $H_t^i$ to the server, while learning from data stored across a large number of devices with limited internet connection and availability for computation. We propose two general classes of approaches, structured updates and sketched updates. In the Experiments section, we evaluate the effect of these methods in training deep neural networks. In simulated experiments on CIFAR data, we investigate the effect of these techniques on the convergence of the Federated Averaging algorithm (McMahan et al., 2017). With only a slight degradation in convergence speed, we are able to reduce the total amount of data communicated by two orders of magnitude. This lets us obtain a good prediction accuracy with an all-convolutional model, while in total communicating less information than the size of the original CIFAR data. In a larger realistic experiment on user-partitioned text data, we show that we are able to efficiently train a recurrent neural network for next word prediction, before even using the data of every user once. Finally, we note that we achieve the best results including the preprocessing of updates with structured random rotations. Practical utility of this step is unique to our setting, as the cost of applying the random rotations would be dominant in typical parallel implementations of SGD, but is negligible compared to the local training in Federated Learning.
In short: the experimental results are solid. The methods reduce total communication by roughly two orders of magnitude at only a slight cost in convergence speed, and the structured random rotations contribute the best results at negligible cost relative to local training.
STRUCTURED UPDATE
The first type of communication efficient update restricts the updates $H_t^i$ to have a pre-specified structure. Two types of structures are considered in the paper: low rank and random mask. It is important to stress that we train directly the updates of this structure, as opposed to approximating/sketching general updates with an object of a specific structure, which is discussed in Section 3.
Low rank. We enforce every update to the local model, $H_t^i \in \mathbb{R}^{d_1 \times d_2}$, to be a low-rank matrix of rank at most $k$, where $k$ is a fixed number. In order to do so, we express $H_t^i$ as the product of two matrices: $H_t^i = A_t^i B_t^i$, where $A_t^i \in \mathbb{R}^{d_1 \times k}$ and $B_t^i \in \mathbb{R}^{k \times d_2}$. In subsequent computation, we generate $A_t^i$ randomly and consider it a constant during the local training procedure, optimizing only $B_t^i$. Note that in a practical implementation, $A_t^i$ can in this case be compressed in the form of a random seed, and the client only needs to send the trained $B_t^i$ to the server. Such an approach immediately saves a factor of $d_1/k$ in communication. We generate the matrix $A_t^i$ afresh in each round and for each client independently.
Low-rank factorization: instead of uploading the full $d_1 \times d_2$ update, the update is written as the product of two matrices, $A_t^i$ ($d_1 \times k$) and $B_t^i$ ($k \times d_2$), with $k$ a fixed rank. $A_t^i$ is generated randomly and treated as a constant during local training, so only $B_t^i$ has to be optimized; $A_t^i$ can therefore be compressed to a random seed, which cuts the upload by a factor of $d_1/k$. A fresh seed (and thus a fresh $A_t^i$) is generated independently for each client in each round.
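A rough sketch of how the seed-based low-rank update could look in NumPy (my own illustration, not the paper's code): in the real method $B_t^i$ is trained directly with SGD on the client's local data, whereas here it is fit against a hypothetical `target_H` just to keep the example self-contained.

```python
import numpy as np

def make_A(seed, d1, k):
    """Client and server regenerate the same A_t^i from a shared seed, so only the seed travels."""
    return np.random.default_rng(seed).standard_normal((d1, k))

def client_lowrank_update(W_t, seed, k, target_H):
    """Client side: A is fixed, only B (k x d2 values) is optimized and uploaded."""
    d1, d2 = W_t.shape
    A = make_A(seed, d1, k)
    B, *_ = np.linalg.lstsq(A, target_H, rcond=None)   # stand-in for training B on local data
    return B                                           # uplink cost: k*d2 instead of d1*d2

def server_decode(W_t, seed, k, B):
    """Server side: rebuild A from the seed and reconstruct H_t^i = A_t^i B_t^i."""
    A = make_A(seed, W_t.shape[0], k)
    return A @ B
```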
We also tried fixing $B_t^i$ and training $A_t^i$, as well as training both $A_t^i$ and $B_t^i$; neither performed as well. Our approach seems to perform as well as the best techniques considered in Denil et al. (2013), without the need for any hand-crafted features. An intuitive explanation for this observation is the following. We can interpret $B_t^i$ as a projection matrix and $A_t^i$ as a reconstruction matrix. Fixing $A_t^i$ and optimizing for $B_t^i$ is akin to asking: "Given a random reconstruction, what is the projection that will recover the most information?" In this case, if the reconstruction is full-rank, a projection that recovers the space spanned by the top $k$ eigenvectors exists. However, if we randomly fix the projection and search for a reconstruction, we can be unlucky and the important subspaces might have been projected out, meaning that there is no reconstruction that will do as well as possible, or it will be very hard to find.
The authors also tried fixing $B_t^i$ and training $A_t^i$, as well as training both; neither worked as well. An intuitive explanation: think of $B_t^i$ as a projection matrix and $A_t^i$ as a reconstruction matrix. Fixing $A$ and optimizing $B$ amounts to asking which projection recovers the most information given a random reconstruction; if the reconstruction is full rank, a projection that recovers the space spanned by the top $k$ eigenvectors exists. But if we randomly fix the projection and search for a reconstruction, we may be unlucky: important subspaces may already have been projected away, so a reconstruction that does as well may not exist, or may be very hard to find.
Random mask. We restrict the update $H_t^i$ to be a sparse matrix, following a pre-defined random sparsity pattern (i.e., a random mask). The pattern is generated afresh in each round and for each client independently. Similar to the low-rank approach, the sparse pattern can be fully specified by a random seed, and therefore it is only required to send the values of the non-zero entries of $H_t^i$, along with the seed.
Random mask: the update $H_t^i$ is restricted to be a sparse matrix following a pre-defined random sparsity pattern (a random mask), generated afresh for each client in each round. As with the low-rank approach, the pattern can be fully specified by a random seed, so the client only needs to send the values of the non-zero entries of $H_t^i$ together with the seed.
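A small sketch of the seed-based random mask (again my own illustration; in the paper the sparse update is trained directly under the mask, while here a dense `H` is simply masked, and `p` is a hypothetical fraction of entries kept):

```python
import numpy as np

def random_mask(seed, shape, p):
    """Reproducible boolean mask keeping roughly a fraction p of the entries."""
    return np.random.default_rng(seed).random(shape) < p

def client_masked_values(H, seed, p):
    """Client uploads only the values at the masked positions, plus the seed."""
    return H[random_mask(seed, H.shape, p)]

def server_rebuild(values, seed, shape, p):
    """Server regenerates the same mask from the seed and scatters the values back."""
    H_hat = np.zeros(shape)
    H_hat[random_mask(seed, shape, p)] = values
    return H_hat
```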
SKETCHED UPDATE
The second type of updates addressing communication cost, which we call sketched, first computes the full $H_t^i$ during local training without any constraints, and then approximates, or encodes, the update in a (lossy) compressed form before sending to the server. The server decodes the updates before doing the aggregation. Such sketching methods have application in many domains (Woodruff, 2014). We experiment with multiple tools in order to perform the sketching, which are mutually compatible and can be used jointly:
Sketched updates place no constraint on the local update itself: the client computes the full update, then approximates or encodes it in a (possibly lossy) compressed form before uploading, and the server decodes the updates before aggregating them.
Subsampling. Instead of sending $H_t^i$, each client only communicates a matrix $\hat{H}_t^i$ formed from a random subset of the (scaled) values of $H_t^i$. The server then averages the subsampled updates, producing the global update $\hat{H}_t$. This can be done so that the average of the sampled updates is an unbiased estimator of the true average: $\mathbb{E}[\hat{H}_t] = H_t$. Similar to the random mask structured update, the mask is randomized independently for each client in each round, and the mask itself can be stored as a synchronized seed.
Subsampling: each client uploads only a (scaled) random subset of the entries of its update, and the server averages the subsampled updates. This works because, with proper scaling, the average of the sampled updates is an unbiased estimator of the true average: $\mathbb{E}[\hat{H}_t] = H_t$. As with the random mask, the sampling pattern is re-randomized independently for each client in each round and can be stored as a synchronized seed.
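A sketch of subsampling with the scaling that makes it unbiased (my own illustration, assuming each entry is kept independently with probability `q`): scaling the kept entries by `1/q` gives $\mathbb{E}[\hat{H}_t^i] = H_t^i$, so the server-side average stays unbiased.

```python
import numpy as np

def subsample(H, seed, q):
    """Keep each entry with probability q and scale by 1/q so the reconstruction is unbiased."""
    mask = np.random.default_rng(seed).random(H.shape) < q
    return H[mask] / q                      # values sent uplink, together with the seed

def reconstruct(values, seed, shape, q):
    """Server rebuilds H_hat_t^i from the synchronized seed; unsampled entries stay zero."""
    mask = np.random.default_rng(seed).random(shape) < q
    H_hat = np.zeros(shape)
    H_hat[mask] = values
    return H_hat
```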
Probabilistic quantization. Another way of compressing the updates is by quantizing the weights. We first describe the algorithm for quantizing each scalar to one bit. Consider the update $H_t^i$, let $h = (h_1, \ldots, h_{d_1 \times d_2}) = \mathrm{vec}(H_t^i)$, and let $h_{\max} = \max_j(h_j)$, $h_{\min} = \min_j(h_j)$. The compressed update of $h$, denoted by $\tilde{h}$, is generated as follows:

$$\tilde{h}_j = \begin{cases} h_{\max}, & \text{with probability } \dfrac{h_j - h_{\min}}{h_{\max} - h_{\min}}, \\ h_{\min}, & \text{with probability } \dfrac{h_{\max} - h_j}{h_{\max} - h_{\min}}. \end{cases}$$
It is easy to show that $\tilde{h}$ is an unbiased estimator of $h$. This method provides 32× compression compared to a 4-byte float. The error incurred by this compression scheme was analysed, for instance, in Suresh et al. (2017), and it is a special case of the protocol proposed in Konečný & Richtárik (2016).
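To make the 1-bit scheme concrete, a minimal NumPy sketch (my own; a boolean array stands in for the packed bit vector a real implementation would transmit, together with the two floats $h_{\min}$ and $h_{\max}$):

```python
import numpy as np

def quantize_1bit(h, rng=None):
    """Stochastic 1-bit quantization of a flattened update h = vec(H_t^i)."""
    rng = rng or np.random.default_rng()
    h_min, h_max = h.min(), h.max()
    if h_max == h_min:                              # degenerate case: every entry is equal
        return np.zeros(h.shape, dtype=bool), h_min, h_max
    p_max = (h - h_min) / (h_max - h_min)           # P(h_j -> h_max); this makes E[h~_j] = h_j
    return rng.random(h.shape) < p_max, h_min, h_max

def dequantize_1bit(bits, h_min, h_max):
    """Server side: map each bit back to h_min or h_max."""
    return np.where(bits, h_max, h_min)
```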
One can also generalize the above to more than 1 bit per scalar. For $b$-bit quantization, we first equally divide $[h_{\min}, h_{\max}]$ into $2^b$ intervals. Suppose $h_j$ falls in the interval bounded by $h'$ and $h''$. The quantization operates by replacing $h_{\min}$ and $h_{\max}$ in the above equation by $h'$ and $h''$, respectively. The parameter $b$ then allows a simple way of balancing accuracy and communication cost.
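And a sketch of the $b$-bit generalization (again my own illustration; note the interval endpoints give $2^b + 1$ possible values, and the packing of the integer codes into $b$-bit words is omitted):

```python
import numpy as np

def quantize_bbit(h, b, rng=None):
    """Split [h_min, h_max] into 2^b intervals and stochastically round each h_j
    to one of the endpoints h', h'' of its interval, keeping the estimate unbiased."""
    rng = rng or np.random.default_rng()
    h_min, h_max = h.min(), h.max()
    if h_max == h_min:
        return np.zeros(h.shape, dtype=int), h_min, h_max
    step = (h_max - h_min) / 2 ** b          # width of each interval
    pos = (h - h_min) / step                 # position measured in interval widths
    lower = np.floor(pos)                    # index of the lower endpoint h'
    p_up = pos - lower                       # P(round up to h'') = (h_j - h') / (h'' - h')
    codes = (lower + (rng.random(h.shape) < p_up)).astype(int)
    return codes, h_min, h_max               # integer codes are what would be packed and sent

def dequantize_bbit(codes, b, h_min, h_max):
    return h_min + codes * (h_max - h_min) / 2 ** b
```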
Another quantization approach also motivated by reduction of communication while averaging vectors was recently proposed in Alistarh et al. (2016). Incremental, randomized and distributed optimization algorithms can be similarly analysed in a quantized updates setting (Rabbat & Nowak, 2005; Golovin et al., 2013; Gamal & Lai, 2016).
Improving the quantization by structured random rotations. The above 1-bit and multi-bit quantization approaches work best when the scales are approximately equal across different dimensions. For example, when $h_{\max} = 1$, $h_{\min} = -1$, and most of the values are 0, the 1-bit quantization will lead to a large error. We note that applying a random rotation to $h$ before the quantization (multiplying $h$ by a random orthogonal matrix) solves this issue. This claim is theoretically supported in Suresh et al. (2017), which shows that the structured random rotation can reduce the quantization error by a factor of $O(d/\log d)$, where $d$ is the dimension of $h$. We show its practical utility in the next section.
In the decoding phase, the server needs to perform the inverse rotation before aggregating all the updates. Note that in practice, the dimension of $h$ can easily be as high as $d = 10^6$ or more, and it is computationally prohibitive to generate ($O(d^3)$) and apply ($O(d^2)$) a general rotation matrix. Following Suresh et al. (2017), we use a type of structured rotation matrix which is the product of a Walsh-Hadamard matrix and a binary diagonal matrix. This reduces the computational complexity of generating and applying the matrix to $O(d)$ and $O(d \log d)$, which is negligible compared to the local training within Federated Learning.
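A sketch of the structured rotation (my own illustration of the Walsh-Hadamard-times-random-signs construction, assuming the length of $h$ is a power of two; pad with zeros otherwise). The plain Python loop below runs in $O(d \log d)$, matching the stated complexity, though a production implementation would use an optimized transform.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform (orthogonal and self-inverse); len(x) must be 2^m."""
    x = x.astype(float)
    h, d = 1, len(x)
    while h < d:
        for i in range(0, d, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)

def signs_from_seed(seed, d):
    """The binary diagonal matrix D, stored as a vector of +/-1 regenerated from a seed."""
    return np.where(np.random.default_rng(seed).random(d) < 0.5, -1.0, 1.0)

def rotate(h_vec, seed):
    """Client: apply H D to the flattened update before quantization/subsampling."""
    return fwht(h_vec * signs_from_seed(seed, len(h_vec)))

def inverse_rotate(z, seed):
    """Server: undo the rotation (H is self-inverse when normalized, D is its own inverse)."""
    return fwht(z) * signs_from_seed(seed, len(z))
```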
A detailed walkthrough (in Chinese) of this kind of probabilistic quantization is available on Zhihu: zhuanlan.zhihu.com/p/409603274
EXPERIMENTS
I am skipping the experiments section; it mainly reports how well the methods perform on the various datasets.