我们在上一节通过定义模块度来评价社区划分的好坏。那么剩下的就是围绕着优化模块度来进行优化即可。Fast Unfolding算法就是一种基于模块度的社区发现算法。

Fast Unfolding算法

简介

Fast Unfolding算法是一种基于模块度的社区发现算法。其基本思想是网络中节点尝试遍历所有邻居的社区标签，并选择最大化模块度增量的社区标签。在最大化模块度之后，每个社区看成一个新的节点，重复直到模块度不再增大。原始论文可以参考： Fast unfolding of communities in large networks。

算法详述

模块度优化阶段：每个节点将自己作为自己社区标签。每个节点遍历自己的所有邻居节点，尝试将自己的社区标签更新成邻居节点的社区标签，选择模块度增量最大(贪婪思想)的社区标签，直到所有节点都不能通过改变社区标签来增加模块度。
网络凝聚阶段：每个社区合并为一个新的超级节点，超级节点的边权重为原始社区内所有节点的边权重之和，形成一个新的网络。
上述两个阶段迭代进行，直到模块度不再增大。
下图很好的描述了这两个阶段。第一次迭代，经过模块度优化阶段，算法将16个节点划分成4个社区；在网络凝聚阶段，4个社区被凝聚成4个超级节点，并重新更新了边权重。之后就进入第二次迭代中。

Fast Unfolding算法通过引入分步迭代的思想(类比EM算法)，避免陷入贪婪思想导致的局部最优。

优缺点

优点

1、时间复杂度低，适合大规模的网络。
2、社区划分结果稳定，有具体指标能够评估社区划分好坏。
3、消除了模块化分辨率的限制。由于模块度是全局指标，最大化的时候很难发现小型的社区，很容易将小型社区合并。而算法第一次迭代是以单个节点作为社区的粒度，规避了这种问题。
4、天然自带层次化的社区划分结果，每一次迭代的社区划分结果都可以保留下来，作为社区划分的中间结果，可供选择。

缺点

1、社区过大，不能及时收敛。如果我们将模块度类比为损失函数的话，Fast Unfolding在模块度优化阶段采用的贪婪思想很容易将整个社区划分“过拟合”。因为Fast Unfolding是针对点遍历，很容易将一些外围的点加入到原本紧凑的社区中，导致一些错误的合并。这种划分有时候在局部视角是优的，但是全局视角下会变成劣的。

改进方向

针对算法的缺点，我们可以采用“early stop”的思想，设置模块度增量的阈值 $Q_t$ 。
在模块度优化阶段中，我们可以设定一个增量阈值 $Q_{t1}$ 。遍历每个节点的时候，如果模块度增量 $\Delta Q_{t} < Q_{t1}$ ，则不进行社区标签的更新。
在模块度优化阶段后，我们可以设定一个增量阈值 $Q_{t2}$ 。当得到社区划分结果后，如果模块度增量 $\Delta Q_{t} < Q_{t2}$ ，则不进行网络凝聚阶段，直接完成算法。

模块度增量

算法中涉及到一个非常重要的指标，就是模块度增量。因为算法要遍历每个节点，并尝试加入每个邻居节点的社区来计算模块度增量，如果模块度增量计算困难，将会对整个算法的时间复杂度带来极大的影响。
幸运的是，我们根据上一节模块度的认知，发现模块度其实是各个社区的累加，这就意味着当一个节点 $i$ ，加入它的邻居节点 $j$ 的社区 $c_j$ 时，整个模块度有变化的只有涉及节点 $i$ , $j$ 及其社区的计算，其他都是不变的。所以当节点 $i$ 加入节点 $j$ 的社区时，我们可以计算出模块度的增量 $\Delta Q$ 。
当节点 $i$ 加入节点 $j$ 的社区 $c_j$ 后，我们可以计算模块度：

\frac{\sum in + k_{i,in}}{2m} - (\frac{\sum tot + k_i}{2m})^2

其中 $in$ 就是社区 $c_j$ ， $\sum in$ 是社区 $c_j$ 内的度数， $k_{i,in}$ 是节点 $i$ 到社区 $c_j$ 的所有节点的边度数的两倍(因为节点 $i$ 已经并入了社区 $c_j$ ，按照模块度是计算方法，社区的边都要算两遍，因为节点可以作为边的终点，也可以作为边的起点)。 $\sum tot$ 表示的就是社区内节点在随机网络下的度数，由于节点 $i$ 也并入了社区，所以 $\sum tot_{new} = \sum tot + k_i$ 。
当节点 $i$ 尚未加入节点 $j$ 的社区 $c_j$ 前，模块度就是这两部分的累加：

[\frac{\sum in}{2m} - (\frac{\sum tot}{2m})^2] + [0-(\frac{k_i}{2m})^2]

所以我们可以计算每次启发式加入社区时应该计算的模块度增量 $\Delta Q$ ：

[\frac{\sum in + k_{i,in}}{2m} - (\frac{\sum tot + k_i}{2m})^2] - [\frac{\sum in}{2m} - (\frac{\sum tot}{2m})^2-(\frac{k_i}{2m})^2]

详细代码

代码分了两个文件，第一个文件是构造算法类，第二个文件负责输入输出，通过从文件里读入图的数据，实现算法。
值得注意的是，此文件的输入相比之前多了一行，原始三列仍是采用的是边表的数据格式，分别是node_in、node_out、edge_weight，第四行是网络的标识，也就是此文件支持多个网络。如果你有多种网络的混合，第四列做好标识，相同标识的会看成一个网络。如果你只有一个网络，可以第四列全部设成一个值即可，或者自行修改代码。

fast_unfloding.py

import networkx as nx

from itertools import permutations
from itertools import combinations
from collections import defaultdict


class Louvain(object):
    def __init__(self):
        self.MIN_VALUE = 0.01  #early_stop，Q_t_2
        self.node_weights = {}    #节点权重

    # 函数入口
    def getBestPartition(self, graph, param=1.):
        i = 1
        partition_list = []
        node2com, edge_weights = self._setNode2Com(graph)    #初始化点和边

        # 得到节点->社区的关系
        node2com = self._runFirstPhase(node2com, edge_weights, param)
        best_modularity = self.computeModularity(node2com, edge_weights, param)

        # 第一次节点->社区的关系要存储下来，之后就是社区->社区的关系，需要这份数据去映射
        partition = node2com.copy()
        partition_list.append(partition.copy())
        
        # 阶段二会根据阶段一的结果完成社区->大节点的操作，此时的new_node2com已经是社区作为节点，将自己更新为自己的社区标签（类似初始化）
        new_node2com, new_edge_weights = self._runSecondPhase(node2com, edge_weights)

        while True:
            print ("----------------------------")
            print ("第",i,"次迭代，网络模块度为",best_modularity)
            i = i + 1
            new_node2com = self._runFirstPhase(new_node2com, new_edge_weights, param)
            modularity = self.computeModularity(new_node2com, new_edge_weights, param)
            # print (new_node2com)
            # 如果一轮迭代后整体网络的模块度增益过小，则跳出
            if abs(best_modularity - modularity) < self.MIN_VALUE:
                break
            best_modularity = modularity
            # 此时需要对初始的社区划分进行社区更新。
            partition = self._updatePartition(new_node2com, partition)
            partition_list.append(partition.copy())
            #print (partition)
            _new_node2com, _new_edge_weights = self._runSecondPhase(new_node2com, new_edge_weights)
            new_node2com = _new_node2com
            new_edge_weights = _new_edge_weights
        print ("最好的模块度：", best_modularity)
        print ("----------------------------")
        return partition_list
   
    """
    初始化
    node2com是一个dict，key是node编号，value是node所属的社区，先初始化为自己的index
    edge_weights是一个dict的dict，key是node，value是dict，key是另一个node，value是edge_weight
    edge_weights 形式：{'a': defaultdict(<class 'float'>, {'c': 1.0, 'b': 1.0})}
    """
    def _setNode2Com(self,graph):
        node2com = {}
        edge_weights = defaultdict(lambda: defaultdict(float))
        for idx,node in enumerate(graph.nodes()):
            node2com[node] = idx    #给每一个节点初始化赋值一个团id
            for edge in graph[node].items():
                edge_weights[node][edge[0]] = edge[1]["weight"]
        #print (node2com, edge_weights)
        return node2com,edge_weights
    
    """
    第一阶段：遍历所有的节点，尝试将节点并入邻居节点的社区，选择模块度增益最大的社区，直到所有节点都不能通过改变社区来增加模块度。
    """
    def _runFirstPhase(self, node2com, edge_weights, param):
        """
        双重循环，遍历所有的edge_wieght，等价
        for start in edge_weights.keys() ：
            for end, weight in edge_weights[start].items() ：
                weight
        用以计算Tr(e),就是∑e_ii的和
        """
        all_edge_weights = sum(
            [weight for start in edge_weights.keys() for end, weight in edge_weights[start].items()]) / 2
        """
        计算每个节点的a_i
        """
        self.node_weights = self.updateNodeWeights(edge_weights) #输出一个字典，每个node对应node上边的权重和
        status = True
        while status:
            statuses = []
            for node in node2com.keys():   # 逐一选择节点和周边连接的节点进行比较
                statuses = []
                com_id = node2com[node]    # 获取节点对应的社团编号
                neigh_nodes = [edge[0] for edge in self.getNeighborNodes(node, edge_weights)] #获取连接的所有边节点

                max_delta = 0            # early_stop，用于计算比对，#Q_t_1
                max_com_id = com_id         # 默认当前社团id为最大社团id
                communities = {}
                """
                遍历每个节点的邻居节点，
                """
                for neigh_node in neigh_nodes:
                    node2com_copy = node2com.copy()
                    if node2com_copy[neigh_node] in communities:
                        continue
                    communities[node2com_copy[neigh_node]] = 1
                    node2com_copy[node] = node2com_copy[neigh_node] # 将节点分配到邻居节点的社区中，也就是将邻居节点的社区更新给该节点

                    """
                    计算节点i加入社区前后的模块度之差，选择能最大化的模块度增益的社区
                    """
                    delta_q = 2 * self.getNodeWeightInCluster(node, node2com_copy, edge_weights) - (self.getTotWeight(
                        node, node2com_copy, edge_weights) * self.node_weights[node] / all_edge_weights) * param
                    
                    """
                    设置增益阈值，防止由于贪婪思想导致没有及时收敛
                    """
                    if delta_q > max_delta:
                        max_delta = delta_q                     # max_delta 选择最大的增益的node
                        max_com_id = node2com_copy[neigh_node]  # 对应 max_com_id 选择最大的增益的临接node的id

                node2com[node] = max_com_id
                statuses.append(com_id != max_com_id)
            
            # 直到所有节点都不能通过改变社区来增加模块度，完成第一阶段
            if sum(statuses) == 0:
                break

        return node2com
    
    """
    计算每个节点的度（无权图），将所有边权重求和。
    node_weights是一个dict, key是node，value是节点的度node_weight。
    """
    def updateNodeWeights(self, edge_weights):
        node_weights = defaultdict(float)
        for node in edge_weights.keys():
            node_weights[node] = sum([weight for weight in edge_weights[node].values()])
        return node_weights
        
    """
    dict.items(),返回的是(key,value)组成的tuple的list，取[0]就是key，也就是邻居节点
    """
    def getNeighborNodes(self, node, edge_weights):
        if node not in edge_weights:
            return 0
        #print (edge_weights[node].items())
        return edge_weights[node].items()

    """
    节点与社区内其他节点的边权重之和
    """
    def getNodeWeightInCluster(self, node, node2com, edge_weights):
        neigh_nodes = self.getNeighborNodes(node, edge_weights)
        node_com = node2com[node]
        weights = 0.
        for neigh_node in neigh_nodes:
            if node_com == node2com[neigh_node[0]]:
                weights += neigh_node[1]
        return weights
    
    """
    社区内所有节点的度数之和
    """
    def getTotWeight(self, node, node2com, edge_weights):
        nodes = [n for n, com_id in node2com.items() if com_id == node2com[node] and node != n]

        weight = 0.
        for n in nodes:
            weight += sum(list(edge_weights[n].values()))
        return weight 
    
    """
    计算网络整体的模块度
    """
    def computeModularity(self, node2com, edge_weights, param):
        q = 0
        all_edge_weights = sum(
            [weight for start in edge_weights.keys() for end, weight in edge_weights[start].items()]) / 2

        com2node = defaultdict(list)
        for node, com_id in node2com.items():
            com2node[com_id].append(node)

        for com_id, nodes in com2node.items():
            node_combinations = list(combinations(nodes, 2)) + [(node, node) for node in nodes]
            cluster_weight = sum([edge_weights[node_pair[0]][node_pair[1]] for node_pair in node_combinations])
            tot = self.getDegreeOfCluster(nodes, node2com, edge_weights)
            q += (cluster_weight / (2 * all_edge_weights)) - param * ((tot / (2 * all_edge_weights)) ** 2)
        return q 
    

    """
    将社区com_id看做node，rebuild成新网络重新进行社区划分，同时社区间的边权重更新为新节点间的边权重
    举个例子，节点1,2划分a，3,4划分为b。那么新的网络就是2个点a,b。边a->b的权重就是(1->3)+(1->4)+(2->3)+(2->4)
    """
    def _runSecondPhase(self, node2com, edge_weights):
        com2node = defaultdict(list)

        new_node2com = {}
        new_edge_weights = defaultdict(lambda: defaultdict(float))

        for node, com_id in node2com.items():
            com2node[com_id].append(node)  #添加同一一个社团id对应的node，用来计算
            if com_id not in new_node2com:
                new_node2com[com_id] = com_id

        nodes = list(node2com.keys())
        node_pairs = list(permutations(nodes, 2)) + [(node, node) for node in nodes]
        #print ("node_pairs!", node_pairs)
        for edge in node_pairs:
            new_edge_weights[new_node2com[node2com[edge[0]]]][new_node2com[node2com[edge[1]]]] += edge_weights[edge[0]][
                edge[1]]
        return new_node2com, new_edge_weights


    def getDegreeOfCluster(self, nodes, node2com, edge_weights):
        weight = sum([sum(list(edge_weights[n].values())) for n in nodes])
        return weight

    def _updatePartition(self, new_node2com, partition):
        reverse_partition = defaultdict(list)
        for node, com_id in partition.items():
            reverse_partition[com_id].append(node)

        for old_com_id, new_com_id in new_node2com.items():
            for old_com in reverse_partition[old_com_id]:
                partition[old_com] = new_com_id
        return partition 

    @classmethod
    def convertIGraphToNxGraph(cls, igraph):
        node_names = igraph.vs["name"]
        edge_list = igraph.get_edgelist()
        weight_list = igraph.es["weight"]
        node_dict = defaultdict(str)

        for idx, node in enumerate(igraph.vs):
            node_dict[node.index] = node_names[idx]

        convert_list = []
        for idx in range(len(edge_list)):
            edge = edge_list[idx]
            new_edge = (node_dict[edge[0]], node_dict[edge[1]], weight_list[idx])
            convert_list.append(new_edge)

        convert_graph = nx.Graph()
        convert_graph.add_weighted_edges_from(convert_list)
        return convert_graph

test.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Apr  3 14:38:58 2020

@author: xxx
"""

import networkx as nx
from  fast_unfolding import *

from collections import defaultdict

def makeSampleGraph(file_path):
    """
    :return:    获取一个图
    """
    g_dict = {}
    with open(file_path) as fp:
        for line in fp.readlines():
            n1,n2,weight,cate = line.strip().split("\t")
            # if cate == "纪录片":
            g_dict.setdefault(cate, nx.Graph())
            g_dict[cate].add_edge(n1, n2, weight = float(weight))

    return g_dict

if __name__ == "__main__":
    tag_net = "/Users/xxx/Desktop/TagNet/tag_net.txt"
    sample_graph_dict = makeSampleGraph(tag_net)
    #先清空
    filename = "/Users/xxx/Desktop/TagNet/tag_result.txt"
    file_object = open(filename, 'w')
    file_object.close()
    for cate, sample_graph in sample_graph_dict.items():
        #print(sample_graph.nodes,sample_graph.edges)
        print ("----------------------------")
        print("“",cate,"”类目的网络开始执行")
        
        louvain = Louvain()
        partition_list = louvain.getBestPartition(sample_graph)
        p = defaultdict(list)
        i = 1
        for partition in partition_list:
            for node, com_id in partition.items():
                p[str(i) + "_" +str(com_id)].append(node)
            i = i + 1
            pre_partition = partition

        #写入文件
        with open(filename, 'a') as file_object:
            for com, nodes in p.items():
                file_object.write(str(com) + "\t" + str(nodes) + "\t" + cate + "\n")  
                #print(com, nodes, cate)
        file_object.close()

网络算法系列之社区发现（三）：Fast Unfolding算法