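This post was first published on CSDN.
See also: 诸神缄默不语's personal CSDN post index, and the collection of cs224w (Machine Learning with Graphs) 2021 winter course study notes.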
@[toc]
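This colab was really hard for me, and I basically copied most of it outright; I understand it only barely. I wrote down whatever I could work out. Comments and corrections are very welcome.
I pushed my finished colab 4 notebook, which includes some personal notes, to GitHub: cs224w-2021-winter-colab/CS224W_Colab_4.ipynb at master · PolarisRisingWar/cs224w-2021-winter-colab
This colab mainly covers handling heterogeneous graphs (graphs with different types of nodes and edges) and implementing heterogeneous message passing, i.e. passing different kinds of messages between different types of nodes and edges.
It mainly uses the DeepSNAP library to operate on heterogeneous graphs.[^1]
DeepSNAP official documentation: DeepSNAP Documentation — DeepSNAP 0.2.0 documentation
DeepSNAP official GitHub project: snap-stanford/deepsnap: Python library assists deep learning on graphs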
Question 1. Introduction to DeepSNAP Heterogeneous Graphs
Graph attributes needed to represent a heterogeneous graph:
- node_feature: the feature of each node (torch.tensor)
- edge_feature: the feature of each edge (torch.tensor)
- node_label: the label of each node (int)
- node_type: the node type of each node (string)
- edge_type: the edge type of each edge (string)
In the Question 1 part we use the karate club network graph dataset as the example. For an introduction to this dataset, see my earlier note: 图数据集Zachary‘s karate club network详解,包括其在NetworkX、PyG上的获取和应用方式_诸神缄默不语的博客-CSDN博客
First, load the graph and visualize it by class (i.e. by which club each node belongs to):
from pylab import *
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
import matplotlib.pyplot as plt
import copy
G = nx.karate_club_graph()
community_map = {}  # key: node index, value: index of its community (0 or 1)
for node in G.nodes(data=True):
    # node[0] is the index and node[1] is the attribute dict, here e.g. {'club': 'Mr. Hi'}
    # with the default data=False only the index is yielded
    if node[1]["club"] == "Mr. Hi":
        community_map[node[0]] = 0
    else:
        community_map[node[0]] = 1
node_color = []
color_map = {0: 0, 1: 1}
node_color = [color_map[community_map[node]] for node in G.nodes()]
pos = nx.spring_layout(G)  # explained below
plt.figure(figsize=(7, 7))
nx.draw(G, pos=pos, cmap=plt.get_cmap('coolwarm'), node_color=node_color)
show()
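About nx.spring_layout(G): this function positions the nodes, which makes the drawn graph look tidier.
Function documentation: networkx.drawing.layout.spring_layout — NetworkX 2.6.2 documentation
Roughly, it takes the graph (plus optional parameters) and returns a dict keyed by node index whose values are the node coordinates. An example entry: 0: array([ 0.42143337, -0.10723518])
The layout algorithm is the Fruchterman-Reingold force-directed algorithm[^2]. Roughly, it treats edges as springs that pull the nodes they connect together while all nodes repel one another, and simulates the system until it settles into equilibrium.
The return value can be passed as the pos argument of nx.draw(), so that the nodes are drawn at the coordinates in this dict.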
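To make the spring-and-repulsion idea concrete, here is a minimal pure-Python sketch of a Fruchterman-Reingold-style layout. It is not NetworkX's implementation: the function name toy_spring_layout, the starting temperature, and the cooling factor are all assumptions chosen for illustration. It only shows the three steps described above: pairwise repulsion, attraction along edges, and a capped move per iteration.

```python
import math
import random

def toy_spring_layout(nodes, edges, iterations=50, seed=0):
    """Toy Fruchterman-Reingold-style layout; returns {node: (x, y)}."""
    nodes = list(nodes)
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))  # preferred distance between nodes
    temperature = 0.1                # cap on how far a node moves per step
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Every pair of nodes repels with force k^2 / distance.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = k * k / dist
                disp[u][0] += dx / dist * force
                disp[u][1] += dy / dist * force
                disp[v][0] -= dx / dist * force
                disp[v][1] -= dy / dist * force
        # Each edge pulls its endpoints together with force distance^2 / k.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = dist * dist / k
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force
        # Move each node along its net force, limited by the temperature.
        for n in nodes:
            dx, dy = disp[n]
            length = max(math.hypot(dx, dy), 1e-9)
            step = min(length, temperature)
            pos[n][0] += dx / length * step
            pos[n][1] += dy / length * step
        temperature *= 0.95  # cool down so the layout settles
    return {n: (p[0], p[1]) for n, p in pos.items()}

layout = toy_spring_layout([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)])
print(layout)
```

The returned dict has the same shape as the one from nx.spring_layout (node index to coordinates), except that the values here are plain tuples rather than NumPy arrays.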
1.1 Question 1.1: Assigning Node Type and Node Features
Using the dict community_map and the graph G, add the node_type and node_label attributes to G: nodes in the "Mr. Hi" club get node type n0 and node label 0, and nodes in the "Officer" club get node type n1 and node label 1.
Give every node the feature [1, 1, 1, 1, 1].
Reference NetworkX function nx.classes.function.set_node_attributes, documentation: networkx.classes.function.set_node_attributes — NetworkX 2.6.2 documentation
Usage example:
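G_eg = nx.path_graph(3)
bb = nx.betweenness_centrality(G)  # bb is a dict; note that it is computed on the karate graph G, not on G_eg
nx.set_node_attributes(G_eg, bb, "betweenness")  # keys of bb that are not nodes of G_eg are ignored
G_eg.nodes[1]["betweenness"]
0.053936688311688304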
Answer code:
import torch
def assign_node_types(G, community_map):
    """
    Takes a NetworkX graph G and a community map (a dict mapping nodes to 0/1 labels)
    and adds the node_type node attribute to G
    """
    new_cm={}
    for (k,v) in community_map.items():
        if v==0:
            new_cm[k]='n0'
        else:
            new_cm[k]='n1'
    # A more elegant version from the answer I consulted:
    #node_type_map = {0:'n0', 1:'n1'}
    #node_types = {node:node_type_map[community_map[node]] for node in G.nodes()}
    nx.set_node_attributes(G,new_cm,'node_type')
def assign_node_labels(G, community_map):
    """
    Takes a NetworkX graph G and a community map (a dict mapping nodes to 0/1 labels)
    and adds the node_label node attribute to G
    """
    nx.set_node_attributes(G,community_map,'node_label')
def assign_node_features(G):
    """
    Takes a NetworkX graph G
    and adds the node_feature node attribute to G
    """
    feature_vector=[1, 1, 1, 1, 1]
    # a non-dict value is assigned to every node
    nx.set_node_attributes(G,feature_vector,'node_feature')
assign_node_types(G, community_map)
assign_node_labels(G, community_map)
assign_node_features(G)
Code to verify the functions:
for n in G.nodes(data=True):
    print(n)
    break
(0, {'club': 'Mr. Hi', 'node_type': 'n0', 'node_label': 0, 'node_feature': [1, 1, 1, 1, 1]})
1.2 Question 1.2: Assigning Edge Types
Assignment rules:
- Edges within club "Mr. Hi": e0
- Edges within club "Officer": e1
- Edges between clubs: e2
Reference NetworkX function nx.classes.function.set_edge_attributes, documentation: networkx.classes.function.set_edge_attributes — NetworkX 2.6.2 documentation
Answer code:
def assign_edge_types(G, community_map):
    """
    Takes a NetworkX graph G and a community map (a dict mapping nodes to 0/1 labels)
    and adds the edge_type edge attribute to G
    """
    # Note: I think the question meant for community_map to be used here, but using the club attribute should work just as well...
    edge2attr_map={}
    for edge in G.edges():
        if G.nodes[edge[0]]['club']=='Mr. Hi' and G.nodes[edge[1]]['club']=='Mr. Hi':
            edge2attr_map[edge]='e0'
        elif G.nodes[edge[0]]['club']=='Officer' and G.nodes[edge[1]]['club']=='Officer':
            edge2attr_map[edge]='e1'
        else:
            edge2attr_map[edge]='e2'
    nx.set_edge_attributes(G,edge2attr_map,'edge_type')
assign_edge_types(G, community_map)
Code to verify the function:
#PRW
for edge in G.edges(data=True):
    print(edge)
    break
(0, 1, {'edge_type': 'e0'})
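For comparison, here is a minimal sketch of the same rules driven by community_map instead of the club attribute, which is probably what the question intended. The helper name edge_types_from_community_map and the toy data are made up for illustration; the function only builds the edge-to-type dict and does not touch a graph.

```python
def edge_types_from_community_map(edges, community_map):
    """Map each (u, v) edge to 'e0', 'e1' or 'e2' using 0/1 community labels."""
    edge_types = {}
    for u, v in edges:
        if community_map[u] != community_map[v]:
            edge_types[(u, v)] = 'e2'  # edge between the two clubs
        elif community_map[u] == 0:
            edge_types[(u, v)] = 'e0'  # both endpoints in "Mr. Hi"
        else:
            edge_types[(u, v)] = 'e1'  # both endpoints in "Officer"
    return edge_types

# Hypothetical toy data: nodes 0 and 1 in community 0, nodes 2 and 3 in community 1.
toy_map = {0: 0, 1: 0, 2: 1, 3: 1}
toy_edges = [(0, 1), (1, 2), (2, 3)]
print(edge_types_from_community_map(toy_edges, toy_map))
# {(0, 1): 'e0', (1, 2): 'e2', (2, 3): 'e1'}
```

On the real graph, the resulting dict could be passed to nx.set_edge_attributes(G, ..., 'edge_type') with edges taken from G.edges().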
1.3 Visualizing the Heterogeneous Graph with NetworkX
edge_color = {}
for edge in G.edges():
    n1, n2 = edge
    if community_map[n1] == community_map[n2] and community_map[n1] == 0:
        edge_color[edge] = 'blue'
    elif community_map[n1] == community_map[n2] and community_map[n1] == 1:
        edge_color[edge] = 'red'
    else:
        edge_color[edge] = 'green'
G_orig = copy.deepcopy(G)
nx.classes.function.set_edge_attributes(G, edge_color, name='color')
colors = nx.get_edge_attributes(G,'color').values()
labels = nx.get_node_attributes(G, 'node_type')
plt.figure(figsize=(8, 8))
nx.draw(G, pos=pos, cmap=plt.get_cmap('coolwarm'), node_color=node_color, edge_color=colors, labels=labels, font_color='white')
show()
1.4 Converting the NetworkX Heterogeneous Graph to a DeepSNAP Heterogeneous Graph
from deepsnap.hetero_graph import HeteroGraph
hete = HeteroGraph(G_orig)
Note that this step is a bit finicky: if you use G as the NetworkX backend, it raises TypeError: Unknown type color in edge attributes.
I looked at the corresponding source code (deepsnap.hetero_graph — DeepSNAP 0.2.0 documentation) and found the following:
Node attributes of G_orig:
G_orig.nodes(data=True)[0]
Output:
{'club': 'Mr. Hi',
 'node_type': 'n0',
 'node_label': 0,
 'node_feature': [1, 1, 1, 1, 1]}
Edge attributes of G_orig:
for e in G_orig.edges(data=True):
    print(e)
    break
Output:
(0, 1, {'edge_type': 'e0'})
Edge attributes of G:
for e in G.edges(data=True):
    print(e)
    break
Output:
(0, 1, {'edge_type': 'e0', 'color': 'blue'})
The corresponding code in DeepSNAP:
def _get_edge_attributes(self, key: str):
    r"""
    Similar to the `_get_node_attributes`
    """
    attributes = {}
    indices = None
    # TODO: suspect edge_to_tensor_mapping and edge_to_graph_mapping not useful
    if key == "edge_type":
        indices = {}
    for edge_idx, (head, tail, edge_dict) in enumerate(
        self.G.edges(data=True)
    ):
        if key in edge_dict:
            head_type = self.G.nodes[head]["node_type"]
            tail_type = self.G.nodes[tail]["node_type"]
            edge_type = self._get_edge_type(edge_dict)
            message_type = (head_type, edge_type, tail_type)
            if message_type not in attributes:
                attributes[message_type] = []
            attributes[message_type].append(edge_dict[key])
            if indices is not None:
                if message_type not in indices:
                    indices[message_type] = []
                indices[message_type].append(edge_idx)
    if len(attributes) == 0:
        return None
    for message_type, val in attributes.items():
        if torch.is_tensor(attributes[message_type][0]):
            attributes[message_type] = torch.stack(val, dim=0)
        elif isinstance(attributes[message_type][0], float):
            attributes[message_type] = torch.tensor(val, dtype=torch.float)
        elif isinstance(attributes[message_type][0], int):
            attributes[message_type] = torch.tensor(val, dtype=torch.long)
        elif (
            isinstance(attributes[message_type][0], str)
            and key == "edge_type"
        ):
            continue
        else:
            raise TypeError(f"Unknown type {key} in edge attributes.")
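In short, apart from edge_type, no edge attribute may be a str. That is why the color attribute triggers the error.
That naturally raises the question of how the club node attribute gets through. I took a quick look at _get_node_attributes() and found that it simply has no restriction like the one on edge attributes...
I'm not sure whether the author hadn't fully thought this through or whether I'm the one who hasn't figured it out, and for now I can't be bothered to ask. I'll look into it again if I need DeepSNAP in the future.
Anyway, that's the situation, so I'm noting it here.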
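This post sidesteps the problem by deep-copying the graph before adding color. As an alternative, here is a minimal sketch that strips string-valued edge attributes other than edge_type. The helper name drop_string_edge_attrs is hypothetical. It takes (u, v, attr_dict) triples, which is the shape G.edges(data=True) yields; I am assuming those dicts are the graph's live attribute dicts, so the in-place deletion would carry over to the graph.

```python
def drop_string_edge_attrs(edges_with_data, keep=("edge_type",)):
    """Delete str-valued edge attributes in place, except the keys listed in `keep`."""
    for _, _, attrs in edges_with_data:
        # Collect the keys first so the dict is not modified while iterating over it.
        to_drop = [k for k, v in attrs.items() if isinstance(v, str) and k not in keep]
        for key in to_drop:
            del attrs[key]

# Hypothetical toy edge list in the (u, v, attr_dict) shape.
toy_edges = [(0, 1, {"edge_type": "e0", "color": "blue", "weight": 1.0})]
drop_string_edge_attrs(toy_edges)
print(toy_edges)
# [(0, 1, {'edge_type': 'e0', 'weight': 1.0})]
```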
You can print the heterogeneous graph's attributes to take a look:
for hetero_feature in hete:
    print(hetero_feature)
Output omitted
1.5 Question 1.3: How Many Nodes Does Each Node Type Have
The node_type attribute of hete is a dict keyed by node_type value (e.g. n0). If the node type is a str, the value is a list like this: ['n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0']; if it is an int, the value is a Tensor.
def get_nodes_per_type(hete):
    num_nodes_n0=len(hete.node_type['n0'])
    num_nodes_n1=len(hete.node_type['n1'])
    return num_nodes_n0, num_nodes_n1
num_nodes_n0, num_nodes_n1 = get_nodes_per_type(hete)
print("Node type n0 has {} nodes".format(num_nodes_n0))
print("Node type n1 has {} nodes".format(num_nodes_n1))
Output:
Node type n0 has 17 nodes
Node type n1 has 17 nodes
1.6 Question 1.4: How Many Edges Does Each Message Type Have
A message type combines the node types and the edge type: it is the tuple (head node type, edge type, tail node type).
hete.message_types
Output:
[('n0', 'e0', 'n0'), ('n0', 'e2', 'n1'), ('n1', 'e1', 'n1')]
edge_type is a dict keyed by message_type. One entry as an example:
hete.edge_type[('n0', 'e0', 'n0')]
The output is a list whose elements are all 'e0'; omitted here
Answer code:
def get_num_message_edges(hete):
    """
    Returns a list whose elements are tuples (message_type, num_edge)
    """
    message_type_edges = []
    # each value of hete.edge_type is the list of edge types for that message type
    for message_type,num_edge in hete.edge_type.items():
        message_type_edges.append((message_type,len(num_edge)))
    return message_type_edges
message_type_edges = get_num_message_edges(hete)
for (message_type, num_edges) in message_type_edges:
    print("Message type {} has {} edges".format(message_type, num_edges))
Output:
Message type ('n0', 'e0', 'n0') has 35 edges
Message type ('n0', 'e2', 'n1') has 11 edges
Message type ('n1', 'e1', 'n1') has 32 edges
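To show what is being counted without going through DeepSNAP, here is a minimal sketch that builds the (head node type, edge type, tail node type) triple for each edge and tallies the triples. The helper name count_message_types and the toy data are assumptions for illustration only.

```python
from collections import Counter

def count_message_types(node_types, typed_edges):
    """Count edges per (head node type, edge type, tail node type) triple."""
    counts = Counter()
    for head, tail, edge_type in typed_edges:
        counts[(node_types[head], edge_type, node_types[tail])] += 1
    return counts

# Hypothetical toy graph: two n0 nodes, two n1 nodes, four typed edges.
toy_node_types = {0: "n0", 1: "n0", 2: "n1", 3: "n1"}
toy_edges = [(0, 1, "e0"), (1, 2, "e2"), (0, 2, "e2"), (2, 3, "e1")]
for message_type, num_edges in count_message_types(toy_node_types, toy_edges).items():
    print("Message type {} has {} edges".format(message_type, num_edges))
```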
1.7 Question 1.5: Dataset Splitting: How Many Nodes Are in Each Split?
DeepSNAP has a built-in dataset splitting function.
Answer code:
from deepsnap.dataset import GraphDataset
def compute_dataset_split_counts(datasets):
    """
    Argument: the dict obtained after splitting the dataset (keys 'train'/'val'/'test', values the corresponding GraphDataset)
    Returns: a dict (keys 'train'/'val'/'test', values the number of labeled nodes in the corresponding split)
    """
    data_set_splits = {}
    for ds_name,ds in datasets.items():
        #print(ds_name) train
        #print(ds[0].node_label_index) {'n0': tensor([10, 8, 3, 12, 0, 13]), 'n1': tensor([ 0, 8, 1, 15, 5, 7])}
        data_set_splits[ds_name]=ds[0].node_label_index['n0'].shape[0]+ds[0].node_label_index['n1'].shape[0]
        # node_label_index is the suggested attribute here, but my guess is that node_label would work too
        # node_label_index is introduced below
    return data_set_splits
dataset = GraphDataset([hete], task='node')
# Splitting the dataset
dataset_train, dataset_val, dataset_test = dataset.split(transductive=True, split_ratio=[0.4, 0.3, 0.3])
datasets = {'train': dataset_train, 'val': dataset_val, 'test': dataset_test}
data_set_splits = compute_dataset_split_counts(datasets)
for dataset_name, num_nodes in data_set_splits.items():
    print("{} dataset has {} nodes".format(dataset_name, num_nodes))
Output:
train dataset has 12 nodes
val dataset has 10 nodes
test dataset has 12 nodes
HeteroGraph.node_label_index: Slicing node label to get the corresponding split G.node_label[G.node_label_index]. (From Introduction — DeepSNAP 0.2.0 documentation)
That description is pretty hard to follow... What it means is that node_label_index maps the nodes of a split back, by index, to their original labels. For example:
data_train=dataset_train[0]
print(data_train.node_label)
print(data_train.node_label_index)
print(hete.node_label)
print(hete.node_label_index)
print(hete.node_label['n0'][data_train.node_label_index['n0']])
Output:
{'n0': tensor([0, 0, 0, 0, 0, 0]), 'n1': tensor([1, 1, 1, 1, 1, 1])}
{'n0': tensor([ 5, 13, 14, 9, 0, 2]), 'n1': tensor([ 6, 11, 4, 13, 9, 15])}
{'n0': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'n1': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
{'n0': tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]), 'n1': tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])}
tensor([0, 0, 0, 0, 0, 0])
1.8 Visualizing the DeepSNAP Dataset
from deepsnap.dataset import GraphDataset
dataset = GraphDataset([hete], task='node')
# Splitting the dataset
dataset_train, dataset_val, dataset_test = dataset.split(transductive=True, split_ratio=[0.4, 0.3, 0.3])
titles = ['Train', 'Validation', 'Test']
for i, dataset in enumerate([dataset_train, dataset_val, dataset_test]):
    n0 = hete._convert_to_graph_index(dataset[0].node_label_index['n0'], 'n0').tolist()
    #[21, 5, 7, 8, 16, 11]
    # Judging from the context, this returns the graph indices of the nodes with node_type n0 in this split. _convert_to_graph_index() returns a Tensor
    n1 = hete._convert_to_graph_index(dataset[0].node_label_index['n1'], 'n1').tolist()
    plt.figure(figsize=(7, 7))
    plt.title(titles[i])
    nx.draw(G_orig, pos=pos, node_color="grey", edge_color=colors, labels=labels, font_color='white')
    nx.draw_networkx_nodes(G_orig.subgraph(n0), pos=pos, node_color="blue")
    # subgraph() returns a view of the subgraph induced on the given nodes
    nx.draw_networkx_nodes(G_orig.subgraph(n1), pos=pos, node_color="red")
    show()
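2. Heterogeneous Graph Node Prediction Task
The questions in this part appear to be adapted from DeepSNAP's official heterogeneous-graph node prediction example: deepsnap/node_classification_acm.py at master · snap-stanford/deepsnap. So I copied part of my answers from someone else's colab 4 write-up and part from this example (my guess is that the instructors built the questions by modifying this official code).
First, suppose we have a graph that has 2 node types and 3 message types.
One heterogeneous layer contains 3 Heterogeneous GNN layers (HeteroGNNConv in this colab); each HeteroGNNConv layer performs message passing and aggregation for only one message type.
[Figure: overall algorithm flow]
In this colab, the l-th heterogeneous GNN layer is managed by the l-th Heterogeneous GNN Wrapper layer (HeteroGNNWrapperConv in this colab), which passes and aggregates messages directly from the previous layer's node embeddings to the next layer's node embeddings.
[Figure: overall algorithm flow]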
2.1 Imports
import copy
import torch
import deepsnap
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.nn as pyg_nn
from sklearn.metrics import f1_score
from deepsnap.hetero_gnn import forward_op
from deepsnap.hetero_graph import HeteroGraph
from torch_sparse import SparseTensor, matmul
2.2 Heterogeneous GNN Layer
Each message type is handled by its own layer.
The layer follows the GraphSAGE formulation. To keep things simple, this colab uses mean as the aggregator:
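Reading the update rule off the HeteroGNNConv implementation in this section, the layer for a single message type m can be written roughly as follows. The weight names W_d, W_s and W come from the comments in the code. The symbols h_v for a node embedding and N_m(v) for the neighbors of v under message type m are my own notation, so treat this as a reconstruction and not the colab's exact statement.

```latex
h_v^{(l)[m]} = W^{(l)[m]} \cdot \mathrm{CONCAT}\Big( W_d^{(l)[m]} \cdot h_v^{(l-1)},\;
    W_s^{(l)[m]} \cdot \mathrm{AGG}\big(\{ h_u^{(l-1)},\ \forall u \in N_m(v) \}\big) \Big)

\mathrm{AGG}\big(\{ h_u^{(l-1)},\ \forall u \in N_m(v) \}\big)
    = \frac{1}{|N_m(v)|} \sum_{u \in N_m(v)} h_u^{(l-1)}
```

In the code, W_d corresponds to lin_dst, W_s to lin_src, and W to lin_update, and the concatenation order (destination term first, aggregated term second) matches torch.cat in update().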
class HeteroGNNConv(pyg_nn.MessagePassing):
    def __init__(self, in_channels_src, in_channels_dst, out_channels):
        super(HeteroGNNConv, self).__init__(aggr="mean")
        self.in_channels_src = in_channels_src
        self.in_channels_dst = in_channels_dst
        self.out_channels = out_channels
        self.lin_dst=nn.Linear(in_channels_dst,out_channels) #W_d^{(l)[m]}
        self.lin_src=nn.Linear(in_channels_src,out_channels) #W_s^{(l)[m]}
        self.lin_update=nn.Linear(out_channels*2,out_channels) #W^{(l)[m]}
    def forward(
        self,
        node_feature_src,
        node_feature_dst,
        edge_index,
        size=None,
        res_n_id=None,
    ):
        return self.propagate(edge_index,size=size,
                    node_feature_src=node_feature_src,
                    node_feature_dst=node_feature_dst,res_n_id=res_n_id)
    def message_and_aggregate(self, edge_index, node_feature_src):
        # Here edge_index is torch_sparse SparseTensor.
        out=matmul(edge_index,node_feature_src,reduce=self.aggr)
        # To be honest I didn't understand this; I'll leave it for later
        return out
    def update(self, aggr_out, node_feature_dst, res_n_id):
        aggr_out=self.lin_src(aggr_out)
        node_feature_dst=self.lin_dst(node_feature_dst)
        concat_features = torch.cat((node_feature_dst, aggr_out),dim=-1)
        # dim=-1 is the same as dim=1 here
        aggr_out = self.lin_update(concat_features)
        return aggr_out
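As far as I understand it, the matmul(edge_index, node_feature_src, reduce="mean") call in message_and_aggregate treats the sparse adjacency as a (destination x source) matrix and, for each destination row, averages the feature rows of the sources it points to. Below is a minimal pure-Python sketch of that computation. It does not use torch_sparse; the name mean_aggregate and the adjacency-list input format are assumptions for illustration.

```python
def mean_aggregate(adj_rows, src_features):
    """Mean-aggregate source features per destination node.

    adj_rows[i] lists the indices of the source nodes that send a message to
    destination node i, i.e. the nonzero columns of row i of the adjacency matrix.
    """
    dim = len(src_features[0])
    out = []
    for neighbors in adj_rows:
        row = [0.0] * dim
        for u in neighbors:
            for j in range(dim):
                row[j] += src_features[u][j]
        if neighbors:  # rows without neighbors stay at zero
            row = [x / len(neighbors) for x in row]
        out.append(row)
    return out

src_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
adj_rows = [[0, 1], [2], []]  # destination 0 hears from sources 0 and 1, and so on
print(mean_aggregate(adj_rows, src_features))
# [[2.0, 3.0], [5.0, 6.0], [0.0, 0.0]]
```

The output has one row per destination node, which is what update() then receives as aggr_out.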
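2.3 Heterogeneous GNN Wrapper Layer
After a GNN layer (HeteroGNNConv) has been applied to each message type, the per-message-type results need to be aggregated at every layer.
This colab uses two aggregation methods.
First: mean. A node's embedding is the average of the embeddings it receives from the message types whose destination node type is the node's own type, so the sum is divided by the number of such message types.
Second: semantic level attention introduced in HAN (Wang et al. (2019))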
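The first option, mean aggregation across message types, can be sketched in plain Python as follows. The function name mean_over_message_types and the input layout (a dict from message type to a list of per-node embedding vectors) are assumptions for illustration and are not the colab's HeteroGNNWrapperConv interface. Embeddings are grouped by destination node type and averaged element-wise over the message types in each group.

```python
def mean_over_message_types(embeddings_by_message_type):
    """Average per-message-type embeddings for each destination node type.

    embeddings_by_message_type maps (src_type, edge_type, dst_type) to a list of
    embedding vectors, one per node of type dst_type (same node order in every list).
    """
    grouped = {}
    for (_, _, dst_type), emb in embeddings_by_message_type.items():
        grouped.setdefault(dst_type, []).append(emb)
    result = {}
    for dst_type, emb_list in grouped.items():
        num_types = len(emb_list)  # number of message types ending in dst_type
        result[dst_type] = [
            [sum(vals) / num_types for vals in zip(*node_embs)]
            for node_embs in zip(*emb_list)
        ]
    return result

# Hypothetical toy input: two message types end in n0, one ends in n1.
toy = {
    ("n0", "e0", "n0"): [[1.0, 1.0]],
    ("n1", "e2", "n0"): [[3.0, 5.0]],
    ("n1", "e1", "n1"): [[2.0, 2.0]],
}
print(mean_over_message_types(toy))
# {'n0': [[2.0, 3.0]], 'n1': [[2.0, 2.0]]}
```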