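Introduction
Earlier posts in this series described the task background: classifying vulnerability intelligence text, with a model deciding whether each text describes a system vulnerability or an application vulnerability. The previous post covered basic data processing. Because this is only a baseline, no optimization was done: stop words and low-frequency words were not removed, and no word2vec embeddings were trained. The processed data looks like this:
System: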
a race condition was found in the way the linux kernel ' s memory subsystem handled the copy - on - write ( cow ) breakage of private read - only shared memory mappings . this flaw allows an unprivileged , local user to gain write access to read - only memory mappings , increasing their privileges on the system .
Application:
bitbucket server 与 data center 是 一款 代码 协作 软件 。 近期 atlassian 官方 发布 安全更新 , 披露 了 cve - 2022 - 36804 bitbucket server and data center 远程 命令 执行 漏洞 。 攻击者 在 可以 接触 到 公开 项目 , 或者 对 私有 项目 拥有 read 权限 的 情况 下 , 可 利用 相关 api 执行 任意 命令 , 控制 服务器 。
The text is segmented with jieba, and the resulting tokens are joined with spaces.
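As a minimal sketch of this preprocessing step, the snippet below tokenizes a string, lowercases it, and joins the tokens with spaces. `simple_tokenize` and `preprocess` are hypothetical names, and the regex tokenizer is only a dependency-free stand-in for jieba; with jieba installed, `jieba.lcut` could be passed as `tokenize` instead. Lowercasing is an assumption inferred from the sample data above.

```python
import re


def simple_tokenize(text):
    # Stand-in tokenizer (assumption): runs of ASCII letters/digits stay together,
    # every other non-space character becomes its own token.
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def preprocess(text, tokenize=simple_tokenize):
    # Drop whitespace-only tokens, lowercase the rest, and join with single spaces.
    tokens = [tok.lower() for tok in tokenize(text) if tok.strip()]
    return " ".join(tokens)


sample = preprocess("Linux kernel's copy-on-write (COW)")
```

The result has the same spacing pattern as the system sample above, with punctuation split into separate tokens.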
TextCNN
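TextCNN is a classic model for text classification. Like recurrent networks such as RNN and LSTM, it serves as the baseline for many text classification tasks. CNNs (convolutional neural networks) are mostly used in computer vision, and TextCNN was one of the first models to apply them to natural language processing. Its structure does not depart from ordinary convolution; compared with deep image networks such as ResNet, the network is actually simpler. The model structure is as follows:
The text vectors pass through only a single convolution layer (the red, yellow, and green parts of the figure). The convolution output goes into a max-pooling layer, and the pooled output is classified by a softmax layer.
The difference from an image CNN lies in how the kernel moves. An image CNN slides a window over a two-dimensional picture with a fixed stride to extract features. In text, each token (word or character) is mapped to a vector; in the figure's example, "i like this movie very much !", each word becomes one word vector. A filter narrower than the word vector would cover only a fragment of each word's embedding, which has no particular meaning, so the filter width should equal the embedding size and the filter slides only along the token axis.
Features are then extracted with kernels of different heights. The paper's example uses heights 2, 3, and 4, meaning each convolution step covers 2, 3, or 4 words at a time. The idea resembles n-grams and is quite elegant. For each height, the figure uses 2 filters (the "channels" of this post), each with its own weights, so that more than one feature can be learned per window size. This is different from channels in CV, where the term refers to the three RGB color channels of the input image.
The model code is also fairly simple: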
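To make the full-width convolution and max-pooling concrete, here is a small pure-Python sketch rather than the TensorFlow implementation. It uses a 7-token sentence (as in the figure's example), an assumed embedding size of 5, and random values in place of trained weights; `conv_feature_map`, `random_matrix`, and `textcnn_features` are hypothetical helper names.

```python
import random


def conv_feature_map(sentence, kernel, bias=0.0):
    # Slide a kernel that spans the full embedding width down the token axis
    # (stride 1, no padding), applying ReLU to each window's weighted sum.
    height = len(kernel)
    feature_map = []
    for start in range(len(sentence) - height + 1):
        total = bias
        for i in range(height):
            for j in range(len(kernel[i])):
                total += sentence[start + i][j] * kernel[i][j]
        feature_map.append(max(0.0, total))
    return feature_map


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def textcnn_features(sentence, kernels_by_height):
    # Max-pool every feature map down to one value and concatenate the results.
    features, map_lengths = [], {}
    for height, kernels in kernels_by_height.items():
        for kernel in kernels:
            feature_map = conv_feature_map(sentence, kernel)
            map_lengths[height] = len(feature_map)
            features.append(max(feature_map))
    return features, map_lengths


rng = random.Random(0)
sequence_length, embedding_size = 7, 5  # illustrative sizes (assumption)
sentence = random_matrix(sequence_length, embedding_size, rng)
# Two filters for each of the heights 2, 3, and 4.
kernels_by_height = {
    h: [random_matrix(h, embedding_size, rng) for _ in range(2)] for h in (2, 3, 4)
}
features, map_lengths = textcnn_features(sentence, kernels_by_height)
```

A kernel of height h yields a feature map of length 7 - h + 1, and pooling reduces every map to a single number, so the sentence ends up as a 6-dimensional feature vector regardless of its length.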
import tensorflow as tf
import numpy as np


class TextCNN(object):
    """
    A CNN for text classification.
    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
    """
    def __init__(
      self, sequence_length, num_classes, vocab_size,
      embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):

        # Placeholders for input, output and dropout
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")

        # Keeping track of l2 regularization loss (optional)
        l2_loss = tf.constant(0.0)

        # Embedding layer
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            self.W = tf.Variable(
                tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
                name="W")
            self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

        # Create a convolution + maxpool layer for each filter size
        pooled_outputs = []
        for i, filter_size in enumerate(filter_sizes):
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                # Convolution Layer
                filter_shape = [filter_size, embedding_size, 1, num_filters]
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
                conv = tf.nn.conv2d(
                    self.embedded_chars_expanded,
                    W,
                    strides=[1, 1, 1, 1],
                    padding="VALID",
                    name="conv")
                # Apply nonlinearity
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                # Maxpooling over the outputs
                pooled = tf.nn.max_pool(
                    h,
                    ksize=[1, sequence_length - filter_size + 1, 1, 1],
                    strides=[1, 1, 1, 1],
                    padding='VALID',
                    name="pool")
                pooled_outputs.append(pooled)

        # Combine all the pooled features
        num_filters_total = num_filters * len(filter_sizes)
        self.h_pool = tf.concat(pooled_outputs, 3)
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])

        # Add dropout
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

        # Final (unnormalized) scores and predictions
        with tf.name_scope("output"):
            W = tf.get_variable(
                "W",
                shape=[num_filters_total, num_classes],
                initializer=tf.contrib.layers.xavier_initializer())
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
            l2_loss += tf.nn.l2_loss(W)
            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
            self.predictions = tf.argmax(self.scores, 1, name="predictions")

        # Calculate mean cross-entropy loss
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

        # Accuracy
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
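Parameters:
- sequence_length: sentence length; longer inputs are truncated and shorter ones are padded
- num_classes: number of classes; determines the output dimension of the softmax layer
- vocab_size: vocabulary size; determines the number of rows of the randomly initialized embedding matrix
- embedding_size: dimension of each word vector in the embedding matrix
- filter_sizes: filter heights, the 2, 3, 4 mentioned earlier
- num_filters: number of filters (output channels) per filter size
- l2_reg_lambda: weight of the L2 regularization term, 0 by default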
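One way to see how these parameters interact is to count the trainable weights they imply. The sketch below follows the variable shapes in the code above; `count_parameters` is a hypothetical helper, and the hyperparameter values are illustrative assumptions rather than the settings used in this series.

```python
def count_parameters(vocab_size, embedding_size, filter_sizes, num_filters, num_classes):
    # Embedding matrix: one row of embedding_size values per vocabulary entry.
    embedding = vocab_size * embedding_size
    # One kernel of shape [filter_size, embedding_size, 1, num_filters] plus a bias
    # of shape [num_filters] for every filter size.
    conv = sum(fs * embedding_size * 1 * num_filters + num_filters for fs in filter_sizes)
    # Output layer: weights [num_filters_total, num_classes] plus bias [num_classes].
    num_filters_total = num_filters * len(filter_sizes)
    output = num_filters_total * num_classes + num_classes
    return {
        "embedding": embedding,
        "conv": conv,
        "output": output,
        "total": embedding + conv + output,
    }


counts = count_parameters(
    vocab_size=10000, embedding_size=128,
    filter_sizes=(2, 3, 4), num_filters=128, num_classes=2)
```

Note that sequence_length does not appear in the count: it only fixes the input shape and the pooling window, not the number of weights.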
placeholder
# Placeholders for input, output and dropout
self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
# Keeping track of l2 regularization loss (optional)
l2_loss = tf.constant(0.0)
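- self.input_x: token indices of the text, shape [batch_size, sequence_length]
- self.input_y: one-hot labels of the text, shape [batch_size, num_classes]
- self.dropout_keep_prob: probability of keeping a unit during dropout (not the fraction dropped)
- l2_loss: accumulator for the L2 regularization loss, initialized to 0.0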
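The values fed into these placeholders could be prepared roughly as follows. This is a sketch under assumptions: `pad_or_truncate` and `one_hot` are hypothetical helpers, index 0 is assumed to be the PAD token, and the labels 0 and 1 stand for the two classes in an arbitrary order.

```python
def pad_or_truncate(token_ids, sequence_length, pad_id=0):
    # Cut sequences that are too long, then right-pad short ones with pad_id.
    clipped = token_ids[:sequence_length]
    return clipped + [pad_id] * (sequence_length - len(clipped))


def one_hot(label_index, num_classes):
    # input_y is a float32 matrix with a single 1.0 per row.
    return [1.0 if i == label_index else 0.0 for i in range(num_classes)]


raw_batch = [[4, 9, 2], [7, 1, 3, 8, 6, 5, 2]]
batch_x = [pad_or_truncate(ids, sequence_length=5) for ids in raw_batch]
batch_y = [one_hot(label, num_classes=2) for label in (0, 1)]
```

Each row of batch_x now has exactly sequence_length entries, which is what the fixed placeholder shape requires.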
Embedding layer
with tf.device('/cpu:0'), tf.name_scope("embedding"):
    self.W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
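The embedding lookup is pinned to the CPU to avoid running out of GPU memory. The embedding matrix is randomly initialized, and the input token indices are mapped to word vectors of shape [batch_size, sequence_length, embedding_size]. tf.expand_dims then appends a channel dimension of size 1, giving the 4-D input that tf.nn.conv2d expects.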
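The lookup and the added channel dimension can be mimicked with nested lists. The following is a pure-Python sketch of the shapes involved, not the TensorFlow ops themselves; `embedding_lookup`, `expand_last_dim`, and `shape` are hypothetical helpers and the sizes are arbitrary.

```python
import random


def embedding_lookup(table, batch_ids):
    # Replace every token index with its row of the embedding table.
    return [[table[i] for i in ids] for ids in batch_ids]


def expand_last_dim(batch):
    # Wrap every scalar in a list, mirroring tf.expand_dims(x, -1).
    return [[[[v] for v in vec] for vec in seq] for seq in batch]


def shape(nested):
    # Read off the dimensions of a regularly nested list.
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims


rng = random.Random(0)
vocab_size, embedding_size = 10, 4
table = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
         for _ in range(vocab_size)]
batch_ids = [[4, 9, 2, 0, 0], [7, 1, 3, 8, 6]]
embedded = embedding_lookup(table, batch_ids)
expanded = expand_last_dim(embedded)
```

Here the batch goes from shape [2, 5] to [2, 5, 4] after the lookup and to [2, 5, 4, 1] after the expansion.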
Convolution and pooling layers
pooled_outputs = []
for i, filter_size in enumerate(filter_sizes):
    with tf.name_scope("conv-maxpool-%s" % filter_size):
        # Convolution Layer
        filter_shape = [filter_size, embedding_size, 1, num_filters]
        W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
        conv = tf.nn.conv2d(
            self.embedded_chars_expanded,
            W,
            strides=[1, 1, 1, 1],
            padding="VALID",
            name="conv")
        # Apply nonlinearity
        h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
        # Maxpooling over the outputs
        pooled = tf.nn.max_pool(
            h,
            ksize=[1, sequence_length - filter_size + 1, 1, 1],
            strides=[1, 1, 1, 1],
            padding='VALID',
            name="pool")
        pooled_outputs.append(pooled)
# Combine all the pooled features
num_filters_total = num_filters * len(filter_sizes)
self.h_pool = tf.concat(pooled_outputs, 3)
self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])
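pooled_outputs collects the pooled output of every convolution branch. Kernels of each size in filter_sizes (2, 3, and 4 in the example) are convolved over the input, with num_filters filters per size (2 in the figure). Each feature map is then max-pooled over the whole sequence and appended to pooled_outputs. Finally, the features from the different kernel sizes are concatenated and flattened to form the output of the convolution stage.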
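It can help to trace the tensor shapes through this block. The helper below is a hypothetical shape calculator that follows the VALID-padding arithmetic of the code above; the batch size, sequence length, and embedding size are illustrative assumptions.

```python
def trace_shapes(batch_size, sequence_length, embedding_size, filter_sizes, num_filters):
    shapes = {"input": [batch_size, sequence_length, embedding_size, 1]}
    for fs in filter_sizes:
        # VALID padding: the kernel fits (sequence_length - fs + 1) times along the
        # token axis and exactly once across the embedding axis.
        shapes["conv-%d" % fs] = [batch_size, sequence_length - fs + 1, 1, num_filters]
        # The pooling window covers the whole feature map, leaving one value per filter.
        shapes["pool-%d" % fs] = [batch_size, 1, 1, num_filters]
    num_filters_total = num_filters * len(filter_sizes)
    # Concatenation along axis 3, then flattening to a matrix.
    shapes["h_pool"] = [batch_size, 1, 1, num_filters_total]
    shapes["h_pool_flat"] = [batch_size, num_filters_total]
    return shapes


shapes = trace_shapes(batch_size=32, sequence_length=50, embedding_size=128,
                      filter_sizes=(2, 3, 4), num_filters=2)
```

Because every branch is pooled down to [batch_size, 1, 1, num_filters], branches with different kernel heights can be concatenated even though their feature maps have different lengths.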
Dropout layer
# Add dropout
with tf.name_scope("dropout"):
    self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)
Dropout is applied to the output of the convolution stage to prevent overfitting.
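For intuition, here is a pure-Python sketch of dropout with a keep probability: each value is kept with probability keep_prob and scaled by 1 / keep_prob, otherwise it is zeroed. The `dropout` function and the sample values are illustrative, and feeding 1.0 at evaluation time is the usual convention rather than something stated in this post.

```python
import random


def dropout(values, keep_prob, rng):
    # Keep each value with probability keep_prob and rescale it so the
    # expected sum of the output matches the input.
    out = []
    for v in values:
        if rng.random() < keep_prob:
            out.append(v / keep_prob)
        else:
            out.append(0.0)
    return out


rng = random.Random(0)
features = [0.5, 1.0, 1.5, 2.0]
train_out = dropout(features, keep_prob=0.5, rng=rng)  # some values zeroed
eval_out = dropout(features, keep_prob=1.0, rng=rng)   # nothing is dropped
```

With keep_prob set to 1.0 the layer becomes the identity, which is why the same graph can serve both training and evaluation by changing only the placeholder value.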
Fully connected layer
# Final (unnormalized) scores and predictions
with tf.name_scope("output"):
    W = tf.get_variable(
        "W",
        shape=[num_filters_total, num_classes],
        initializer=tf.contrib.layers.xavier_initializer())
    b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
    l2_loss += tf.nn.l2_loss(W)
    l2_loss += tf.nn.l2_loss(b)
    self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
    self.predictions = tf.argmax(self.scores, 1, name="predictions")
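This computes y = xW + b to determine the predicted class. The dropout output first goes through a fully connected layer that maps it to shape [batch_size, num_classes]; argmax then selects the highest-scoring class as the model's output, that is, the predicted label.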
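The same computation on a toy example, written in plain Python: `xw_plus_b` and `argmax` here are hand-written stand-ins for the TensorFlow ops of the same names, and the weights are made-up numbers chosen for illustration.

```python
def xw_plus_b(x, W, b):
    # Matrix product of x [batch, features] with W [features, classes], plus bias b.
    return [
        [sum(row[k] * W[k][j] for k in range(len(row))) + b[j] for j in range(len(b))]
        for row in x
    ]


def argmax(row):
    # Index of the largest score in a row.
    return max(range(len(row)), key=lambda j: row[j])


x = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]        # 2 samples, 3 pooled features
W = [[0.5, -0.5], [-1.0, 1.0], [0.25, 0.0]]   # 3 features -> 2 classes
b = [0.1, 0.1]
scores = xw_plus_b(x, W, b)
predictions = [argmax(row) for row in scores]
```

The scores are unnormalized logits; argmax over them gives the same label as argmax over their softmax, so no softmax is needed for prediction.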
loss
with tf.name_scope("loss"):
    losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
    self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss
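The loss is the mean softmax cross-entropy over the batch plus the L2 regularization term weighted by l2_reg_lambda. This is the objective that the optimizer minimizes.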
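A small numeric sketch of this loss, written without TensorFlow. The helpers are hypothetical; `l2_loss` follows the sum(t ** 2) / 2 definition of tf.nn.l2_loss, and the cross-entropy uses the log-sum-exp trick for numerical stability.

```python
import math


def softmax_cross_entropy(logits, labels):
    # -sum(labels * log_softmax(logits)), computed stably via log-sum-exp.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * (z - log_sum_exp) for z, y in zip(logits, labels))


def l2_loss(values):
    # Half the sum of squares.
    return sum(v * v for v in values) / 2.0


def total_loss(batch_logits, batch_labels, weights, l2_reg_lambda):
    # Mean cross-entropy over the batch plus the weighted L2 penalty.
    losses = [softmax_cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses) + l2_reg_lambda * l2_loss(weights)


uniform_ce = softmax_cross_entropy([0.0, 0.0], [1.0, 0.0])
loss_value = total_loss([[0.0, 0.0]], [[1.0, 0.0]],
                        weights=[3.0, 4.0], l2_reg_lambda=0.1)
```

With equal logits for two classes the cross-entropy is ln 2, the loss of an uninformed guess, which is a handy sanity check at the start of training.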
accuracy
# Accuracy
with tf.name_scope("accuracy"):
    correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
    self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
This computes the model's accuracy: the proportion of samples whose predicted label matches the true label.
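The same metric in plain Python, as a sketch with made-up predictions and labels: the true class is recovered from each one-hot row with an argmax, compared with the prediction, and the matches are averaged. `accuracy` is a hypothetical helper name.

```python
def accuracy(predictions, one_hot_labels):
    # Recover the true class index from each one-hot row.
    true_labels = [max(range(len(y)), key=lambda j: y[j]) for y in one_hot_labels]
    # Cast each comparison to 1.0 or 0.0 and take the mean.
    correct = [1.0 if p == t else 0.0 for p, t in zip(predictions, true_labels)]
    return sum(correct) / len(correct)


acc = accuracy(
    predictions=[0, 1, 1, 0],
    one_hot_labels=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
```

Three of the four predictions match here, so the result is 0.75.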
Upcoming posts will cover the model's training code and the training process. ლ(́◉◞౪◟◉‵ლ) Thanks for reading!