[NLP] Vulnerability Intelligence Information Extraction: Data Preparation

Keep creating, keep growing! This is day 1 of my participation in the Juejin "Daily New Plan · October Writing Challenge".

Preface:

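This article summarizes an open-source threat intelligence information extraction and classification task. It is divided into data preparation, data processing, model building and training, and model usage. The data preparation part uses basic Python libraries such as requests and BeautifulSoup. The approach is fairly simple and suits readers who are new to web scraping. The data collected in this article will be shared via a cloud-drive link. Because crawlers put load on the target server, the crawler code is for learning only: please do not scrape frequently, and please do not use the data commercially.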

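The Requests library

The requests library can be installed with pip:

pip3 install requests

It is used to send HTTP requests. Compared with urllib2, urllib3, and similar libraries, it is more concise and more convenient to use. For example: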

import requests

headers = {
    'Connection': 'keep-alive',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
}
response = requests.get("https://www.baidu.com", headers=headers)
print(response)
print(response.text)

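Running this prints the response object followed by the text of the page.

<Response [200]> carries the status code of a successful request, and what follows it is the text of the response page. With requests we can easily fetch the response data for most web pages. If the data we need is already in the response body, parsing that body yields the useful text. This is where the Beautifulsoup4 library comes in: it parses HTML data.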
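Printing the response object is fine for a quick look, but a crawler usually wants to fail loudly on a non-200 status and to avoid waiting forever on a stalled connection. Below is a minimal sketch of that idea. The helper name fetch_html is hypothetical (it is not part of the original code), and its get parameter exists only so the function can be exercised without network access.

```python
import requests


def fetch_html(url, headers=None, timeout=10, get=requests.get):
    """Fetch a page and return its text (hypothetical helper, not from the article)."""
    # timeout stops the call from blocking indefinitely on a stalled connection
    response = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://www.baidu.com", headers=headers) would return the same string as response.text in the example above, while turning error statuses into exceptions.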

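The lxml parser that the spider code later passes to BeautifulSoup is a separate package (pip install lxml). Beautiful Soup itself is installed with pip as follows: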

pip install bs4

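According to the official documentation, the library parses HTML text into a parse tree and provides a rich set of tag selectors for locating elements in a page.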

from bs4 import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
# Name the parser explicitly so bs4 does not warn that no parser was specified
soup = BeautifulSoup(''.join(doc), 'lxml')
print(soup.prettify())
print(soup.select("#secondpara"))


result1:
<html>
 <head>
  <title>
   Page title
  </title>
 </head>
 <body>
  <p align="center" id="firstpara">
   This is paragraph
   <b>
    one
   </b>
   .
  </p>
  <p align="blah" id="secondpara">
   This is paragraph
   <b>
    two
   </b>
   .
  </p>
 </body>
</html>


result2:
[<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>]

For example, soup.select("#secondpara") retrieves an element by its id.
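Selecting by id is only one of the CSS selectors that select() accepts. The sketch below shows a few others, along with reading text and attributes from the matched elements. It is an illustration rather than part of the original code: it uses a well-formed version of the two-paragraph document and the built-in "html.parser" so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = (
    '<html><body>'
    '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None) instead of a list
second = soup.select_one("#secondpara")
text = second.get_text()      # text of the element and its children
align = second["align"]       # attribute access works like a dict

# Child combinator: every <b> that is a direct child of a <p>
bold_words = [b.get_text() for b in soup.select("p > b")]

# Attribute selector: paragraphs whose align attribute equals "center"
centered_ids = [p["id"] for p in soup.select('p[align="center"]')]

print(text, align, bold_words, centered_ids)
```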

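With requests and bs4 in hand, the two can be combined to scrape web page data. Analyzing the Alibaba Cloud vulnerability database site (avd.aliyun.com), we find that its structure is fairly simple: paging is done by changing the URL, and the data is not loaded dynamically via AJAX, so scraping it is straightforward.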
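Since paging only requires changing the page number in the URL, the list pages can be enumerated up front and visited one at a time. The following is a rough sketch under that assumption. The names page_urls and crawl are hypothetical, the URL pattern is the one that appears in the spider code, and fetch stands for whatever function downloads a single page.

```python
import time

# URL pattern taken from the article's spider code
LIST_URL = "https://avd.aliyun.com/nvd/list?page={}"


def page_urls(first_page, last_page):
    """Build the list-page URLs for an inclusive page range."""
    return [LIST_URL.format(page) for page in range(first_page, last_page + 1)]


def crawl(first_page, last_page, fetch, delay_seconds=1.0):
    """Call fetch(url) for each page, sleeping between requests to limit server load."""
    results = []
    for url in page_urls(first_page, last_page):
        results.append(fetch(url))
        time.sleep(delay_seconds)
    return results
```

The delay between requests is a deliberate design choice: as the preface notes, frequent scraping puts load on the server, so a pause per page keeps the crawl polite.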

The core code is as follows:

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# NOTE: `writer` (an open output file) and `spider_details` (the detail-page
# scraper) are not shown in this excerpt; they come from the rest of the script.


def spider(page_number, count):
    headers = {
        'Connection': 'keep-alive',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    }
    response = requests.get("https://avd.aliyun.com/nvd/list?page={}".format(page_number),
                            headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')
    # Skip the first <tr>, which is the table header
    for item in soup.select("tr")[1:]:
        try:
            count += 1
            print(count)
            cells = item.select("td")
            # The first cell holds the link to the detail page; its text is the CVE number
            link = cells[0].find("a")
            cve_number = link.get_text(strip=True)
            # get_text returns the cell text without the surrounding <td> markup
            title = cells[1].get_text(strip=True)
            href = urljoin("https://avd.aliyun.com/", link["href"])
            feature = spider_details(href)
            if feature:
                feature["cve_number"] = cve_number
                feature["title"] = title
                feature["href"] = href
                # One record per line, written as the dict's string form
                writer.write("{}\n".format(feature))
                writer.flush()
            # Pause between detail-page requests to limit load on the server
            time.sleep(0.5)
        except Exception as e:
            print(e)
    return response.text

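First set the request headers and fetch the page source with requests, then parse it with BeautifulSoup. Looking at the page structure, rows are selected with soup.select("tr")[1:], where [1:] skips the first row, which is the table header.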
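The row selection, and the per-cell extraction that follows it, can be tried offline on a small table. The markup below is made up for illustration and the real list page may be structured differently. The field names and the example.com base URL are likewise placeholders.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up markup for illustration; not copied from the real site
html = """
<table>
  <tr><th>ID</th><th>Title</th></tr>
  <tr><td><a href="/detail?id=1">ITEM-1</a></td><td>First title</td></tr>
  <tr><td><a href="/detail?id=2">ITEM-2</a></td><td>Second title</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for row in soup.select("tr")[1:]:          # [1:] drops the header row
    cells = row.select("td")               # one entry per column
    link = cells[0].select_one("a")        # link to the detail page
    records.append({
        "id": link.get_text(strip=True),
        "title": cells[1].get_text(strip=True),
        # urljoin turns the relative href into an absolute URL
        "href": urljoin("https://example.com/", link["href"]),
    })

print(records)
```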

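Within each row, the column selector select("td") retrieves the data we need, such as the CVE number and the vulnerability title, along with the href of the detail page. The href is joined to the site root to get the detail page URL, and the detail page is then scraped with the same method to obtain the vulnerability description, affected versions, and so on. The scraped data is formatted as a JSON-like record and saved. A single record looks like this: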

{
  'content': "a race condition was found in the way the linux kernel's memory subsystem handled the copy-on-write (cow) breakage of private read-only shared memory mappings. this flaw allows an unprivileged, local user to gain write access to read-only memory mappings, increasing their privileges on the system.",
  'company': [
    'ubuntu_18.04'
  ],
  'product': [
    'linux'
  ],
  'version': [
    '*'
  ],
  'influence': [
    '4.13.0',
    '16.19'
  ],
  'type': '系统',
  'cve_number': 'CVE-2022-2590',
  'title': '空标题',
  'href': 'https://avd.aliyun.com/detail?id=AVD-2022-2590'
}

Each record contains content (the text), company, product, version, influence (affected versions), type (vulnerability type), cve_number (CVE number), title, and href (detail page URL).
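Because the spider writes each record with "{}\n".format(feature), every saved line is the string form of a Python dict (note the single quotes in the sample), which json.loads would reject. One way to read such lines back, sketched here under that assumption, is ast.literal_eval. The helper names are hypothetical, and the sample line is a shortened version of the record above.

```python
import ast
import json


def parse_record(line):
    """Parse one saved line (the repr of a dict) back into a dict."""
    # literal_eval only evaluates Python literals, so it is safer than eval
    return ast.literal_eval(line.strip())


def to_json_line(record):
    """Serialize a record as strict JSON, keeping non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False)


sample_line = "{'cve_number': 'CVE-2022-2590', 'product': ['linux'], 'type': '系统'}\n"
record = parse_record(sample_line)
json_line = to_json_line(record)
print(json_line)
```

Converting to strict JSON at this point is optional, but it makes the file readable by tools outside Python.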

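The full set contains more than 100,000 valid records. With the raw data in hand, it can be processed into the training data format we want for model training. The scraped data and the processing method will be given in the next article. Thanks~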