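Natural Language Processing System Documentation

1. Program Overview

This document describes a Python-based natural language processing (NLP) system with the following main features: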

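Text processing: reads and writes txt and xlsx files.
Word segmentation and part-of-speech tagging: uses jieba for Chinese word segmentation, with support for a custom dictionary.
Word frequency statistics: counts high-frequency words and visualizes them.
Entity recognition: identifies and extracts person names, place names, and weapon names, and saves them to txt files.
Visualization: supports several chart types, such as bar charts, line charts, word clouds, and relationship graphs.
Graphical interface: provides a Web interface (Django) for user interaction.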

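2. Design Philosophy

Modular: each feature is encapsulated in its own function to improve reusability.
Data-driven: pandas handles data processing, and matplotlib/seaborn handle data visualization.
User-friendly: a Web interface lowers the barrier to use.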

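3. Main Technologies and Libraries

Feature	Main libraries
Text processing	pandas, openpyxl
Word segmentation and POS tagging	jieba
Word frequency statistics	collections
Visualization	matplotlib, seaborn, wordcloud
GUI	easygui, Tkinter
Web development	Django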

4. Implementation Overview

Text Processing

Install Dependencies

First, make sure the required Python libraries are installed:
```
 pip install jieba pandas openpyxl
```

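Code

 import os
 from collections import Counter
 
 import jieba
 import jieba.posseg as pseg
 import pandas as pd
 
 # 1. read_txt() and read_xlsx() read text and Excel data.
 # 2. tokenize() segments text with jieba.lcut().
 # 3. pos_tagging() tags parts of speech and returns (word, tag) pairs.
 # 4. word_frequency() counts words and returns the N most common.
 # 5. extract_entities() collects person names (nr), place names (ns), and weapon-name candidates (n, nz).
 # 6. save_to_txt() and save_to_xlsx() write the results to files.
 
 # Custom dictionary: create custom_dict.txt yourself; it is loaded only if present
 if os.path.exists("custom_dict.txt"):
     jieba.load_userdict("custom_dict.txt")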
 
 # Read a TXT file
 def read_txt(file_path):
     with open(file_path, "r", encoding="utf-8") as f:
         return f.read()
 
 # Read an Excel file (requires openpyxl for .xlsx)
 def read_xlsx(file_path, sheet_name=0):
     df = pd.read_excel(file_path, sheet_name=sheet_name)
     return df
 
 # Word segmentation
 def tokenize(text):
     words = jieba.lcut(text)
     return words
 
 # Part-of-speech tagging: returns a list of (word, tag) pairs
 def pos_tagging(text):
     words = pseg.cut(text)
     return [(word, flag) for word, flag in words]
 
 # Word frequency: returns the top_n most common (word, count) pairs
 def word_frequency(words, top_n=20):
     return Counter(words).most_common(top_n)
 
 # Collect specific word classes (person names, place names, weapon names)
 def extract_entities(text):
     words = pseg.cut(text)
     persons, locations, weapons = [], [], []
     for word, flag in words:
         if flag == "nr":  # person name
             persons.append(word)
         elif flag == "ns":  # place name
             locations.append(word)
         elif flag in ["n", "nz"]:
             # Weapon candidates: a rough heuristic that treats every
             # n/nz noun as a possible weapon name, so expect false positives
             weapons.append(word)
     return persons, locations, weapons
 
 # Save results to TXT, one item per line
 def save_to_txt(data, filename):
     with open(filename, "w", encoding="utf-8") as f:
         for item in data:
             f.write(item + "\n")
 
 # Save (word, count) pairs to Excel
 def save_to_xlsx(data, filename, sheet_name="Sheet1"):
     df = pd.DataFrame(data, columns=["Word", "Count"])
     df.to_excel(filename, sheet_name=sheet_name, index=False)
 
 
 if __name__ == "__main__":
     text = read_txt("sample.txt")
 
     # Word segmentation
     words = tokenize(text)
     print("Tokens:", words[:20])
 
     # Part-of-speech tagging
     pos_words = pos_tagging(text)
     print("POS tags:", pos_words[:20])
 
     # Word frequency
     word_freq = word_frequency(words)
     print("Word frequency:", word_freq)
 
     # Entity extraction
     persons, locations, weapons = extract_entities(text)
     print("Person names:", persons)
     print("Place names:", locations)
     print("Weapons:", weapons)
 
     # Save the results
     save_to_txt(persons, "persons.txt")
     save_to_txt(locations, "locations.txt")
     save_to_txt(weapons, "weapons.txt")
     save_to_xlsx(word_freq, "word_frequency.xlsx")
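
The n/nz branch in extract_entities() collects every common noun, so the weapons list will contain many words that are not weapons. One way to tighten it, sketched below, is to keep only words that appear in a weapon lexicon. The function name filter_weapons, the WEAPON_LEXICON set, and its entries are hypothetical placeholders, not part of the original program; the input is a list of (word, tag) pairs such as pos_tagging() returns.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical lexicon for illustration only; in practice it could be loaded
# from the same file that feeds jieba's custom dictionary.
WEAPON_LEXICON: Set[str] = {"倚天剑", "屠龙刀", "青龙偃月刀"}


def filter_weapons(
    tagged: Iterable[Tuple[str, str]],
    lexicon: Set[str] = WEAPON_LEXICON,
) -> List[str]:
    """Return the words found in the lexicon, in order of appearance (duplicates kept)."""
    return [word for word, _flag in tagged if word in lexicon]


if __name__ == "__main__":
    sample = [("张无忌", "nr"), ("手持", "v"), ("倚天剑", "nz"), ("山", "n")]
    print(filter_weapons(sample))
```

This only works if the segmenter emits each weapon name as a single token, which is why the same names would normally also be listed in custom_dict.txt.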

Visualization

Install Dependencies

If matplotlib and wordcloud are not installed yet, run:
```
 pip install matplotlib wordcloud
```
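
Code

 from collections import Counter
 
 import jieba
 import jieba.posseg as pseg
 import matplotlib.pyplot as plt
 from wordcloud import WordCloud
 
 # Use a font with Chinese glyphs so Chinese words render correctly
 # (assumes the SimHei font is installed on the system)
 plt.rcParams["font.sans-serif"] = ["SimHei"]
 plt.rcParams["axes.unicode_minus"] = False
 
 # plot_bar_chart() draws a bar chart of high-frequency word counts
 # plot_line_chart() draws a line chart to show the word-frequency trend
 # plot_pie_chart() counts part-of-speech tags and draws a pie chart of their shares
 # plot_wordcloud() generates a word cloud of common words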
 
 # Read a text file
 def read_txt(file_path):
     with open(file_path, "r", encoding="utf-8") as f:
         return f.read()
 
 # Tokenize the text and count word frequencies
 def tokenize_and_count(text, top_n=20):
     words = jieba.lcut(text)
     return Counter(words).most_common(top_n)
 
 # Bar chart
 def plot_bar_chart(word_freq):
     words, counts = zip(*word_freq)
     plt.figure(figsize=(10, 5))
     plt.bar(words, counts, color="skyblue")
     plt.xlabel("Word")
     plt.ylabel("Count")
     plt.title("High-Frequency Words (Bar Chart)")
     plt.xticks(rotation=45)
     plt.show()
 
 # Line chart
 def plot_line_chart(word_freq):
     words, counts = zip(*word_freq)
     plt.figure(figsize=(10, 5))
     plt.plot(words, counts, marker="o", color="red", linestyle="-")
     plt.xlabel("Word")
     plt.ylabel("Count")
     plt.title("High-Frequency Words (Line Chart)")
     plt.xticks(rotation=45)
     plt.show()
 
 # Pie chart (share of each part-of-speech tag)
 def plot_pie_chart(text):
     words = pseg.cut(text)
     pos_counter = Counter(flag for _, flag in words)
 
     labels, values = zip(*pos_counter.items())
 
     plt.figure(figsize=(8, 8))
     # The five colors repeat when there are more than five tags
     plt.pie(values, labels=labels, autopct="%1.1f%%",
             colors=["red", "blue", "green", "yellow", "purple"])
     plt.title("Part-of-Speech Distribution (Pie Chart)")
     plt.show()
 
 # Word cloud
 def plot_wordcloud(word_freq):
     word_dict = dict(word_freq)
     # font_path must point to a font file that contains Chinese glyphs
     wordcloud = WordCloud(
         font_path="simhei.ttf",
         background_color="white",
         width=800,
         height=400,
     ).generate_from_frequencies(word_dict)
 
     plt.figure(figsize=(10, 5))
     plt.imshow(wordcloud, interpolation="bilinear")
     plt.axis("off")
     plt.title("Word Cloud")
     plt.show()
 
 if __name__ == "__main__":
     text = read_txt("sample.txt")
 
     # Tokenize and count word frequencies
     word_freq = tokenize_and_count(text)
 
     # Draw the charts
     plot_bar_chart(word_freq)  # bar chart
     plot_line_chart(word_freq)  # line chart
     plot_pie_chart(text)  # pie chart (part-of-speech distribution)
     plot_wordcloud(word_freq)  # word cloud
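
jieba.lcut() returns punctuation, whitespace, and function words as tokens, and these tend to dominate the top of a frequency ranking. A minimal sketch of a filter applied before counting follows. The function name count_content_words, the deliberately tiny STOPWORDS set, and the min_len parameter are assumptions for illustration; a real stopword list would be much longer.

```python
import unicodedata
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS: Set[str] = {"的", "了", "是", "在", "和"}


def is_punct_or_space(token: str) -> bool:
    """True if every character is whitespace, punctuation, or a symbol."""
    return all(
        ch.isspace() or unicodedata.category(ch)[0] in ("P", "S")
        for ch in token
    )


def count_content_words(
    tokens: Iterable[str],
    top_n: int = 20,
    min_len: int = 1,
    stopwords: Set[str] = STOPWORDS,
) -> List[Tuple[str, int]]:
    """Count tokens after dropping empty, punctuation, and stopword tokens."""
    kept = [
        t for t in tokens
        if t
        and len(t) >= min_len
        and t not in stopwords
        and not is_punct_or_space(t)
    ]
    return Counter(kept).most_common(top_n)
```

The result has the same shape as the output of tokenize_and_count(), so it can be passed to plot_bar_chart() or plot_wordcloud() unchanged, for example count_content_words(jieba.lcut(text)).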

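Web Implementation

We use Django to build a Web version of the NLP word-frequency analysis and visualization system.

The overall project layout is as follows:

Project Structure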

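 nlp_project/
 │── nlp_web/           # Django project root (contains manage.py)
 │   ├── manage.py
 │   ├── nlp_web/       # project package (settings.py, urls.py)
 │   ├── nlp_app/       # Django app
 │   │   ├── templates/ # HTML templates
 │   │   ├── static/    # CSS, JS
 │   │   ├── views.py   # view logic
 │   │   ├── urls.py    # routes
 │   │   ├── models.py  # data models (optional)
 │   │   ├── forms.py   # file-upload form
 │── data/              # uploaded text files
 │── requirements.txt   # dependencies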

1. Install Django

If Django is not installed yet, install it together with the other dependencies:

 pip install django matplotlib wordcloud jieba pandas openpyxl

Then create the Django project and app:

 django-admin startproject nlp_web
 cd nlp_web
 django-admin startapp nlp_app

Edit nlp_web/settings.py, find INSTALLED_APPS, and add "nlp_app":

 INSTALLED_APPS = [
     "nlp_app",
     "django.contrib.admin",
     "django.contrib.auth",
     "django.contrib.contenttypes",
     "django.contrib.sessions",
     "django.contrib.messages",
     "django.contrib.staticfiles",
 ]

2. Create the Django Views

nlp_app/views.py

 import os
 from collections import Counter
 
 import jieba
 import matplotlib
 matplotlib.use("Agg")  # non-interactive backend: charts are saved to files, not shown
 import matplotlib.pyplot as plt
 from django.core.files.storage import FileSystemStorage
 from django.http import HttpResponse
 from django.shortcuts import render
 from wordcloud import WordCloud
 
 # Read the file contents
 def read_file(file_path):
     with open(file_path, "r", encoding="utf-8") as f:
         return f.read()
 
 # Word frequency statistics
 def word_frequency(text, top_n=20):
     words = jieba.lcut(text)
     return Counter(words).most_common(top_n)
 
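 # Output directory for generated images, relative to the directory containing manage.py
 STATIC_DIR = "nlp_app/static"
 
 # Use a font with Chinese glyphs for chart labels (assumes SimHei is installed)
 plt.rcParams["font.sans-serif"] = ["SimHei"]
 
 # Generate the bar chart
 def plot_bar_chart(word_freq):
     words, counts = zip(*word_freq)
     os.makedirs(STATIC_DIR, exist_ok=True)
     plt.figure(figsize=(10, 5))
     plt.bar(words, counts, color="skyblue")
     plt.xticks(rotation=45)
     plt.title("Word Frequency")
     plt.xlabel("Word")
     plt.ylabel("Count")
     plt.savefig(os.path.join(STATIC_DIR, "bar_chart.png"))  # save the image
     plt.close()
 
 # Generate the word cloud
 def plot_wordcloud(word_freq):
     word_dict = dict(word_freq)
     os.makedirs(STATIC_DIR, exist_ok=True)
     # font_path must point to a font file that contains Chinese glyphs
     wordcloud = WordCloud(
         font_path="simhei.ttf", width=800, height=400, background_color="white"
     ).generate_from_frequencies(word_dict)
     wordcloud.to_file(os.path.join(STATIC_DIR, "wordcloud.png"))  # save the image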
 
 # Home page
 def index(request):
     return render(request, "index.html")
 
 # Handle an uploaded file and analyze it
 def analyze_text(request):
     if request.method == "POST" and "file" in request.FILES:
         uploaded_file = request.FILES["file"]
         fs = FileSystemStorage(location="data/")  # where uploaded files are stored
         saved_name = fs.save(uploaded_file.name, uploaded_file)
         # fs.save() may rename the file to avoid a collision, so resolve the real path
         file_content = read_file(fs.path(saved_name))
 
         # Count word frequencies
         word_freq = word_frequency(file_content)
 
         # Generate the charts
         plot_bar_chart(word_freq)
         plot_wordcloud(word_freq)
 
         return render(request, "result.html", {"word_freq": word_freq})
 
     return HttpResponse("Please upload a text file.")
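
read_file() assumes the upload is UTF-8, but Chinese text files are often saved as GBK or GB18030, in which case opening them with encoding="utf-8" raises UnicodeDecodeError. The sketch below tries a short list of encodings in turn. The helper name decode_text and the choice and order of candidate encodings are assumptions, not part of the original code.

```python
from typing import Sequence


def decode_text(
    raw: bytes,
    encodings: Sequence[str] = ("utf-8-sig", "gb18030"),
) -> str:
    """Decode bytes with the first encoding that succeeds.

    utf-8-sig also strips a leading byte-order mark; gb18030 is a
    superset of GBK and GB2312.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

In the view, this could stand in for read_file(), for example by reading the saved file in binary mode and passing the bytes to decode_text(). Trying UTF-8 first matters because GB18030 will happily decode many UTF-8 byte sequences into the wrong characters.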

3. Create the URL Routes

nlp_app/urls.py

 from django.urls import path
 from .views import index, analyze_text
 
 urlpatterns = [
     path("", index, name="index"),
     path("analyze/", analyze_text, name="analyze"),
 ]

nlp_web/urls.py

 from django.contrib import admin
 from django.urls import include, path
 
 urlpatterns = [
     path("admin/", admin.site.urls),
     path("", include("nlp_app.urls")),
 ]

4. Create the Front-End Pages

Home page (index.html)

nlp_app/templates/index.html

 <!DOCTYPE html>
 <html lang="zh">
 <head>
     <meta charset="UTF-8">
     <title>Natural Language Processing System</title>
 </head>
 <body>
     <h1>Upload a text file for word-frequency analysis</h1>
     <form action="{% url 'analyze' %}" method="post" enctype="multipart/form-data">
         {% csrf_token %}
         <input type="file" name="file" required>
         <button type="submit">Submit</button>
     </form>
 </body>
 </html>

Results page (result.html)

nlp_app/templates/result.html

 <!DOCTYPE html>
 <html lang="zh">
 <head>
     <meta charset="UTF-8">
     <title>Analysis Results</title>
 </head>
 <body>
     <h1>Word Frequency Results</h1>
     <table border="1">
         <tr>
             <th>Word</th>
             <th>Count</th>
         </tr>
         {% for word, count in word_freq %}
         <tr>
             <td>{{ word }}</td>
             <td>{{ count }}</td>
         </tr>
         {% endfor %}
     </table>
 
     <h2>Charts</h2>
     <img src="/static/bar_chart.png" alt="Bar chart">
     <img src="/static/wordcloud.png" alt="Word cloud">
 
     <br><a href="{% url 'index' %}">Back to home</a>
 </body>
 </html>
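
Because the views always write bar_chart.png and wordcloud.png, two users analyzing files at the same time overwrite each other's images, and a browser may show a cached chart from an earlier run. One possible remedy is a fresh file name per request, passed to the template through the render context. The helper unique_chart_name below and the context key mentioned afterwards are hypothetical.

```python
import uuid


def unique_chart_name(prefix: str, ext: str = "png") -> str:
    """Return a file name of the form '<prefix>_<32 hex chars>.<ext>', unique per call."""
    return f"{prefix}_{uuid.uuid4().hex}.{ext}"
```

Under this scheme the view would save each image under the generated name, add the name to the context (say as bar_chart_name), and the template would reference /static/{{ bar_chart_name }} instead of a fixed path. Old images accumulate and would need periodic cleanup.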

5. Run the Django Server

First, apply the database migrations:

 python manage.py migrate

Then start the Django development server:

 python manage.py runserver

Open a browser and visit:

 http://127.0.0.1:8000/

You can now upload a .txt file, and the system will automatically generate the word-frequency table, bar chart, and word cloud.
