Simple Natural Language Processing
I. String Methods
strip():
1. With no argument, it removes whitespace from both ends of the string;
2. With a character argument, it removes that character from both ends of the string;
3. lstrip() and rstrip() behave the same way, but only strip the left or right end.
replace(): substitution
find(): search for a substring
isalpha(): whether the string is all letters
isdigit(): whether the string is all digits
split(): splitting; pass the delimiter string in the parentheses
'char'.join([list]): pass the list to be joined; concatenates the items of the list with the given character (see the sketch after this list)
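A minimal sketch of these methods on made-up example strings:

s = "  hello world  "
print(s.strip())                  # 'hello world'  -> whitespace removed from both ends
print("xxabcxx".strip("x"))       # 'abc'          -> the given character removed from both ends
print(s.replace("world", "NLP"))  # '  hello NLP  '
print(s.find("world"))            # 8  -> index of the first occurrence, -1 if not found
print("abc".isalpha(), "123".isdigit())  # True True
print("a,b,c".split(","))         # ['a', 'b', 'c']
print("-".join(["a", "b", "c"]))  # 'a-b-c'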
II. Regular Expression Matching
1. Searching
import re
input = "自然语言处理很重 。 12abc789"
pattern = re.compile(r'\s')
re.findall(pattern, input)  # result: [' ', ' ']
match() matches from the beginning of the string; if the beginning does not match, the result is None.
It matches one string that fits the rule starting at the first position; on success it returns a match object, on failure it returns None. It takes three parameters:
pattern: the regular expression rule
string: the string to match against
flags: the matching mode
Once match() succeeds, the result is a match object with the following methods:
group(): returns the string matched by the regex
start(): returns the start position of the match
end(): returns the end position of the match
span(): returns a tuple with the (start, end) positions of the match
import re
input2 = "9自然语言处理"
pattern = re.compile(r'\d')
match = re.match(pattern, input2)
print(match.group())  # result: 9; when nothing matches, re.match() returns None
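To round out the other three match-object methods listed above, a small sketch reusing the same digit pattern and input (the None check is just a safety guard, since re.match() returns None when nothing matches):

m = re.match(r'\d', "9自然语言处理")
if m is not None:
    print(m.group())  # '9'
    print(m.start())  # 0
    print(m.end())    # 1
    print(m.span())   # (0, 1)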
search(): scans the string for a substring that matches the rule; it returns as soon as the first match is found, or None if nothing in the string matches. Its parameters are the same as match(), and the returned match object has the same four methods listed above.
import re
input2 = "自9然语言处理"
pattern = re.compile(r'\d')
match = re.search(pattern, input2)
print(match.group())  # '9'
print(match.span())   # (1, 2)
Differences between re.match() and re.search()
- match() only checks whether the pattern matches at the start of the string
- search() scans the whole string looking for a match
- match() returns a match object only if the match succeeds at the very beginning; otherwise it returns None
import re
print(re.match('super', 'superstition').span())
# output: (0, 5)
print(re.match('super', 'insuperable'))
# output: None
print(re.search('super', 'superstition').span())
# output: (0, 5)
print(re.search('super', 'insuperable').span())
# output: (2, 7)
2. Substitution
sub(pattern, replace_str, target): replaces every match of pattern in target with replace_str and returns the new string
subn(pattern, replace_str, target): same as sub(), but returns a tuple (new string, number of substitutions made)
import re
input2 = "123自然语言处理"
pattern = re.compile(r'\d')
match = re.sub(pattern, "数字", input2)
print(match) # 数字数字数字自然语言处理
match2 = re.subn(pattern, "数字", input2)
print(match2) # ('数字数字数字自然语言处理', 3)
3. Splitting
re.split(pattern, target): splits target at every match of pattern
import re
input2 = "自然语言处理123机器学习456深度学习"
pattern = re.compile(r'\d+')
match = re.split(pattern, input2)
print(match) # ['自然语言处理', '机器学习', '深度学习']
4. Extraction
Note that when "()" groups are used, the groups capture in order from left to right and each one keeps only the first thing it matches. In the example below, p1 matches 123, so p2 then matches 机器学习 rather than 自然语言处理.
import re
input2 = "自然语言处理123机器学习456深度学习"
pattern = re.compile(r'(?P<p1>\d+)(?P<p2>\D+)')
match = re.search(pattern, input2)
print(match.group("p1")) # 123
print(match.group("p2")) # 机器学习
III. The NLTK Toolkit
pip install nltk
nltk.download()  # opens an interactive downloader for corpora and models
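Individual resources can also be fetched by name instead of through the interactive downloader; the resource ids below are the standard NLTK package names needed by the examples in this section:

import nltk
nltk.download("punkt")                       # tokenizer models used by word_tokenize
nltk.download("stopwords")                   # stop-word lists
nltk.download("averaged_perceptron_tagger")  # POS tagger
nltk.download("maxent_ne_chunker")           # named-entity chunker
nltk.download("words")                       # word list used by the NE chunker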
Tokenization
import nltk
from nltk.tokenize import word_tokenize
from nltk.text import Text
input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon, We have no play basketball tomorrow."
tokens = word_tokenize(input_str)
token_list = [word.lower() for word in tokens]
print(token_list[:5])
t = Text(token_list)
t.count("good")  # count the occurrences of "good"
t.index("good")  # find the index position of "good"
t.plot(8)        # plot the 8 most frequent tokens
Stop words: words such as is, the, etc. that contribute little to the meaning of a sentence.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stopwords.readme().replace("\n", " ")  # read the corpus README
stopwords.fileids()  # list the languages that have stop-word lists
stopwords.raw("english")  # view the English stop words, e.g. stopwords.raw("english").replace("\n", " ")
input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon, We have no play basketball tomorrow."
tokens = word_tokenize(input_str)
test_words = [word.lower() for word in tokens]
test_words_set = set(test_words)
stop_words = test_words_set.intersection(set(stopwords.words("english")))  # stop words that actually appear in the text
# filter out the stop words
filtered = [w for w in test_words_set if (w not in stopwords.words("english"))]
Part-of-speech tagging
nltk.download()  # the third item in the downloader list: Averaged Perceptron Tagger (equivalently, nltk.download("averaged_perceptron_tagger"))
from nltk import pos_tag
tags = pos_tag(tokens)
| POS Tag | Meaning |
|---|---|
| CC | coordinating conjunction |
| CD | cardinal number |
| DT | determiner |
| EX | existential "there" |
| FW | foreign word |
| IN | preposition or subordinating conjunction |
| JJ | adjective |
| JJR | adjective, comparative |
| JJS | adjective, superlative |
| LS | list item marker |
| MD | modal verb |
| NN | noun, singular |
| NNS | noun, plural |
| NNP | proper noun |
| PDT | predeterminer |
| POS | possessive ending |
| PRP | personal pronoun |
| PRP$ | possessive pronoun |
| RB | adverb |
| RBR | adverb, comparative |
| RBS | adverb, superlative |
| RP | particle |
| UH | interjection |
| VB | verb, base form |
| VBD | verb, past tense |
| VBG | gerund or present participle |
| VBN | verb, past participle |
| VBP | verb, non-3rd person singular present |
| VBZ | verb, 3rd person singular present |
| WDT | wh-determiner |
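As a quick illustration of how these tags are typically consumed, the sketch below filters the tags list produced above down to nouns; the filtering step is an addition for illustration, not part of the original example:

print(tags[:5])  # the first few (word, tag) pairs

# keep only the nouns: any tag starting with "NN" (NN, NNS, NNP, ...)
nouns = [word for word, tag in tags if tag.startswith("NN")]
print(nouns)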
Chunking
from nltk.chunk import RegexpParser
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("died", "VBD")]
grammar = "my_NP: {<DT>?<JJ>*<NN>}"
cp = RegexpParser(grammar)   # build the chunking rule
result = cp.parse(sentence)  # chunk the tagged sentence
print(result)
result.draw()  # draw the parse tree in a separate window
Named entity recognition
nltk.download()  # download "maxent_ne_chunker" and "words"
from nltk import ne_chunk
sentence = "Edison went to Tsinghua University today."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
Data cleaning
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
s = " RT @Amila #Test\nTom's newly listed Co & Mary's unlisted Group to supply tech for nlTK. \nh $TSLA $AAPL https://t.co/x34afsfQsh"
# the stop-word list to use
cache_english_stopwords = stopwords.words("english")
def text_clean(text):
    print("Raw data:", text, "\n")
    # remove HTML entities (e.g. &amp;), hashtags and @mentions
    text_no_special_entities = re.sub(r'&\w*;|#\w*|@\w*', "", text)
    print("After removing special tags:", text_no_special_entities, "\n")
    # remove ticker symbols such as $TSLA (the $ has to be escaped in the regex)
    text_no_tickers = re.sub(r'\$\w*', "", text_no_special_entities)
    print("After removing ticker symbols:", text_no_tickers, "\n")
    # remove hyperlinks
    text_no_hyperlinks = re.sub(r'https?://.*/\w*', "", text_no_tickers)
    print("After removing hyperlinks:", text_no_hyperlinks, "\n")
    # remove very short words (1-2 characters), mostly abbreviations
    text_no_small_words = re.sub(r'\b\w{1,2}\b', "", text_no_hyperlinks)
    print("After removing short words:", text_no_small_words, "\n")
    # collapse extra whitespace
    text_no_whitespace = re.sub(r'\s\s+', " ", text_no_small_words)
    text_no_whitespace = text_no_whitespace.lstrip(" ")
    print("After removing extra whitespace:", text_no_whitespace, "\n")
    # tokenize
    tokens = word_tokenize(text_no_whitespace)
    print("Tokens:", tokens, "\n")
    # remove stop words
    list_no_stopwords = [i for i in tokens if i not in cache_english_stopwords]
    print("After removing stop words:", list_no_stopwords, "\n")
    # final filtered result
    text_filtered = " ".join(list_no_stopwords)
    print("Filtered result:", text_filtered)
text_clean(s)
IV. The spaCy Toolkit
Text processing
# https://spacy.io/usage
# pip install -U spacy
# python -m spacy download en  # run the shell as administrator to download the English model; with several Python installs, run it from the intended one (newer spaCy versions call the model "en_core_web_sm")
import spacy
nlp = spacy.load("en")  # in newer spaCy versions: spacy.load("en_core_web_sm")
doc = nlp("Weather is good, very windy and sunny. We have no classes in the afternoon.")
# tokenization
for token in doc:
    print(token)
# sentence segmentation
for sent in doc.sents:
    print(sent)
Part of speech
POS tag reference: www.winwaed.com/blog/2011/1…
| POS Tag | Description | Example |
|---|---|---|
| CC | coordinating conjunction | and |
| CD | cardinal number | 1, third |
| DT | determiner | the |
| EX | existential there | there is |
| FW | foreign word | d’hoevre |
| IN | preposition/subordinating conjunction | in, of, like |
| JJ | adjective | big |
| JJR | adjective, comparative | bigger |
| JJS | adjective, superlative | biggest |
| LS | list marker | 1) |
| MD | modal | could, will |
| NN | noun, singular or mass | door |
| NNS | noun plural | doors |
| NNP | proper noun, singular | John |
| NNPS | proper noun, plural | Vikings |
| PDT | predeterminer | both the boys |
| POS | possessive ending | friend ‘s |
| PRP | personal pronoun | I, he, it |
| PRP$ | possessive pronoun | my, his |
| RB | adverb | however, usually, naturally, here, good |
| RBR | adverb, comparative | better |
| RBS | adverb, superlative | best |
| RP | particle | give up |
| TO | to | to go, to him |
| UH | interjection | uhhuhhuhh |
| VB | verb, base form | take |
| VBD | verb, past tense | took |
| VBG | verb, gerund/present participle | taking |
| VBN | verb, past participle | taken |
| VBP | verb, sing. present, non-3d | take |
| VBZ | verb, 3rd person sing. present | takes |
| WDT | wh-determiner | which |
| WP | wh-pronoun | who, what |
| WP$ | possessive wh-pronoun | whose |
| WRB | wh-adverb | where, when |
for token in doc:
    print("{}-{}".format(token, token.pos_))
Named entity recognition
doc_2 = nlp("I went to Paris where I met my old friend Jack from uni.")
for ent in doc_2.ents:
    print("{}-{}".format(ent, ent.label_))
# Paris-GPE
# Jack-PERSON
from spacy import displacy
doc = nlp("I went to Paris where I met my old friend Jack from uni.")
displacy.render(doc, style="ent", jupyter=True)  # run in a Jupyter notebook in the browser, not in PyCharm
Finding the names of all the characters in the book
def read_file(file_name):
    with open(file_name, "r") as file:
        return file.read()
# load the text data
text = read_file("./data/pride_and_prejudice.txt")
processed_text = nlp(text)
sentences = [s for s in processed_text.sents]
print(len(sentences))
from collections import Counter, defaultdict  # Counter for counting; defaultdict is used in the next example
def find_person(doc):
    c = Counter()
    for ent in doc.ents:  # use the document passed in
        if ent.label_ == "PERSON":
            c[ent.lemma_] += 1
    return c.most_common(10)
print(find_person(processed_text))
Terrorist attack analysis
def read_file_to_list(file_name):
    with open(file_name, "r") as file:
        return file.readlines()
terrorism_articles = read_file_to_list("data/rand-terrorism-dataset.txt")
print(terrorism_articles[:5])
terrorism_articles_nlp = [nlp(art) for art in terrorism_articles]
common_terrorist_groups = [
    "taliban", "al - qaeda", "hamas", "fatah", "plo", "bilad al - rafidayn"
]
common_locations = [
    "iraq", "baghdad", "kirkuk", "mosul", "afghanistan", "kabul", "basra", "palestine", "gaza", "israel", "istanbul", "beirut", "pakistan"
]
location_entity_dict = defaultdict(Counter)
for article in terrorism_articles_nlp:
    # persons or organisations mentioned in the article
    article_terrorist_groups = [ent.lemma_ for ent in article.ents if ent.label_ == "PERSON" or ent.label_ == "ORG"]
    # places mentioned in the article
    article_locations = [ent.lemma_ for ent in article.ents if ent.label_ == "GPE"]
    terrorist_common = [ent for ent in article_terrorist_groups if ent in common_terrorist_groups]
    locations_common = [ent for ent in article_locations if ent in common_locations]
    for found_entity in terrorist_common:
        for found_location in locations_common:
            location_entity_dict[found_entity][found_location] += 1
print(location_entity_dict)
import pandas as pd
location_entity_df = pd.DataFrame.from_dict(dict(location_entity_dict), dtype=int)
location_entity_df = location_entity_df.fillna(value = 0).astype(int)
print(location_entity_df)
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 10))
hmap = sns.heatmap(location_entity_df, annot=True, fmt="d", cmap="YlGnBu", cbar=False)
# add the title and axis formatting
plt.title("Global Incidents by Terrorist group")
plt.xticks(rotation=30)
plt.show()
V. The Jieba Tokenizer
Word segmentation
# pip install jieba
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)  # cut_all=True: full mode; cut_all=False: precise mode
print("全模式:" + "/".join(seg_list))  # full mode
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("精确模式:" + "/".join(seg_list))  # precise mode
seg_list = jieba.cut("他来到了网易杭研大厦")  # precise mode by default
print(",".join(seg_list))
# 全模式:我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
# 精确模式:我/ 来到/ 北京/ 清华大学
# 他, 来到, 了, 网易, 杭研, 大厦
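jieba.cut() returns a generator; when a plain list is more convenient, jieba.lcut() returns a list directly. A minimal sketch (the expected output matches the precise-mode result above):

import jieba
words = jieba.lcut("我来到北京清华大学")  # like cut(), but returns a list instead of a generator
print(words)  # expected: ['我', '来到', '北京', '清华大学']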
Adding a custom dictionary
text = "故宫的著名景点包括乾清宫、太和殿和黄琉璃瓦等"
# full mode
seg_list = jieba.cut(text, cut_all=True)
print(u"[全模式]:", "/".join(seg_list))
# precise mode
seg_list = jieba.cut(text, cut_all=False)
print(u"[精确模式]:", "/".join(seg_list))
# [全模式]:故宫/ 的/ 著名/ 著名景点/ 景点/ 包括/ 乾/ 清宫/ / / 太和/ 太和殿/ 和/ 黄/ 琉璃/ 琉璃瓦/ 等
# [精确模式]:故宫/ 的/ 著名景点/ 包括/ 乾/ 清宫/ 、/ 太和殿/ 和/ 黄/ 琉璃瓦/ 等
jieba.load_userdict("./data/mydict.txt")  # the dictionary file must be UTF-8 encoded (can be set via "Save As" in an editor)
# alternatively, use jieba.add_word("乾清宫")
# full mode
seg_list = jieba.cut(text, cut_all=True)
print(u"[全模式]:", "/".join(seg_list))
# precise mode
seg_list = jieba.cut(text, cut_all=False)
print(u"[精确模式]:", "/".join(seg_list))
# [全模式]: 故宫/ 的/ 著名/ 著名景点/ 景点/ 包括/ 乾清宫/ 清宫/ / / 太和/ 太和殿/ 和/ 黄琉璃瓦/ 琉璃/ 琉璃瓦/ 等
# [精确模式]: 故宫/ 的/ 著名景点/ 包括/ 乾清宫/ 、/ 太和殿/ 和/ 黄琉璃瓦/ 等
Keyword extraction
import jieba.analyse
seg_list = jieba.cut(text, cut_all=False)
print("分词结果:")
print("/".join(seg_list))
# extract the keywords
tags = jieba.analyse.extract_tags(text, topK=5)
print("关键词:")
print(" ".join(tags))
# 分词结果:
# 故宫/的/著名景点/包括/乾清宫/、/太和殿/和/黄琉璃瓦/等
# 关键词:
# 著名景点 乾清宫 黄琉璃瓦 太和殿 故宫
tags = jieba.analyse.extract_tags(text, topK=5, withWeight=True)  # withWeight=True also returns each keyword's TF-IDF weight
for word, weight in tags:
print(word, weight)
# 著名景点 2.3167796086666668
# 乾清宫 1.9924612504833332
# 黄琉璃瓦 1.9924612504833332
# 太和殿 1.6938346722833335
# 故宫 1.5411195503033335
Part-of-speech tagging
import jieba.posseg as pseg
words = pseg.cut("我爱北京天安门")
for word, flag in words:
print("%s %s" % (word, flag))
# 我 r
# 爱 v
# 北京 ns
# 天安门 ns