Simple Natural Language Processing


I. String Methods

strip():
1. With no argument, removes whitespace from both ends of the string;
2. With a character argument, removes that character from both ends of the string;
3. lstrip() and rstrip() work the same way, but only on the left or right end.

replace(): substitute one substring with another

find(): find a substring and return its index

isalpha(): whether the string consists only of letters

isdigit(): whether the string consists only of digits

split(): split the string on the separator passed in the parentheses

'char'.join([list]): pass the list to be joined in the parentheses; its elements are concatenated using the character as the separator (a combined sketch follows below)
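
A minimal sketch (not part of the original post) exercising these built-in methods on a made-up string:

s = "  Hello, NLP world!  "
print(s.strip())				# "Hello, NLP world!"  (whitespace removed from both ends)
print(s.strip().strip("!"))		# "Hello, NLP world"   (trailing "!" removed as well)
print(s.replace("NLP", "natural language"))		# substitution
print(s.find("NLP"))			# 9, index of the first occurrence (-1 if absent)
print("abc".isalpha(), "123".isdigit())			# True True
words = "Hello, NLP world!".split(" ")			# ['Hello,', 'NLP', 'world!']
print("-".join(words))			# Hello,-NLP-world!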

II. Regular Expression Matching

1. Searching

import re
input = "自然语言处理很重 。 12abc789"
pattern = re.compile(r'\s')
re.findall(pattern, input)	# Result: [' ', ' ']

match() starts matching at the beginning of the string; if the pattern does not match there, the result is None.
It tries to match one string conforming to the rule from the start position; on success it returns a match object, otherwise None. It takes three parameters:
pattern: the regular expression rule
string: the string to match against
flags: the matching mode (optional flags)

Once match() succeeds, the result is a match object with the following methods:
group(): returns the substring matched by the regex
start(): returns the start position of the match
end(): returns the end position of the match
span(): returns a tuple of the (start, end) positions of the match

import re
input2 = "9自然语言处理"
pattern = re.compile(r'\d')
match = re.match(pattern, input2)
print(match.group())	# Result: 9; when re.match() fails to match, it returns None
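
For completeness, a quick sketch (added here) checking the other three match-object methods on the same match:

print(match.start())	# 0, the match begins at the first character
print(match.end())		# 1, the index just past the matched digit
print(match.span())		# (0, 1)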

search(): scans the string for the first substring matching the rule and returns as soon as one is found; if nothing in the string matches, it returns None. Its parameters are similar to match(), and the returned object supports the same four methods described above.

import re
input2 = "自9然语言处理"
pattern = re.compile(r'\d')
match = re.search(pattern, input2)
match.group()
print(match.span())

The difference between re.match() and re.search()

  • match() only checks whether the pattern matches at the beginning of the string
  • search() scans the whole string for a match
  • match() returns a match object only if the match succeeds at the start of the string; otherwise it returns None
import re
 
print(re.match('super', 'superstition').span())
# 输出结果:(0, 5)
 
print(re.match('super','insuperable'))
# 输出结果:None
 
print(re.search('super','superstition').span())
# 输出结果:(0, 5)
 
print(re.search('super','insuperable').span())
# 输出结果:(2, 7)

2. Substitution

sub(pattern, replace_str, target): replace every match of pattern in target with replace_str

subn(pattern, replace_str, target): same as sub(), but also returns the number of substitutions made

import re


input2 = "123自然语言处理"
pattern = re.compile(r'\d')
match = re.sub(pattern, "数字", input2)
print(match)	# 数字数字数字自然语言处理

match2 = re.subn(pattern, "数字", input2)
print(match2)	# ('数字数字数字自然语言处理', 3)

3. Splitting

re.split(pattern, target)

import re


input2 = "自然语言处理123机器学习456深度学习"
pattern = re.compile(r'\d+')
match = re.split(pattern, input2)
print(match)	# ['自然语言处理', '机器学习', '深度学习']

4. Extraction

Note that when capture groups "()" are used, the pattern matches at the first position where the whole expression can succeed; in the example below, p1 captures 123, and p2 then captures 机器学习 rather than 自然语言处理.

import re


input2 = "自然语言处理123机器学习456深度学习"
pattern = re.compile(r'(?P<p1>\d+)(?P<p2>\D+)')
match = re.search(pattern, input2)
print(match.group("p1"))	# 123
print(match.group("p2"))	# 机器学习

III. The NLTK Toolkit

pip install nltk

nltk.download()
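
nltk.download() with no arguments opens an interactive downloader; as an alternative, a small sketch that fetches just the resources used in the rest of this section by name:

import nltk

nltk.download("punkt")						# tokenizer models used by word_tokenize
nltk.download("stopwords")					# stop word lists
nltk.download("averaged_perceptron_tagger")	# tagger used by pos_tag
nltk.download("maxent_ne_chunker")			# chunker used by ne_chunk
nltk.download("words")						# word list required by the NE chunker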

Tokenization

import nltk
from nltk.tokenize import word_tokenize
from nltk.text import Text

input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon, We have no play basketball tomorrow."
tokens = word_tokenize(input_str)
token_list = [word.lower() for word in tokens]
print(token_list[:5])

t = Text(token_list)
t.count("good")		# count occurrences of "good"
t.index("good")		# find the index of "good"
t.plot(8)			# plot the 8 most frequent tokens

Stop words: words such as "is" and "the" that contribute little to the meaning of a sentence

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stopwords.readme().replace("\n", " ")
stopwords.fileids()	# list the languages for which stop word lists are available
stopwords.raw("english")		# view the English stop words, e.g. stopwords.raw("english").replace("\n", " ")
input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon, We have no play basketball tomorrow."
tokens = word_tokenize(input_str)
test_words = [word.lower() for word in tokens]
test_words_set = set(test_words)
stop_words = test_words_set.intersection(set(stopwords.words("english")))

# Filter out the stop words
filtered = [w for w in test_words_set if (w not in stopwords.words("english"))]

Part-of-Speech Tagging

nltk.download()		# in the downloader, choose Averaged Perceptron Tagger
from nltk import pos_tag
tags = pos_tag(tokens)
POS tag | Meaning
CC | coordinating conjunction
CD | cardinal number
DT | determiner
EX | existential "there"
FW | foreign word
IN | preposition or subordinating conjunction
JJ | adjective
JJR | adjective, comparative
JJS | adjective, superlative
LS | list item marker
MD | modal verb
NN | noun, singular
NNS | noun, plural
NNP | proper noun
PDT | predeterminer
POS | possessive ending
PRP | personal pronoun
PRP$ | possessive pronoun
RB | adverb
RBR | adverb, comparative
RBS | adverb, superlative
RP | particle
UH | interjection
VB | verb, base form
VBD | verb, past tense
VBG | gerund or present participle
VBN | verb, past participle
VBP | verb, present tense, not 3rd person singular
VBZ | verb, present tense, 3rd person singular
WDT | wh-determiner
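
With the tag meanings above, the (word, tag) pairs returned by pos_tag can be filtered directly; a small sketch pulling out the nouns from the tags computed earlier:

# tags is the list of (word, POS) pairs produced by pos_tag(tokens) above
nouns = [word for word, tag in tags if tag.startswith("NN")]
print(nouns)	# singular, plural and proper nouns from the sentence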

Chunking

from nltk.chunk import RegexpParser

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("died", "VBD")]
grammar = "my_NP: {<DT>?<JJ>*<NN>}"
cp = RegexpParser(grammar)		# build the chunking rule
result = cp.parse(sentence)		# run the chunker on the tagged sentence
print(result)

result.draw()	# draw the parse tree in a pop-up window
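
To get the chunks programmatically instead of drawing them, a short sketch walking the nltk.Tree returned by parse():

# result is the nltk.Tree returned by cp.parse(sentence) above
for subtree in result.subtrees():
    if subtree.label() == "my_NP":
        print(subtree.leaves())	# [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')]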

Named Entity Recognition

nltk.download()		# download maxent_ne_chunker and words
from nltk import ne_chunk
sentence = "Edison went to Tsinghua University today."
print(ne_chunk(pos_tag(word_tokenize(sentence))))

Data Cleaning

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

s = "    RT @Amila #Test\nTom's newly listed Co &amp; Mary's unlisted     Group to supply tech for nlTK. \nh $TSLA $AAPL https://t.co/x34afsfQsh"

# The stop word list to use
cache_english_stopwords = stopwords.words("english")

def text_clean(text):
    print("Raw text:", text, "\n")

    # Remove special entities: HTML escapes (e.g. &amp;), hashtags and @mentions
    text_no_special_entities = re.sub(r'&\w*;|#\w*|@\w*', "", text)
    print("After removing special entities:", text_no_special_entities, "\n")

    # Remove ticker symbols such as $TSLA ("$" must be escaped in the regex)
    text_no_tickers = re.sub(r'\$\w*', "", text_no_special_entities)
    print("After removing ticker symbols:", text_no_tickers, "\n")

    # Remove hyperlinks
    text_no_hyperlinks = re.sub(r'https?://.*/\w*', "", text_no_tickers)
    print("After removing hyperlinks:", text_no_hyperlinks, "\n")

    # Remove very short words (1-2 characters), mostly abbreviations
    text_no_small_words = re.sub(r'\b\w{1,2}\b', "", text_no_hyperlinks)
    print("After removing short words:", text_no_small_words, "\n")

    # Collapse extra whitespace
    text_no_whitespace = re.sub(r'\s\s+', " ", text_no_small_words)
    text_no_whitespace = text_no_whitespace.lstrip(" ")
    print("After removing extra whitespace:", text_no_whitespace, "\n")

    # Tokenize
    tokens = word_tokenize(text_no_whitespace)
    print("Tokens:", tokens, "\n")

    # Remove stop words
    list_no_stopwords = [i for i in tokens if i not in cache_english_stopwords]
    print("After removing stop words:", list_no_stopwords, "\n")

    # Join the filtered tokens back together
    text_filtered = " ".join(list_no_stopwords)
    print("Filtered result:", text_filtered)

text_clean(s)

IV. The spaCy Toolkit

Text Processing

# https://spacy.io/usage
# pip install -U spacy
# python -m spacy download en    # run CMD as administrator to download the English model; with multiple Python installations, install it into the intended Python environment
import spacy
nlp = spacy.load("en")
doc = nlp("Weather is good, very windy and sunny. We have no classes in the afternoon.")
# Tokenization
for token in doc:
    print(token)
# Sentence segmentation
for sent in doc.sents:
    print(sent)

Part of Speech

POS tag reference: www.winwaed.com/blog/2011/1…

POS Tag | Description | Example
CC | coordinating conjunction | and
CD | cardinal number | 1, third
DT | determiner | the
EX | existential there | there is
FW | foreign word | d’hoevre
IN | preposition/subordinating conjunction | in, of, like
JJ | adjective | big
JJR | adjective, comparative | bigger
JJS | adjective, superlative | biggest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular or mass | door
NNS | noun plural | doors
NNP | proper noun, singular | John
NNPS | proper noun, plural | Vikings
PDT | predeterminer | both the boys
POS | possessive ending | friend 's
PRP | personal pronoun | I, he, it
PRP$ | possessive pronoun | my, his
RB | adverb | however, usually, naturally, here, good
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
TO | to | to go, to him
UH | interjection | uhhuhhuhh
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, sing. present, non-3d | take
VBZ | verb, 3rd person sing. present | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when

for token in doc:
    print("{}-{}".format(token, token.pos_))

Named Entity Recognition

doc_2 = nlp("I went to Paris where I met my old friend Jack from uni.")
for ent in doc_2.ents:
    print("{}-{}.format(ent, ent.label_)")
# Paris-GPE
# Jack_PERSON
from spacy import displacy

doc = nlp("I went to Paris where I met my old friend Jack from uni.")
displacy.render(doc, style="ent", jupyter=True)		# run inside a Jupyter notebook in the browser, not PyCharm
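
Outside a notebook, a brief sketch of the alternatives displacy offers: render to an HTML string, or serve the visualization locally:

html = displacy.render(doc, style="ent", jupyter=False)		# returns the markup as a string
# displacy.serve(doc, style="ent")							# or start a local web server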

Finding the names of all the characters in a book

def read_file(file_name):
    with open(file_name, "r") as file:
        return file.read()
    
# Load the text
text = read_file("./data/pride_and_prejudice.txt")
processed_text = nlp(text)
sentences = [s for s in processed_text.sents]
print(len(sentences))

from collections import Counter, defaultdict		# Counter for tallying entities, defaultdict used later
def find_person(doc):
    c = Counter()
    for ent in doc.ents:		# use the function argument rather than the global variable
        if ent.label_ == "PERSON":
            c[ent.lemma_] += 1
    return c.most_common(10)

print(find_person(processed_text))

Analyzing Terrorist Attacks

def read_file_to_list(file_name):
    with open(file_name, "r") as file:
        return file.readlines()
    
terrorism_articles = read_file_to_list("data/rand-terrorism-dataset.txt")
print(terrorism_articles[:5])

terrorism_articles_nlp = [nlp(art) for art in terrorism_articles]
common_terrorist_groups = [
    "taliban", "al - qaeda", "hamas", "fatah", "plo", "hilad al - rafidayn"
]
common_locations = [
    "iraq", "baghdad", "kirkuk", "mosul", "afghanistan", "kabul", "basra", "palestine", "gaza", "israel", "istanbul", "beirut", "pakistan"
]
location_entity_dict = defaultdict(Counter)

for article in terrorism_articles_nlp:

    # people or organizations mentioned in the article
    article_terrorist_groups = [ent.lemma_ for ent in article.ents if ent.label_ == "PERSON" or ent.label_ == "ORG"]
    # locations (geopolitical entities) mentioned in the article
    article_locations = [ent.lemma_ for ent in article.ents if ent.label_ == "GPE"]
    # keep only the groups and locations from the predefined lists
    terrorist_common = [ent for ent in article_terrorist_groups if ent in common_terrorist_groups]
    location_common = [ent for ent in article_locations if ent in common_locations]

    for found_entity in terrorist_common:
        for found_location in location_common:
            location_entity_dict[found_entity][found_location] += 1

print(location_entity_dict)

import pandas as pd
location_entity_df = pd.DataFrame.from_dict(dict(location_entity_dict), dtype=int)
location_entity_df = location_entity_df.fillna(value = 0).astype(int)
print(location_entity_df)


import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 10))
hmap = sns.heatmap(location_entity_df, annot=True, fmt="d", cmap="YlGnBu", cbar=False)

# Add labels
plt.title("Global Incidents by Terrorist group")
plt.xticks(rotation=30)
plt.show()

V. The Jieba Tokenizer

Word Segmentation

# pip install jieba
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)		# cut_all=True: full mode; cut_all=False: precise mode
print("Full mode: " + "/".join(seg_list))		# full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Precise mode: " + "/".join(seg_list))		# precise mode

seg_list = jieba.cut("他来到了网易杭研大厦")		# precise mode by default
print(",".join(seg_list))


# Full mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
# Precise mode: 我/ 来到/ 北京/ 清华大学
# 他, 来到, 了, 网易, 杭研, 大厦

Adding a Custom Dictionary

text = "故宫的著名景点包括乾清宫、太和殿和黄琉璃瓦等"
# Full mode
seg_list = jieba.cut(text, cut_all=True)
print("[Full mode]: " + "/".join(seg_list))

# Precise mode
seg_list = jieba.cut(text, cut_all=False)
print("[Precise mode]: " + "/".join(seg_list))
# [Full mode]: 故宫/ 的/ 著名/ 著名景点/ 景点/ 包括/ 乾/ 清宫/ / / 太和/ 太和殿/ 和/ 黄/ 琉璃/ 琉璃瓦/ 等
# [Precise mode]: 故宫/ 的/ 著名景点/ 包括/ 乾/ 清宫/ 、/ 太和殿/ 和/ 黄/ 琉璃瓦/ 等

jieba.load_userdict("./data/mydict.txt")		# the dictionary file must be saved as UTF-8 (can be set in "Save As")
# Alternatively: jieba.add_word("乾清宫")

# Full mode
seg_list = jieba.cut(text, cut_all=True)
print("[Full mode]: " + "/".join(seg_list))

# Precise mode
seg_list = jieba.cut(text, cut_all=False)
print("[Precise mode]: " + "/".join(seg_list))
# [Full mode]: 故宫/ 的/ 著名/ 著名景点/ 景点/ 包括/ 乾清宫/ 清宫/ / / 太和/ 太和殿/ 和/ 黄琉璃瓦/ 琉璃/ 琉璃瓦/ 等
# [Precise mode]: 故宫/ 的/ 著名景点/ 包括/ 乾清宫/ 、/ 太和殿/ 和/ 黄琉璃瓦/ 等


Keyword Extraction

import jieba.analyse

seg_list = jieba.cut(text, cut_all=False)
print("分词结果:")
print("/".join(seg_list))

# 获取关键词
tags = jieba.analyse.extract_tags(text, topK=5)
print("关键词:")
print(" ".join(tags))
# 分词结果:
# 故宫/的/著名景点/包括/乾清宫/、/太和殿/和/黄琉璃瓦/等
# 关键词:
# 著名景点 乾清宫 黄琉璃瓦 太和殿 故宫

tags = jieba.analyse.extract_tags(text, topK=5, withWeight=True)
for word, weight in tags:
    print(word, weight)
# 著名景点 2.3167796086666668
# 乾清宫 1.9924612504833332
# 黄琉璃瓦 1.9924612504833332
# 太和殿 1.6938346722833335
# 故宫 1.5411195503033335
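
jieba.analyse also provides a TextRank-based extractor; a brief sketch on the same text (the keywords and weights may differ from the TF-IDF results above):

tags = jieba.analyse.textrank(text, topK=5, withWeight=True)	# TextRank keywords, nouns/verbs by default
for word, weight in tags:
    print(word, weight)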

Part-of-Speech Tagging

import jieba.posseg as pseg
words = pseg.cut("我爱北京天安门")
for word, flag in words:
    print("%s %s" % (word, flag))

# 我 r
# 爱 v
# 北京 ns
# 天安门 ns