Simple Natural Language Processing
I. String Methods
strip():
1. With no argument, it removes whitespace from both ends of the string;
2. With a character argument, it removes that character from both ends of the string;
3. lstrip() and rstrip() behave the same way, but only strip the left or right end.
replace(): substitution
find(): search for a substring
isalpha(): whether the string is all letters
isdigit(): whether the string is all digits
split(): splitting; pass the delimiter string in the parentheses
'char'.join([list]): pass the list to be joined; concatenates the items of the list with the given character (see the sketch after this list)
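A minimal sketch of these methods on made-up example strings:

s = "  hello world  "
print(s.strip())                  # 'hello world'  -> whitespace removed from both ends
print("xxabcxx".strip("x"))       # 'abc'          -> the given character removed from both ends
print(s.replace("world", "NLP"))  # '  hello NLP  '
print(s.find("world"))            # 8  -> index of the first occurrence, -1 if not found
print("abc".isalpha(), "123".isdigit())  # True True
print("a,b,c".split(","))         # ['a', 'b', 'c']
print("-".join(["a", "b", "c"]))  # 'a-b-c'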
II. Regular Expression Matching
1. Searching
import re
input = "自然语言处理很重 。 12abc789"
pattern = re.compile(r'\s')
re.findall(pattern, input)  # result: [' ', ' ']
match() matches from the beginning of the string; if the beginning does not match, the result is None.
It matches one string that fits the rule starting at the first position; on success it returns a match object, on failure it returns None. It takes three parameters:
pattern: the regular expression rule
string: the string to match against
flags: the matching mode
Once match() succeeds, the result is a match object with the following methods:
group(): returns the string matched by the regex
start(): returns the start position of the match
end(): returns the end position of the match
span(): returns a tuple with the (start, end) positions of the match
import re
input2 = "9自然语言处理"
pattern = re.compile(r'\d')
match = re.match(pattern, input2)
print(match.group())  # result: 9; when nothing matches, re.match() returns None
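To round out the other three match-object methods listed above, a small sketch reusing the same digit pattern and input (the None check is just a safety guard, since re.match() returns None when nothing matches):

m = re.match(r'\d', "9自然语言处理")
if m is not None:
    print(m.group())  # '9'
    print(m.start())  # 0
    print(m.end())    # 1
    print(m.span())   # (0, 1)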
search(): scans the string for a substring that matches the rule; it returns as soon as the first match is found, or None if nothing in the string matches. Its parameters are the same as match(), and the returned match object has the same four methods listed above.
import re
input2 = "自9然语言处理"
pattern = re.compile(r'\d')
match = re.search(pattern, input2)
print(match.group())  # '9'
print(match.span())   # (1, 2)
Differences between re.match() and re.search()
- match() only checks whether the pattern matches at the start of the string
- search() scans the whole string looking for a match
- match() returns a match object only if the match succeeds at the very beginning; otherwise it returns None
import re
print(re.match('super', 'superstition').span())
# output: (0, 5)
print(re.match('super', 'insuperable'))
# output: None
print(re.search('super', 'superstition').span())
# output: (0, 5)
print(re.search('super', 'insuperable').span())
# output: (2, 7)
2. Substitution
sub(pattern, replace_str, target): replaces every match of pattern in target with replace_str and returns the new string
subn(pattern, replace_str, target): same as sub(), but returns a tuple (new string, number of substitutions made)
import re
input2 = "123自然语言处理"
pattern = re.compile(r'\d')
match = re.sub(pattern, "数字", input2)
print(match) # 数字数字数字自然语言处理
match2 = re.subn(pattern, "数字", input2)
print(match2) # ('数字数字数字自然语言处理', 3)
3. Splitting
re.split(pattern, target): splits target at every match of pattern
import re
input2 = "自然语言处理123机器学习456深度学习"
pattern = re.compile(r'\d+')
match = re.split(pattern, input2)
print(match) # ['自然语言处理', '机器学习', '深度学习']
4. Extraction
Note that when "()" groups are used, the groups capture in order from left to right and each one keeps only the first thing it matches. In the example below, p1 matches 123, so p2 then matches 机器学习 rather than 自然语言处理.
import re
input2 = "自然语言处理123机器学习456深度学习"
pattern = re.compile(r'(?P<p1>\d+)(?P<p2>\D+)')
match = re.search(pattern, input2)
print(match.group("p1")) # 123
print(match.group("p2")) # 机器学习
III. The NLTK Toolkit
pip install nltk
nltk.download()  # opens an interactive downloader for corpora and models
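Individual resources can also be fetched by name instead of through the interactive downloader; the resource ids below are the standard NLTK package names needed by the examples in this section:

import nltk
nltk.download("punkt")                       # tokenizer models used by word_tokenize
nltk.download("stopwords")                   # stop-word lists
nltk.download("averaged_perceptron_tagger")  # POS tagger
nltk.download("maxent_ne_chunker")           # named-entity chunker
nltk.download("words")                       # word list used by the NE chunker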
Tokenization
import nltk
from nltk.tokenize import word_tokenize
from nltk.text import Text
input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon, We have no play basketball tomorrow."
tokens = word_tokenize(input_str)
token_list = [word.lower() for word in tokens]
print(token_list[:5])
t = Text(token_list)
t.count("good")  # count the occurrences of "good"
t.index("good")  # find the index position of "good"
t.plot(8)        # plot the 8 most frequent tokens
Stop words: words such as is, the, etc. that contribute little to the meaning of a sentence.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stopwords.readme().replace("\n", " ")  # read the corpus README
stopwords.fileids()  # list the languages that have stop-word lists
stopwords.raw("english")  # view the English stop words, e.g. stopwords.raw("english").replace("\n", " ")
input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon, We have no play basketball tomorrow."
tokens = word_tokenize(input_str)
test_words = [word.lower() for word in tokens]
test_words_set = set(test_words)
stop_words = test_words_set.intersection(set(stopwords.words("english")))  # stop words that actually appear in the text
# filter out the stop words
filtered = [w for w in test_words_set if (w not in stopwords.words("english"))]
Part-of-speech tagging
nltk.download()  # the third item in the downloader list: Averaged Perceptron Tagger (equivalently, nltk.download("averaged_perceptron_tagger"))
from nltk import pos_tag
tags = pos_tag(tokens)
| POS Tag | Meaning |
|---|---|
| CC | coordinating conjunction |
| CD | cardinal number |
| DT | determiner |
| EX | existential "there" |
| FW | foreign word |
| IN | preposition or subordinating conjunction |
| JJ | adjective |
| JJR | adjective, comparative |
| JJS | adjective, superlative |
| LS | list item marker |
| MD | modal verb |
| NN | noun, singular |
| NNS | noun, plural |
| NNP | proper noun |
| PDT | predeterminer |
| POS | possessive ending |
| PRP | personal pronoun |
| PRP$ | possessive pronoun |
| RB | adverb |
| RBR | adverb, comparative |
| RBS | adverb, superlative |
| RP | particle |
| UH | interjection |
| VB | verb, base form |
| VBD | verb, past tense |
| VBG | gerund or present participle |
| VBN | verb, past participle |
| VBP | verb, non-3rd person singular present |
| VBZ | verb, 3rd person singular present |
| WDT | wh-determiner |
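As a quick illustration of how these tags are typically consumed, the sketch below filters the tags list produced above down to nouns; the filtering step is an addition for illustration, not part of the original example:

print(tags[:5])  # the first few (word, tag) pairs

# keep only the nouns: any tag starting with "NN" (NN, NNS, NNP, ...)
nouns = [word for word, tag in tags if tag.startswith("NN")]
print(nouns)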
Chunking
from nltk.chunk import RegexpParser
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("died", "VBD")]
grammar = "my_NP: {<DT>?<JJ>*<NN>}"
cp = RegexpParser(grammar)   # build the chunking rule
result = cp.parse(sentence)  # chunk the tagged sentence
print(result)
result.draw()  # draw the parse tree in a separate window
Named entity recognition
nltk.download()  # download "maxent_ne_chunker" and "words"
from nltk import ne_chunk
sentence = "Edison went to Tsinghua University today."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
Data cleaning
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
s = " RT @Amila #Test\nTom's newly listed Co & Mary's unlisted Group to supply tech for nlTK. \nh $TSLA $AAPL https://t.co/x34afsfQsh"
# the stop-word list to use
cache_english_stopwords = stopwords.words("english")
def text_clean(text):
    print("Raw data:", text, "\n")
    # remove HTML entities (e.g. &amp;), hashtags and @mentions
    text_no_special_entities = re.sub(r'&\w*;|#\w*|@\w*', "", text)
    print("After removing special tags:", text_no_special_entities, "\n")
    # remove ticker symbols such as $TSLA (the $ has to be escaped in the regex)
    text_no_tickers = re.sub(r'\$\w*', "", text_no_special_entities)
    print("After removing ticker symbols:", text_no_tickers, "\n")
    # remove hyperlinks
    text_no_hyperlinks = re.sub(r'https?://.*/\w*', "", text_no_tickers)
    print("After removing hyperlinks:", text_no_hyperlinks, "\n")
    # remove very short words (1-2 characters), mostly abbreviations
    text_no_small_words = re.sub(r'\b\w{1,2}\b', "", text_no_hyperlinks)
    print("After removing short words:", text_no_small_words, "\n")
    # collapse extra whitespace
    text_no_whitespace = re.sub(r'\s\s+', " ", text_no_small_words)
    text_no_whitespace = text_no_whitespace.lstrip(" ")
    print("After removing extra whitespace:", text_no_whitespace, "\n")
    # tokenize
    tokens = word_tokenize(text_no_whitespace)
    print("Tokens:", tokens, "\n")
    # remove stop words
    list_no_stopwords = [i for i in tokens if i not in cache_english_stopwords]
    print("After removing stop words:", list_no_stopwords, "\n")
    # final filtered result
    text_filtered = " ".join(list_no_stopwords)
    print("Filtered result:", text_filtered)
text_clean(s)
IV. The spaCy Toolkit
Text processing
# https://spacy.io/usage
# pip install -U spacy
# python -m spacy download en  # run the shell as administrator to download the English model; with several Python installs, run it from the intended one (newer spaCy versions call the model "en_core_web_sm")
import spacy
nlp = spacy.load("en")  # in newer spaCy versions: spacy.load("en_core_web_sm")
doc = nlp("Weather is good, very windy and sunny. We have no classes in the afternoon.")
# tokenization
for token in doc:
    print(token)
# sentence segmentation
for sent in doc.sents:
    print(sent)
Part of speech
POS tag reference: www.winwaed.com/blog/2011/1…
| POS Tag | Description | Example |
|---|---|---|
| CC | coordinating conjunction | and |
| CD | cardinal number | 1, third |
| DT | determiner | the |
| EX | existential there | there is |
| FW | foreign word | d’hoevre |
| IN | preposition/subordinating conjunction | in, of, like |
| JJ | adjective | big |
| JJR | adjective, comparative | bigger |
| JJS | adjective, superlative | biggest |
| LS | list marker | 1) |
| MD | modal | could, will |
| NN | noun, singular or mass | door |
| NNS | noun plural | doors |
| NNP | proper noun, singular | John |
| NNPS | proper noun, plural | Vikings |
| PDT | predeterminer | both the boys |
| POS | possessive ending | friend ‘s |
| PRP | personal pronoun | I, he, it |
| PRP$ | possessive pronoun | my, his |
| RB | adverb | however, usually, naturally, here, good |
| RBR | adverb, comparative | better |
| RBS | adverb, superlative | best |
| RP | particle | give up |
| TO | to | to go, to him |
| UH | interjection | uhhuhhuhh |
| VB | verb, base form | take |
| VBD | verb, past tense | took |
| VBG | verb, gerund/present participle | taking |
| VBN | verb, past participle | taken |
| VBP | verb, sing. present, non-3d | take |
| VBZ | verb, 3rd person sing. present | takes |
| WDT | wh-determiner | which |
| WP | wh-pronoun | who, what |
| WP$ | possessive wh-pronoun | whose |
| WRB | wh-adverb | where, when |
for token in doc:
    print("{}-{}".format(token, token.pos_))
Named entity recognition
doc_2 = nlp("I went to Paris where I met my old friend Jack from uni.")
for ent in doc_2.ents:
    print("{}-{}".format(ent, ent.label_))
# Paris-GPE
# Jack-PERSON
from spacy import displacy
doc = nlp("I went to Paris where I met my old friend Jack from uni.")
displacy.render(doc, style="ent", jupyter=True)  # run in a Jupyter notebook in the browser, not in PyCharm
Finding the names of all the characters in the book
def read_file(file_name):
    with open(file_name, "r") as file:
        return file.read()
# load the text data
text = read_file("./data/pride_and_prejudice.txt")
processed_text = nlp(text)
sentences = [s for s in processed_text.sents]
print(len(sentences))
from collections import Counter, defaultdict  # Counter for counting; defaultdict is used in the next example
def find_person(doc):
    c = Counter()
    for ent in doc.ents:  # use the document passed in
        if ent.label_ == "PERSON":
            c[ent.lemma_] += 1
    return c.most_common(10)
print(find_person(processed_text))
Terrorist attack analysis
def read_file_to_list(file_name):
    with open(file_name, "r") as file:
        return file.readlines()
terrorism_articles = read_file_to_list("data/rand-terrorism-dataset.txt")
print(terrorism_articles[:5])
terrorism_articles_nlp = [nlp(art) for art in terrorism_articles]
common_terrorist_groups = [
    "taliban", "al - qaeda", "hamas", "fatah", "plo", "bilad al - rafidayn"
]
common_locations = [
    "iraq", "baghdad", "kirkuk", "mosul", "afghanistan", "kabul", "basra", "palestine", "gaza", "israel", "istanbul", "beirut", "pakistan"
]
location_entity_dict = defaultdict(Counter)
for article in terrorism_articles_nlp:
    # persons or organisations mentioned in the article
    article_terrorist_groups = [ent.lemma_ for ent in article.ents if ent.label_ == "PERSON" or ent.label_ == "ORG"]
    # places mentioned in the article
    article_locations = [ent.lemma_ for ent in article.ents if ent.label_ == "GPE"]
    terrorist_common = [ent for ent in article_terrorist_groups if ent in common_terrorist_groups]
    locations_common = [ent for ent in article_locations if ent in common_locations]
    for found_entity in terrorist_common:
        for found_location in locations_common:
            location_entity_dict[found_entity][found_location] += 1
print(location_entity_dict)
import pandas as pd
location_entity_df = pd.DataFrame.from_dict(dict(location_entity_dict), dtype=int)
location_entity_df = location_entity_df.fillna(value = 0).astype(int)
print(location_entity_df)
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 10))
hmap = sns.heatmap(location_entity_df, annot=True, fmt="d", cmap="YlGnBu", cbar=False)
# add the title and axis formatting
plt.title("Global Incidents by Terrorist group")
plt.xticks(rotation=30)
plt.show()
V. The Jieba Tokenizer
Word segmentation
# pip install jieba
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)  # cut_all=True: full mode; cut_all=False: precise mode
print("全模式:" + "/".join(seg_list))  # full mode
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("精确模式:" + "/".join(seg_list))  # precise mode
seg_list = jieba.cut("他来到了网易杭研大厦")  # precise mode by default
print(",".join(seg_list))
# 全模式:我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
# 精确模式:我/ 来到/ 北京/ 清华大学
# 他, 来到, 了, 网易, 杭研, 大厦
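jieba.cut() returns a generator; when a plain list is more convenient, jieba.lcut() returns a list directly. A minimal sketch (the expected output matches the precise-mode result above):

import jieba
words = jieba.lcut("我来到北京清华大学")  # like cut(), but returns a list instead of a generator
print(words)  # expected: ['我', '来到', '北京', '清华大学']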
Adding a custom dictionary
text = "故宫的著名景点包括乾清宫、太和殿和黄琉璃瓦等"
# full mode
seg_list = jieba.cut(text, cut_all=True)
print(u"[全模式]:", "/".join(seg_list))
# precise mode
seg_list = jieba.cut(text, cut_all=False)
print(u"[精确模式]:", "/".join(seg_list))
# [全模式]:故宫/ 的/ 著名/ 著名景点/ 景点/ 包括/ 乾/ 清宫/ / / 太和/ 太和殿/ 和/ 黄/ 琉璃/ 琉璃瓦/ 等
# [精确模式]:故宫/ 的/ 著名景点/ 包括/ 乾/ 清宫/ 、/ 太和殿/ 和/ 黄/ 琉璃瓦/ 等
jieba.load_userdict("./data/mydict.txt")  # the dictionary file must be UTF-8 encoded (can be set via "Save As" in an editor)
# alternatively, use jieba.add_word("乾清宫")
# full mode
seg_list = jieba.cut(text, cut_all=True)
print(u"[全模式]:", "/".join(seg_list))
# precise mode
seg_list = jieba.cut(text, cut_all=False)
print(u"[精确模式]:", "/".join(seg_list))
# [全模式]: 故宫/ 的/ 著名/ 著名景点/ 景点/ 包括/ 乾清宫/ 清宫/ / / 太和/ 太和殿/ 和/ 黄琉璃瓦/ 琉璃/ 琉璃瓦/ 等
# [精确模式]: 故宫/ 的/ 著名景点/ 包括/ 乾清宫/ 、/ 太和殿/ 和/ 黄琉璃瓦/ 等
Keyword extraction
import jieba.analyse
seg_list = jieba.cut(text, cut_all=False)
print("分词结果:")
print("/".join(seg_list))
# extract the keywords
tags = jieba.analyse.extract_tags(text, topK=5)
print("关键词:")
print(" ".join(tags))
# 分词结果:
# 故宫/的/著名景点/包括/乾清宫/、/太和殿/和/黄琉璃瓦/等
# 关键词:
# 著名景点 乾清宫 黄琉璃瓦 太和殿 故宫
tags = jieba.analyse.extract_tags(text, topK=5, withWeight=True)  # withWeight=True also returns each keyword's TF-IDF weight
for word, weight in tags:
print(word, weight)
# 著名景点 2.3167796086666668
# 乾清宫 1.9924612504833332
# 黄琉璃瓦 1.9924612504833332
# 太和殿 1.6938346722833335
# 故宫 1.5411195503033335
Part-of-speech tagging
import jieba.posseg as pseg
words = pseg.cut("我爱北京天安门")
for word, flag in words:
print("%s %s" % (word, flag))
# 我 r
# 爱 v
# 北京 ns
# 天安门 ns