Intro to NLP
Use spaCy to create a doc object
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Tea is healthy and calming, don't you think?")
Get the tokens, i.e. the words and punctuation
for token in doc:
    print(token)
Print each token's lemma and whether it is a stopword
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")
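As a small aside (not part of the original course), the lemma and is_stop attributes can be combined to strip stopwords and punctuation from a document:
# A minimal sketch: keep lowercased lemmas of tokens that are
# neither stopwords nor punctuation (illustrative only)
cleaned = [token.lemma_.lower() for token in doc
           if not token.is_stop and not token.is_punct]
print(cleaned)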
Pattern Matching
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", patterns)
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.")
matches = matcher(text_doc)
print(matches)
match_id, start, end = matches[0]
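To make the match readable, look up the match_id in the vocabulary's string store and slice the matched span out of the doc:
# Resolve the match_id to its label name and slice out the matched span
print(nlp.vocab.strings[match_id], text_doc[start:end])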
Text Classification
This part of the tutorial is quite shallow, so it is of limited use.
First, load the data:
import pandas as pd
# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('../input/nlp-course/spam.csv')
spam.head(10)
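As a quick sanity check (assuming the 'label' and 'text' columns used below), the class balance can be inspected:
# Count ham vs. spam messages (assumes the 'label' column used later)
print(spam['label'].value_counts())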
Create the model and add the text classification pipe
import spacy
# Create an empty model
nlp = spacy.blank("en")
# Add the TextCategorizer to the empty model
textcat = nlp.add_pipe("textcat")
Add labels to the classifier
# Add labels to text classifier
textcat.add_label("ham")
textcat.add_label("spam")
Prepare the training data
train_texts = spam['text'].values
train_labels = [{'cats': {'ham': label == 'ham',
                          'spam': label == 'spam'}}
                for label in spam['label']]
train_data = list(zip(train_texts, train_labels))
train_data[:3]
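Each entry in train_data pairs the raw text with a 'cats' dictionary of boolean label scores; a ham message would look roughly like this (the sentence is made up):
# Illustrative shape of one (text, annotations) pair
example_item = ("Are we still on for lunch today?",
                {'cats': {'ham': True, 'spam': False}})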
Train the model
import random
from spacy.util import minibatch
from spacy.training import Example

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()
losses = {}
for epoch in range(10):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)
    # Iterate through minibatches
    for batch in batches:
        for text, labels in batch:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, labels)
            nlp.update([example], sgd=optimizer, losses=losses)
    print(losses)
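Not covered in the original notes, but the trained pipeline can be persisted with spaCy's standard serialization calls (the directory name here is just an example):
# Save the trained pipeline to disk (path is illustrative)
nlp.to_disk("spam_textcat_model")
# Reload it later with: nlp = spacy.load("spam_textcat_model")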
Prediction
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA"]
docs = [nlp.tokenizer(text) for text in texts]
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores = textcat.predict(docs)
print(scores)
# From the scores, find the label with the highest score/probability
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])
Word Embeddings
That is, every word is represented as a multi-dimensional vector: one word, one vector.
import numpy as np
# Word vectors require a model that ships with them, e.g. en_core_web_lg
nlp = spacy.load('en_core_web_lg')
text = "These vectors can be used as features for machine learning models."
# Disabling other pipes because we don't need them and it'll speed up this part a bit
with nlp.disable_pipes():
    vectors = np.array([token.vector for token in nlp(text)])
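spaCy also exposes a document-level vector (the average of the token vectors), which works as a fixed-size feature; a minimal sketch comparing two made-up sentences with cosine similarity:
import numpy as np

doc1 = nlp("I absolutely love this phone.")
doc2 = nlp("This device is fantastic.")
# doc.vector averages the token vectors into one fixed-size vector
a, b = doc1.vector, doc2.vector
print(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))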