Cues
- Model training
- Model hyperparameters and tuning
- Different loss functions
- Outputting multi-class results
Notes
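Text classification with the fasttext tool:
- Get the data
fasttext label convention: every label is marked with the __label__ prefix
- Split the data into a training set and a validation set
- Train the model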
model = fasttext.train_supervised(input = "train.txt")
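- Tune the model
  - Adjust the number of training epochs (epoch)
    - Do not use too many epochs (to avoid overfitting)
  - Adjust the learning rate (lr)
  - Adjust the word n-gram length (wordNgrams)
  - Multi-label, multi-class classification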
model.predict("羽绒服",k=-1,threshold=0.4)
Automatic hyperparameter optimization: autotuneValidationFile="valid.txt", autotuneDuration=100 (duration in seconds)
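To make the wordNgrams setting more concrete, the sketch below lists the word n-grams of a tokenized sentence up to a maximum length. It only illustrates which token sequences can become features; fasttext builds and hashes them internally. word_ngrams is a hypothetical helper written for this note, not part of the fasttext API.

```python
def word_ngrams(tokens, max_n):
    """Return all contiguous word n-grams of length 1..max_n, shortest first."""
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


if __name__ == "__main__":
    # Illustrative tokens only; with max_n=2 this prints the unigrams, then the bigrams
    print(word_ngrams(["实木", "餐桌", "家具"], 2))
```

A larger value lets the model see short phrases instead of isolated words, at the cost of more features and slower training.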
- Generate the dataset that fasttext needs from text files
import jieba
import pandas as pd
# Read the data
df_clothes = pd.read_csv("clothes.csv", encoding='utf-8')
df_clothes = df_clothes.dropna()
df_bag = pd.read_csv("bag.csv", encoding='utf-8')
df_bag = df_bag.dropna()
df_office = pd.read_csv("office.csv", encoding='utf-8')
df_office = df_office.dropna()
df_makeup = pd.read_csv("makeup.csv", encoding='utf-8')
df_makeup = df_makeup.dropna()
df_furniture = pd.read_csv("furniture.csv", encoding='utf-8')
df_furniture = df_furniture.dropna()
# Convert each content column to a Python list (first 15 rows only)
clothes = df_clothes.content.values.tolist()[:15]
bag = df_bag.content.values.tolist()[:15]
office = df_office.content.values.tolist()[:15]
makeup = df_makeup.content.values.tolist()[:15]
furniture= df_furniture.content.values.tolist()[:15]
stopwords=pd.read_csv("chineseStopWords.txt",index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
stopwords = set(stopwords['stopword'].values)  # a set gives fast membership tests
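def preprocess_text(content_lines, sentences, category):
    """Segment each line with jieba and append it to sentences in fasttext format."""
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            # Drop single-character tokens (this removes most punctuation) and stopwords
            segs = list(filter(lambda x: len(x) > 1, segs))
            segs = list(filter(lambda x: x not in stopwords, segs))
            # Build a line of the form: __label__xx word word word ...
            sentences.append("__label__" + str(category) + " " + " ".join(segs))
        except Exception as e:
            # Print the line that could not be processed and move on
            print(line)
            continue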
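preprocess_text only fills the sentences list; the lines still have to be shuffled, split, and written to train.txt and valid.txt, one example per line. A minimal sketch of that step follows. The 80/20 ratio and the fixed seed are illustrative assumptions, not values from these notes, and split_and_write is a hypothetical helper.

```python
import random


def split_and_write(sentences, train_path, valid_path, valid_ratio=0.2, seed=42):
    """Shuffle labeled lines and write them to a training and a validation file."""
    lines = list(sentences)             # copy so the caller's list is not reordered
    random.Random(seed).shuffle(lines)  # fixed seed keeps the split reproducible
    n_valid = int(len(lines) * valid_ratio)
    valid, train = lines[:n_valid], lines[n_valid:]
    for path, subset in ((train_path, train), (valid_path, valid)):
        with open(path, "w", encoding="utf-8") as f:
            for line in subset:
                f.write(line + "\n")
    return len(train), len(valid)
```

After calling preprocess_text once per category on the same sentences list, something like split_and_write(sentences, "train.txt", "valid.txt") would produce the two files used in the training calls.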
Import fasttext and train the model:
import fasttext
model = fasttext.train_supervised(input="train.txt")
- Evaluate the model on the validation set
model.test("valid.txt")
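model.test returns a tuple holding the number of examples, the precision, and the recall (at k=1 by default). The sketch below turns such a tuple into a readable summary and adds an F1 score. summarize_test_result is a hypothetical helper, and it works on any tuple of that shape, so no trained model is needed to try it.

```python
def summarize_test_result(result):
    """Convert a (n_examples, precision, recall) tuple into a dict with F1 added."""
    n, precision, recall = result
    denom = precision + recall
    # Guard against division by zero when both precision and recall are 0
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return {"n": n, "precision": precision, "recall": recall, "f1": f1}
```

Usage would look like summarize_test_result(model.test("valid.txt")), which makes it easier to compare runs with different hyperparameters.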
- Optimize the model
model = fasttext.train_supervised(input="train.txt",autotuneValidationFile="valid.txt",autotuneDuration=100)
model.test("valid.txt")
model = fasttext.train_supervised(input="train.txt",lr=0.01,epoch=10,loss='ova')
model.predict("刺猬 紫檀 餐桌 椅 实木 家具 花 梨木 桌椅 红木 餐台 餐厅 家具",k=-1,threshold=0.4)
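With loss='ova' (one-vs-all), each label is scored independently, so k=-1 combined with threshold=0.4 asks for every label whose score clears the threshold rather than only the single best one. For a single input string, model.predict returns a pair: the labels (still carrying the __label__ prefix) and their probabilities. The sketch below pairs them up in a readable form; readable_predictions is a hypothetical helper, and the sample values in the usage line are made up for illustration.

```python
def readable_predictions(labels, probs, prefix="__label__"):
    """Strip the label prefix and pair each label with its probability."""
    pairs = []
    for label, prob in zip(labels, probs):
        name = label[len(prefix):] if label.startswith(prefix) else label
        pairs.append((name, float(prob)))  # float() also accepts numpy scalars
    return pairs


if __name__ == "__main__":
    # Made-up values standing in for the output of model.predict(...)
    print(readable_predictions(("__label__furniture",), [0.93]))
```

An empty result simply means no label reached the threshold for that input.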
Save and load the model
model.save_model("./model_taobao.bin")
model = fasttext.load_model("./model_taobao.bin")