Product Text Classification Model

Outline

  1. Model training
  2. Model hyperparameters and tuning
  3. Different loss functions
  4. Producing multi-label output

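Notes

Text classification with the fasttext tool:

  • Get the data

    fasttext label format: each label is marked by prefixing it with __label__ (for example, __label__xx)

  • Split the data into a training set and a validation set

  • Train the model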

    model = fasttext.train_supervised(input = "train.txt")

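  • Tune the model

    1. Adjust the number of training epochs (epoch)

      1. Do not use too many epochs (to avoid overfitting)
    2. Adjust the learning rate (lr)

    3. Adjust the word n-gram length (wordNgrams)

    4. Multi-label classification

      model.predict("羽绒服",k=-1,threshold=0.4)

    Automatic hyperparameter optimization: autotuneValidationFile="valid.txt", autotuneDuration=100 (the duration is given in seconds)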
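
The tuning knobs listed above correspond to keyword arguments of fasttext.train_supervised. The following is a minimal sketch, assuming the training file is train.txt. The helper names build_params and train_with, and the specific values, are illustrative assumptions rather than recommended settings.

```python
def build_params(train_path, epoch=25, lr=0.5, word_ngrams=2):
    """Collect the hyperparameters discussed above into keyword arguments."""
    return {
        "input": train_path,         # path to the __label__-formatted training file
        "epoch": epoch,              # number of passes over the training data
        "lr": lr,                    # learning rate
        "wordNgrams": word_ngrams,   # maximum length of word n-grams
    }


def train_with(params):
    """Train a supervised model; calling this requires the fasttext package."""
    import fasttext  # imported lazily so build_params works without fasttext installed
    return fasttext.train_supervised(**params)


# Illustrative values only (assumption): fewer epochs and a moderate learning rate
params = build_params("train.txt", epoch=10, lr=0.1, word_ngrams=2)
print(params)
```

Calling train_with(params) would then train a model with these settings. Keeping the parameters in a dictionary makes it easy to try several combinations and compare them on the validation set.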

Generating the dataset fasttext needs from text files

import jieba
import pandas as pd

# Load the data
df_clothes = pd.read_csv("clothes.csv", encoding='utf-8')
df_clothes = df_clothes.dropna()

df_bag = pd.read_csv("bag.csv", encoding='utf-8')
df_bag = df_bag.dropna()

df_office = pd.read_csv("office.csv", encoding='utf-8')
df_office = df_office.dropna()

df_makeup = pd.read_csv("makeup.csv", encoding='utf-8')
df_makeup = df_makeup.dropna()

df_furniture = pd.read_csv("furniture.csv", encoding='utf-8')
df_furniture = df_furniture.dropna()
# Convert each content column to a Python list (only the first 15 rows are kept)
clothes = df_clothes.content.values.tolist()[:15]
bag = df_bag.content.values.tolist()[:15]
office = df_office.content.values.tolist()[:15]
makeup = df_makeup.content.values.tolist()[:15]
furniture= df_furniture.content.values.tolist()[:15]

stopwords = pd.read_csv("chineseStopWords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
# Use a set so that the membership test in preprocess_text is fast
stopwords = set(stopwords['stopword'].values)

def preprocess_text(content_lines, sentences, category):
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            # Drop punctuation and other single-character tokens, then stopwords
            segs = [seg for seg in segs if len(seg) > 1]
            segs = [seg for seg in segs if seg not in stopwords]
            # Format each sentence as: __label__xx word word word ...
            sentences.append("__label__" + str(category) + " " + " ".join(segs))
        except Exception as e:
            # Report the offending line and keep going
            print(line, e)
            continue
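
The notes mention splitting the data into a training set and a validation set, but the code above stops once the labelled sentences are built. The following is a minimal sketch of the remaining steps, assuming a plain random split. The helper names (to_fasttext_line, split_dataset, write_lines), the 80/20 ratio, and the toy examples are assumptions for illustration and are not part of the original code.

```python
import random


def to_fasttext_line(label, tokens):
    """Format one example as '__label__<label> token token ...'."""
    return "__label__" + str(label) + " " + " ".join(tokens)


def split_dataset(lines, valid_ratio=0.2, seed=42):
    """Shuffle a copy of the lines and return (train_lines, valid_lines)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_ratio)
    return shuffled[n_valid:], shuffled[:n_valid]


def write_lines(path, lines):
    """Write one example per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


# Toy, made-up examples standing in for the output of preprocess_text
sentences = [
    to_fasttext_line("clothes", ["羽绒服", "冬季", "加厚"]),
    to_fasttext_line("bag", ["双肩包", "旅行", "大容量"]),
    to_fasttext_line("office", ["打印纸", "办公", "文具"]),
    to_fasttext_line("makeup", ["口红", "保湿", "持久"]),
    to_fasttext_line("furniture", ["实木", "餐桌", "家具"]),
]

train_lines, valid_lines = split_dataset(sentences, valid_ratio=0.2)
print(len(train_lines), len(valid_lines))
```

Calling write_lines("train.txt", train_lines) and write_lines("valid.txt", valid_lines) would then produce the two files used in the training and evaluation steps.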

Import fasttext and train the model:

import fasttext
model = fasttext.train_supervised(input="train.txt")

  • Evaluate the model on the validation set
model.test("valid.txt")
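
In the fasttext Python bindings, model.test returns a tuple of the number of samples, precision at 1, and recall at 1. A minimal sketch of turning that tuple into a readable summary with an F1 score follows. The helper name summarize_test and the example numbers are made up for illustration and are not results from this dataset.

```python
def summarize_test(result):
    """Unpack (n_samples, precision, recall) and add the F1 score."""
    n_samples, precision, recall = result
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {
        "samples": n_samples,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up tuple standing in for the return value of model.test("valid.txt")
example_result = (200, 0.9, 0.8)
summary = summarize_test(example_result)
print(summary)
```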

  • Optimize the model
model = fasttext.train_supervised(input="train.txt",autotuneValidationFile="valid.txt",autotuneDuration=100)
model.test("valid.txt")

model = fasttext.train_supervised(input="train.txt",lr=0.01,epoch=10,loss='ova')
model.predict("刺猬 紫檀 餐桌 椅 实木 家具 花 梨木 桌椅 红木 餐台 餐厅 家具",k=-1,threshold=0.4)
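
With loss='ova' (one-vs-all), each label is scored by an independent binary classifier, so the label probabilities do not have to sum to 1. Passing k=-1 with threshold=0.4 then asks for every label whose probability reaches the threshold. The sketch below illustrates that selection rule conceptually. It is not fastText's internal implementation, and the raw scores and the helper name select_labels are made up.

```python
import math


def sigmoid(x):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_labels(scores, threshold=0.4):
    """Keep every label whose sigmoid probability reaches the threshold."""
    probs = {label: sigmoid(score) for label, score in scores.items()}
    kept = [(label, p) for label, p in probs.items() if p >= threshold]
    # Highest probability first, as a prediction list would be ordered
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up raw scores for a single product title (assumption)
scores = {
    "__label__furniture": 2.0,
    "__label__office": -0.2,
    "__label__bag": -3.0,
}

selected = select_labels(scores, threshold=0.4)
for label, prob in selected:
    print(label, round(prob, 3))
```

Because each probability is computed on its own, more than one label can pass the threshold, which is what makes this setup suitable for multi-label output.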

Saving and loading the model

model.save_model("./model_taobao.bin")
model = fasttext.load_model("./model_taobao.bin")