【Deep Learning】A Poem-Writing Bot Implemented in TensorFlow


Code: github.com/hjptriplebe… (forks and stars welcome)

The bot is named MC胖虎 (MC Panghu). For now it takes the simplest, most brute-force approach, implemented in TensorFlow; the results are closer to artificial stupidity than artificial intelligence, which fits Panghu's persona nicely.


There is plenty of material online on how LSTMs work; if you are not familiar with them, see: www.jianshu.com/p/9dc9f41f0…

This post focuses on how the poem-writing bot is implemented, rather than on theory or general TensorFlow usage. With that, let's begin.

Preprocessing the Training Data

We use roughly 30,000 Tang poems as training data, available in the dataset folder on GitHub. Each line has the format "题目:诗句" (title:poem), as shown below:
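For instance, a line in this format looks like the following (an illustrative example, not necessarily a verbatim entry from the dataset):

静夜思:床前明月光,疑是地上霜。举头望明月,低头思故乡。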


We first split the title from the body on the ":" character, then clean the data: samples containing special symbols, and poems with too few or too many characters, are filtered out. Finally we wrap each poem in a start and an end marker (square brackets here) so the LSTM can tell where a poem begins and ends.

poems = []
file = open(filename, "r", encoding="utf-8")
for line in file:  # every line is one poem
    title, poem = line.strip().split(":", 1)  # split title and body on the first colon
    poem = poem.replace(' ', '')
    # drop poems containing special symbols
    if '_' in poem or '《' in poem or '[' in poem or '(' in poem or '(' in poem:
        continue
    if len(poem) < 10 or len(poem) > 128:  # drop poems that are too short or too long
        continue
    poem = '[' + poem + ']'  # add start and end signs
    poems.append(poem)
Next, count how often each character occurs and remove the rare ones:

#counting words
allWords = {}
for poem in poems:
    for word in poem:
        if word not in allWords:
            allWords[word] = 1
        else:
            allWords[word] += 1
# erase words which are not common
erase = []
for key in allWords:
    if allWords[key] < 2:
        erase.append(key)
for key in erase:
    del allWords[key]
Sort the characters by frequency and map each character to an ID. Why sort? After sorting, a character's ID roughly reflects how frequently it occurs, and that correlation gives the model an easier pattern to learn than an arbitrary mapping would. A toy illustration follows the code below.

We also append a space character: poems differ in length and will be padded with spaces, so the space needs an ID of its own. Finally, each poem is converted into a vector of character IDs.

wordPairs = sorted(allWords.items(), key=lambda x: -x[1])
words, _ = zip(*wordPairs)
words += (" ", )
wordToID = dict(zip(words, range(len(words))))  # word to ID
wordTOIDFun = lambda A: wordToID.get(A, len(words))  # unknown characters map to len(words)
poemsVector = [[wordTOIDFun(word) for word in poem] for poem in poems]  # poem to vector
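As a quick illustration of the frequency-sorted mapping, here is a toy example with made-up counts (variables renamed so it does not clobber the real ones):

toyCounts = {'月': 5, '[': 3, ']': 3, '鹤': 1}          # hypothetical frequencies
toyPairs = sorted(toyCounts.items(), key=lambda x: -x[1])
toyWords, _ = zip(*toyPairs)                             # ('月', '[', ']', '鹤')
toyToID = dict(zip(toyWords, range(len(toyWords))))      # {'月': 0, '[': 1, ']': 2, '鹤': 3}
# the most frequent character gets the smallest ID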
Next we build the training batches. Every poem in a batch is padded with spaces up to the length of the longest poem in that batch. Since the padding is all spaces, the model can learn the rule that a space is followed by another space. X and Y hold the inputs and targets; Y is X shifted left by one position, i.e., for every input character the target is the character that follows it.

One caveat: you absolutely must use np.copy here, it cost me hours! A small standalone demo after the code block shows what goes wrong without it.

import numpy as np

# pad each batch to that batch's max length
batchNum = (len(poemsVector) - 1) // batchSize
X = []
Y = []
# create batches
for i in range(batchNum):
    batch = poemsVector[i * batchSize: (i + 1) * batchSize]
    maxLength = max(len(vector) for vector in batch)
    temp = np.full((batchSize, maxLength), wordTOIDFun(" "), np.int32)
    for j in range(batchSize):
        temp[j, :len(batch[j])] = batch[j]
    X.append(temp)
    temp2 = np.copy(temp)  # must copy: plain assignment would alias the array stored in X
    temp2[:, :-1] = temp[:, 1:]  # shift left by one character
    Y.append(temp2)
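To see why the copy matters, here is a minimal standalone demonstration with a toy array: without np.copy, temp2 is just a second name for the same array, so the shifted write also mutates the batch already stored in X, and input and target end up identical.

import numpy as np

X, Y = [], []
temp = np.array([[1, 2, 3, 4]])
X.append(temp)            # X stores a reference, not a copy

bad = temp                # plain assignment: same array, second name
bad[:, :-1] = temp[:, 1:]
Y.append(bad)
print(X[0])               # [[2 3 4 4]]  the stored input was shifted too

X, Y = [], []
temp = np.array([[1, 2, 3, 4]])
X.append(temp)
good = np.copy(temp)      # independent array
good[:, :-1] = temp[:, 1:]
Y.append(good)
print(X[0], Y[0])         # [[1 2 3 4]] [[2 3 4 4]]  as intended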

Building the Model

We stack LSTM layers and follow them with a softmax layer that outputs a probability for every character; essentially, copy a standard LSTM template and tweak the parameters. Because the training and generation code below call buildModel, the block is wrapped in that function here.

import tensorflow as tf

def buildModel(wordNum, gtX):
    with tf.variable_scope("embedding"):  # character embedding
        embedding = tf.get_variable("embedding", [wordNum, hidden_units], dtype=tf.float32)
        inputbatch = tf.nn.embedding_lookup(embedding, gtX)

    # one cell object per layer; reusing a single instance breaks on newer TF 1.x
    stackCell = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.BasicLSTMCell(hidden_units, state_is_tuple=True) for _ in range(layers)])
    initState = stackCell.zero_state(np.shape(gtX)[0], tf.float32)
    outputs, finalState = tf.nn.dynamic_rnn(stackCell, inputbatch, initial_state=initState)
    outputs = tf.reshape(outputs, [-1, hidden_units])

    with tf.variable_scope("softmax"):
        w = tf.get_variable("w", [hidden_units, wordNum])
        b = tf.get_variable("b", [wordNum])
        logits = tf.matmul(outputs, w) + b

    probs = tf.nn.softmax(logits)
    return logits, probs, stackCell, initState, finalState
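The snippets here reference several globals (hidden_units, layers, batchSize, and so on) that live elsewhere in the repo. The values below are illustrative assumptions to make the excerpts self-contained, not necessarily the repo's actual settings:

# illustrative hyperparameters (assumptions; see the repo for the real values)
hidden_units = 128                # LSTM hidden size, also used as the embedding size
layers = 2                        # number of stacked LSTM layers
batchSize = 64
epochNum = 20
learningRateBase = 0.001
learningRateDecreaseStep = 1000
generateNum = 5                   # number of poems to generate
checkpointsPath = "./checkpoints"
wordNum = len(words) + 1          # vocabulary size, +1 for the out-of-vocab ID from wordTOIDFun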

Training the Model

First define the input and output placeholders and build the model, then set up the loss function, learning rate, and optimizer.

gtX = tf.placeholder(tf.int32, shape=[batchSize, None])  # input
gtY = tf.placeholder(tf.int32, shape=[batchSize, None])  # target output
logits, probs, _, _, _ = buildModel(wordNum, gtX)
targets = tf.reshape(gtY, [-1])
# loss: per-character cross-entropy, averaged over the batch
loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example([logits], [targets],
                                                          [tf.ones_like(targets, dtype=tf.float32)], wordNum)
cost = tf.reduce_mean(loss)
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), 5)
# feed the learning rate through a placeholder so the decayed value actually reaches
# the optimizer; reassigning a Python float after graph construction has no effect
learningRate = tf.placeholder(tf.float32, shape=[])
optimizer = tf.train.AdamOptimizer(learningRate)
trainOP = optimizer.apply_gradients(zip(grads, tvars))
globalStep = 0
Then start training. First check whether a checkpoint can be found and restore it if so; otherwise train from scratch. Feed the data in batch by batch, decay the learning rate as training progresses, and save the model every so many steps.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    if reload:
        checkPoint = tf.train.get_checkpoint_state(checkpointsPath)
        # if a checkpoint exists, restore it
        if checkPoint and checkPoint.model_checkpoint_path:
            saver.restore(sess, checkPoint.model_checkpoint_path)
            print("restored %s" % checkPoint.model_checkpoint_path)
        else:
            print("no checkpoint found!")

    lr = learningRateBase
    for epoch in range(epochNum):
        if globalStep % learningRateDecreaseStep == 0:  # decay the learning rate by epoch
            lr = learningRateBase * (0.95 ** epoch)
        epochSteps = len(X)  # number of batches per epoch
        for step, (x, y) in enumerate(zip(X, Y)):
            globalStep = epoch * epochSteps + step
            _, batchLoss = sess.run([trainOP, cost],
                                    feed_dict={gtX: x, gtY: y, learningRate: lr})
            print("epoch: %d steps: %d/%d loss: %.3f" % (epoch, step, epochSteps, batchLoss))
            if globalStep % 1000 == 0:
                print("save model")
                saver.save(sess, checkpointsPath + "/poem", global_step=epoch)

Generating Poems

Before generating anything, we need a helper that turns the model's output probabilities into a character. To keep the poems from coming out identical every time, we add some randomness: rather than always picking the most probable character, we map the probabilities onto consecutive intervals of a line and sample a point uniformly at random. A character with higher probability owns a larger interval and is sampled more often, but Panghu occasionally picks a less likely character. Because every character carries this randomness, no two generated poems are alike.

def probsToWord(weights, words):
    """Sample one word from the output probability distribution."""
    t = np.cumsum(weights)  # prefix sums mark the interval boundaries
    s = np.sum(weights)
    coff = np.random.rand(1)  # uniform sample in [0, 1)
    index = int(np.searchsorted(t, coff * s))  # larger intervals are hit more often
    return words[index]
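This prefix-sum-and-search trick is simply sampling from a categorical distribution; with the weights normalized, it is equivalent to np.random.choice. A minimal sketch (probsToWordAlt is a hypothetical name, not part of the repo):

def probsToWordAlt(weights, words):
    """Equivalent sampling via np.random.choice."""
    p = np.asarray(weights).ravel()
    p = p / p.sum()  # normalize to guard against numerical drift
    return words[np.random.choice(len(p), p=p)]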
Now for generation itself. As before, build the model, define the parameters, and load the checkpoint, this time with a batch size of 1.

gtX = tf.placeholder(tf.int32, shape=[1, None])  # input, batch size 1 for generation
logits, probs, stackCell, initState, finalState = buildModel(wordNum, gtX)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    checkPoint = tf.train.get_checkpoint_state(checkpointsPath)
    # if a checkpoint exists, restore it
    if checkPoint and checkPoint.model_checkpoint_path:
        saver.restore(sess, checkPoint.model_checkpoint_path)
        print("restored %s" % checkPoint.model_checkpoint_path)
    else:
        print("no checkpoint found!")
        exit(1)  # cannot generate without a trained model
We then generate generateNum poems. Each poem starts from the '[' token and ends as soon as a ']' or a space is produced; at every step, probsToWord converts the output distribution into a character.

poems = []
for i in range(generateNum):
    state = sess.run(stackCell.zero_state(1, tf.float32))
    x = np.array([[wordToID['[']]])  # start from the start sign
    probs1, state = sess.run([probs, finalState], feed_dict={gtX: x, initState: state})
    word = probsToWord(probs1, words)
    poem = ''
    while word != ']' and word != ' ':
        poem += word
        if word == '。':
            poem += '\n'
        x = np.array([[wordToID[word]]])
        probs2, state = sess.run([probs, finalState], feed_dict={gtX: x, initState: state})
        word = probsToWord(probs2, words)
    print(poem)
    poems.append(poem)
We can also write acrostic poems (藏头诗). Building the model and loading the checkpoint work exactly as above; in the generation part, whenever a punctuation mark is reached, we simply force the next input character to be the next designated head character. One thing to note: after the punctuation mark we discard the character the model sampled, so we must still feed the punctuation through the network to roll the state forward, skipping that character's generation step.

flag = 1
endSign = {-1: ",", 1: "。"}
poem = ''
state = sess.run(stackCell.zero_state(1, tf.float32))
x = np.array([[wordToID['[']]])
# feed the start sign only to advance the state; its output is not used
probs1, state = sess.run([probs, finalState], feed_dict={gtX: x, initState: state})
for c in characters:
    word = c
    flag = -flag  # alternate line endings between "," and "。"
    while word != ']' and word != ',' and word != '。' and word != ' ':
        poem += word
        x = np.array([[wordToID[word]]])
        probs2, state = sess.run([probs, finalState], feed_dict={gtX: x, initState: state})
        word = probsToWord(probs2, words)

    poem += endSign[flag]
    # the sampled character is discarded, so feed the punctuation mark through
    # the network to keep the context and roll the state forward past it
    if endSign[flag] == '。':
        probs2, state = sess.run([probs, finalState],
                                 feed_dict={gtX: np.array([[wordToID["。"]]]), initState: state})
        poem += '\n'
    else:
        probs2, state = sess.run([probs, finalState],
                                 feed_dict={gtX: np.array([[wordToID[","]]]), initState: state})

print(characters)
print(poem)
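For example, calling this with characters = "深度学习" (a hypothetical input) yields four phrases beginning with 深, 度, 学, and 习 respectively; phrase endings alternate between "," and "。", with a line break after each "。", giving the familiar two-phrases-per-line layout.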
Training for roughly 20 epochs on a GPU already gives decent results!

Code: github.com/hjptriplebe… (forks and stars welcome)

A follow-up is likely: MC胖虎 2.0, a bot that looks at a picture and writes a poem about it.

After all this talk, Panghu must be getting angry!