看到手册里面的标题就激动,难道这就是传说中的NLP问题,研究了下发现是根据影评,判断电影好坏的一个二分类问题。这里需要说明下,由于前面学习了手册里面的基本分类,这个上手就很省事。如果有人看到了我写的东西,建议不要直接拷贝代码,还是根据手册一步一步做,然后自己规整代码,感觉学到的东西会多些。如果google里面写手册的这个大神直接把现成的代码直接放上去,我可能也就直接拷贝,然后一运行,也学不牢靠,越看越觉得这手册写的好,十分感谢!

文本分类源码

(venv) xianfa@xianfa:~/machinelearning/study$ cat kerastextclassification.py 
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

#根据索引单词组成句子
def decode_review(reverse_word_index, text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

def my_decode_review(reverse_word_index, text):
    result = ''
    for i in text:
        result += ' ' + reverse_word_index[i]
    return result

#画出训练和验证的val以及loss曲线
def drawfigure(epochs, bodata, bolabel, bdata, blabel, title, xlabel, ylabel, savefilename):
    plt.clf()   # clear figure
    # "bo" is for "blue dot"
    plt.plot(epochs, bodata, 'bo', label=bolabel)
    # b is for "solid blue line"
    plt.plot(epochs, bdata, 'b', label=blabel)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.legend()
    plt.savefig(savefilename)


#获取训练和测试数据
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

#打印一个实际的测试数据,以便了解原始数据形式
exampleindex = 10
print('\ntrain_data example:' + str(train_data[exampleindex]) + '\n')
print('\ntrain_label example:' + str(train_labels[exampleindex]) + '\n')

#key:单词 value:单词索引整数值
word_index = imdb.get_word_index()

#0-3索引值保留,原来的索引值均加3,并加上保留的三个,这里有个疑问,在数据处理的时候为什么不直接处理好,在这里处理下感觉怪怪的。
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

#一个反向索引,key:单词索引整数值 value:单词
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

#根据句子单词索引,建立原来的句子,打印出来,直观的了解数据。
examplecomment = decode_review(reverse_word_index, train_data[exampleindex])
#examplecomment = my_decode_review(reverse_word_index, train_data[exampleindex])
print('example comment:' + examplecomment + '\n')

#格式化处理数据,统一处理为256元素,长度不足的后面补0,大于256的,截取后面的256个单词,这样比较合理,一般影评前面会讲剧情,后面说是否推荐。
#当然了对于先说是否推荐,后面说内容的数据,显然这样处理就不合适
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
        value=word_index["<PAD>"],
        padding='post',
        maxlen=256)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
        value=word_index["<PAD>"],
        padding='post',
        maxlen=256)

#input shape is the vocabulary count used for the movie reviews (10,000 words)
#这里不明白,这里是指用的词汇量不超过10000个,还是指的单篇评论不超过10000单词
vocab_size = 10000

#神经网络模型 对比基本分类,又学习到一种写法。
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

#打印出模型概要信息,这个好,基本分类也要加上
model.summary()

#编译
model.compile(optimizer=tf.train.AdamOptimizer(),
        loss='binary_crossentropy',
        metrics=['accuracy'])

#前一万个数据作为验证数据,后面的数据作为训练数据 训练模型
x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

history = model.fit(partial_x_train,
        partial_y_train,
        epochs=40,
        batch_size=512,
        validation_data=(x_val, y_val),
        verbose=1)

#评估测试数据 并打印评估结果
results = model.evaluate(test_data, test_labels)
print(results)

#作图需要的数据
history_dict = history.history
acc = history_dict['acc']
val_acc = history_dict['val_acc']
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs = range(1, len(acc) + 1)

#画出训练和验证的loss和acc曲线
drawfigure(epochs, loss, 'Training loss', val_loss, 'Validation loss', 'Training and validation loss', 'Epochs', 'Loss', './epochs-loss.jpg')
drawfigure(epochs, acc, 'Training acc', val_acc, 'Validation acc', 'Training and validation accuracy', 'Epochs', 'Accuracy', './epochs-accuracy.jpg')

源码分析

执行结果

(venv) xianfa@xianfa:~/machinelearning/study$ python kerastextclassification.py 

train_data example:[1, 785, 189, 438, 47, 110, 142, 7, 6, 7475, 120, 4, 236, 378, 7, 153, 19, 87, 108, 141, 17, 1004, 5, 2, 883, 2, 23, 8, 4, 136, 2, 2, 4, 7475, 43, 1076, 21, 1407, 419, 5, 5202, 120, 91, 682, 189, 2818, 5, 9, 1348, 31, 7, 4, 118, 785, 189, 108, 126, 93, 2, 16, 540, 324, 23, 6, 364, 352, 21, 14, 9, 93, 56, 18, 11, 230, 53, 771, 74, 31, 34, 4, 2834, 7, 4, 22, 5, 14, 11, 471, 9, 2, 34, 4, 321, 487, 5, 116, 15, 6584, 4, 22, 9, 6, 2286, 4, 114, 2679, 23, 107, 293, 1008, 1172, 5, 328, 1236, 4, 1375, 109, 9, 6, 132, 773, 2, 1412, 8, 1172, 18, 7865, 29, 9, 276, 11, 6, 2768, 19, 289, 409, 4, 5341, 2140, 2, 648, 1430, 2, 8914, 5, 27, 3000, 1432, 7130, 103, 6, 346, 137, 11, 4, 2768, 295, 36, 7740, 725, 6, 3208, 273, 11, 4, 1513, 15, 1367, 35, 154, 2, 103, 2, 173, 7, 12, 36, 515, 3547, 94, 2547, 1722, 5, 3547, 36, 203, 30, 502, 8, 361, 12, 8, 989, 143, 4, 1172, 3404, 10, 10, 328, 1236, 9, 6, 55, 221, 2989, 5, 146, 165, 179, 770, 15, 50, 713, 53, 108, 448, 23, 12, 17, 225, 38, 76, 4397, 18, 183, 8, 81, 19, 12, 45, 1257, 8, 135, 15, 2, 166, 4, 118, 7, 45, 2, 17, 466, 45, 2, 4, 22, 115, 165, 764, 6075, 5, 1030, 8, 2973, 73, 469, 167, 2127, 2, 1568, 6, 87, 841, 18, 4, 22, 4, 192, 15, 91, 7, 12, 304, 273, 1004, 4, 1375, 1172, 2768, 2, 15, 4, 22, 764, 55, 5773, 5, 14, 4233, 7444, 4, 1375, 326, 7, 4, 4760, 1786, 8, 361, 1236, 8, 989, 46, 7, 4, 2768, 45, 55, 776, 8, 79, 496, 98, 45, 400, 301, 15, 4, 1859, 9, 4, 155, 15, 66, 2, 84, 5, 14, 22, 1534, 15, 17, 4, 167, 2, 15, 75, 70, 115, 66, 30, 252, 7, 618, 51, 9, 2161, 4, 3130, 5, 14, 1525, 8, 6584, 15, 2, 165, 127, 1921, 8, 30, 179, 2532, 4, 22, 9, 906, 18, 6, 176, 7, 1007, 1005, 4, 1375, 114, 4, 105, 26, 32, 55, 221, 11, 68, 205, 96, 5, 4, 192, 15, 4, 274, 410, 220, 304, 23, 94, 205, 109, 9, 55, 73, 224, 259, 3786, 15, 4, 22, 528, 1645, 34, 4, 130, 528, 30, 685, 345, 17, 4, 277, 199, 166, 281, 5, 1030, 8, 30, 179, 4442, 444, 2, 9, 6, 371, 87, 189, 22, 5, 31, 7, 4, 118, 7, 4, 2068, 545, 1178, 829]


train_label example:1

example comment:<START> french horror cinema has seen something of a revival over the last couple of years with great films such as inside and <UNK> romance <UNK> on to the scene <UNK> <UNK> the revival just slightly but stands head and shoulders over most modern horror titles and is surely one of the best french horror films ever made <UNK> was obviously shot on a low budget but this is made up for in far more ways than one by the originality of the film and this in turn is <UNK> by the excellent writing and acting that ensure the film is a winner the plot focuses on two main ideas prison and black magic the central character is a man named <UNK> sent to prison for fraud he is put in a cell with three others the quietly insane <UNK> body building <UNK> marcus and his retarded boyfriend daisy after a short while in the cell together they stumble upon a hiding place in the wall that contains an old <UNK> after <UNK> part of it they soon realise its magical powers and realise they may be able to use it to break through the prison walls br br black magic is a very interesting topic and i'm actually quite surprised that there aren't more films based on it as there's so much scope for things to do with it it's fair to say that <UNK> makes the best of it's <UNK> as despite it's <UNK> the film never actually feels restrained and manages to flow well throughout director eric <UNK> provides a great atmosphere for the film the fact that most of it takes place inside the central prison cell <UNK> that the film feels very claustrophobic and this immensely benefits the central idea of the prisoners wanting to use magic to break out of the cell it's very easy to get behind them it's often said that the unknown is the thing that really <UNK> people and this film proves that as the director <UNK> that we can never really be sure of exactly what is round the corner and this helps to ensure that <UNK> actually does manage to be quite frightening the film is memorable for a lot of reasons outside the central plot the characters are all very interesting in their own way and the fact that the book itself almost takes on its own character is very well done anyone worried that the film won't deliver by the end won't be disappointed either as the ending both makes sense and manages to be quite horrifying overall <UNK> is a truly great horror film and one of the best of the decade highly recommended viewing

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
Train on 15000 samples, validate on 10000 samples
Epoch 1/40
2019-01-21 14:43:32.942563: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
15000/15000 [==============================] - 1s 99us/step - loss: 0.6918 - acc: 0.5769 - val_loss: 0.6897 - val_acc: 0.6673
Epoch 2/40
15000/15000 [==============================] - 1s 75us/step - loss: 0.6856 - acc: 0.7239 - val_loss: 0.6814 - val_acc: 0.7441
Epoch 3/40
15000/15000 [==============================] - 1s 76us/step - loss: 0.6729 - acc: 0.7618 - val_loss: 0.6655 - val_acc: 0.7588
Epoch 4/40
15000/15000 [==============================] - 1s 75us/step - loss: 0.6503 - acc: 0.7727 - val_loss: 0.6406 - val_acc: 0.7663
Epoch 5/40
15000/15000 [==============================] - 1s 76us/step - loss: 0.6175 - acc: 0.7954 - val_loss: 0.6057 - val_acc: 0.7876
Epoch 6/40
15000/15000 [==============================] - 1s 82us/step - loss: 0.5764 - acc: 0.8154 - val_loss: 0.5660 - val_acc: 0.8038
Epoch 7/40
15000/15000 [==============================] - 1s 77us/step - loss: 0.5306 - acc: 0.8338 - val_loss: 0.5241 - val_acc: 0.8199
Epoch 8/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.4846 - acc: 0.8466 - val_loss: 0.4838 - val_acc: 0.8345
Epoch 9/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.4420 - acc: 0.8597 - val_loss: 0.4473 - val_acc: 0.8432
Epoch 10/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.4024 - acc: 0.8731 - val_loss: 0.4157 - val_acc: 0.8504
Epoch 11/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.3689 - acc: 0.8812 - val_loss: 0.3920 - val_acc: 0.8553
Epoch 12/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.3411 - acc: 0.8873 - val_loss: 0.3689 - val_acc: 0.8630
Epoch 13/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.3163 - acc: 0.8951 - val_loss: 0.3529 - val_acc: 0.8673
Epoch 14/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.2958 - acc: 0.9008 - val_loss: 0.3388 - val_acc: 0.8731
Epoch 15/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.2782 - acc: 0.9061 - val_loss: 0.3279 - val_acc: 0.8750
Epoch 16/40
15000/15000 [==============================] - 1s 95us/step - loss: 0.2631 - acc: 0.9091 - val_loss: 0.3190 - val_acc: 0.8768
Epoch 17/40
15000/15000 [==============================] - 1s 73us/step - loss: 0.2486 - acc: 0.9155 - val_loss: 0.3117 - val_acc: 0.8770
Epoch 18/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.2361 - acc: 0.9202 - val_loss: 0.3056 - val_acc: 0.8811
Epoch 19/40
15000/15000 [==============================] - 1s 73us/step - loss: 0.2244 - acc: 0.9239 - val_loss: 0.3006 - val_acc: 0.8823
Epoch 20/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.2143 - acc: 0.9273 - val_loss: 0.2966 - val_acc: 0.8835
Epoch 21/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.2043 - acc: 0.9301 - val_loss: 0.2932 - val_acc: 0.8822
Epoch 22/40
15000/15000 [==============================] - 1s 73us/step - loss: 0.1952 - acc: 0.9344 - val_loss: 0.2909 - val_acc: 0.8838
Epoch 23/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.1870 - acc: 0.9372 - val_loss: 0.2894 - val_acc: 0.8841
Epoch 24/40
15000/15000 [==============================] - 1s 73us/step - loss: 0.1787 - acc: 0.9425 - val_loss: 0.2872 - val_acc: 0.8850
Epoch 25/40
15000/15000 [==============================] - 1s 90us/step - loss: 0.1714 - acc: 0.9449 - val_loss: 0.2861 - val_acc: 0.8851
Epoch 26/40
15000/15000 [==============================] - 1s 75us/step - loss: 0.1641 - acc: 0.9473 - val_loss: 0.2859 - val_acc: 0.8846
Epoch 27/40
15000/15000 [==============================] - 1s 75us/step - loss: 0.1581 - acc: 0.9510 - val_loss: 0.2859 - val_acc: 0.8863
Epoch 28/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.1516 - acc: 0.9541 - val_loss: 0.2855 - val_acc: 0.8867
Epoch 29/40
15000/15000 [==============================] - 1s 73us/step - loss: 0.1457 - acc: 0.9549 - val_loss: 0.2857 - val_acc: 0.8866
Epoch 30/40
15000/15000 [==============================] - 1s 73us/step - loss: 0.1405 - acc: 0.9573 - val_loss: 0.2866 - val_acc: 0.8860
Epoch 31/40
15000/15000 [==============================] - 1s 72us/step - loss: 0.1344 - acc: 0.9601 - val_loss: 0.2878 - val_acc: 0.8869
Epoch 32/40
15000/15000 [==============================] - 1s 73us/step - loss: 0.1296 - acc: 0.9625 - val_loss: 0.2894 - val_acc: 0.8860
Epoch 33/40
15000/15000 [==============================] - 1s 74us/step - loss: 0.1241 - acc: 0.9647 - val_loss: 0.2910 - val_acc: 0.8864
Epoch 34/40
15000/15000 [==============================] - 1s 93us/step - loss: 0.1196 - acc: 0.9666 - val_loss: 0.2928 - val_acc: 0.8863
Epoch 35/40
15000/15000 [==============================] - 1s 73us/step - loss: 0.1155 - acc: 0.9674 - val_loss: 0.2945 - val_acc: 0.8860
Epoch 36/40
15000/15000 [==============================] - 1s 73us/step - loss: 0.1105 - acc: 0.9697 - val_loss: 0.2972 - val_acc: 0.8857
Epoch 37/40
15000/15000 [==============================] - 1s 72us/step - loss: 0.1065 - acc: 0.9705 - val_loss: 0.2997 - val_acc: 0.8854
Epoch 38/40
15000/15000 [==============================] - 1s 73us/step - loss: 0.1030 - acc: 0.9715 - val_loss: 0.3022 - val_acc: 0.8843
Epoch 39/40
15000/15000 [==============================] - 1s 75us/step - loss: 0.0986 - acc: 0.9734 - val_loss: 0.3048 - val_acc: 0.8832
Epoch 40/40
15000/15000 [==============================] - 1s 73us/step - loss: 0.0948 - acc: 0.9755 - val_loss: 0.3076 - val_acc: 0.8837
25000/25000 [==============================] - 1s 21us/step
[0.3284199892759323, 0.87224]

这是控制台输出的结果,另外还生成了两张jpg文件,代码我都加了注释,具体过程就略过,下面是附图

训练评估准确率图
从上图可以看出,validation acc并没有随着训练次数的增长一直增长,而是在训练20次之后基本恒定
训练评估loss图
从上图可以看出,validation loss并没有随着训练次数的增长一直减小,而是在训练25次之后基本恒定

两张图胜过千言万语,前面学习基本分类的时候,发现了一个疑问,及epochs取多少合适?在这次学习中,这个实例告诉了答案,此处,显然取25较为合理,虽然取更大的值可以增加训练精度和减少训练loss,但后面的训练只不过是过拟合,并无实用价值。以后遇到实际的分类预测问题只要画出这两张图,便可以直观的给出一个较为合理的值。当然了,图例展示的是数据,根据数据肯定也是可以直接计算出一个较为合理的值的。

学习总结

解决了一个epoch次数多少合理的疑问,对于程序进行了不断修改实验,加深了印象,例如# “bo” is for “blue dot”,根据这个猜测用ro,go进行了测试。最重要的是model.summary(),终于看到了参数,我的天16w的参数,以前的回归分析里面只是简单处理了两个参数,这里居然这么多,他们的含义是什么?把这个完全搞懂估计神经网络才能真正学会。另外这里的处理感觉很神奇,前面的基本分类,如果说是根据衣服形状分类的话,这里的单词是根据什么分出好坏的?总觉得不够形象,但也可以理解,好的词语组合,总是表示好的意思,大概是这样,加油!