终于将tensorflow里面回归这部分的代码整理的差不多,可以写点东西了。前面都是先贴代码,然后看运行结果,接着再记录自己遇到的问题以及相关理解,最后做一个总结。后面记录的方式需要改一改,这么说吧,既然教程里面的都是面对特定问题的,那么首先就对问题进行描述,把数据格式写一下,后面就是每个实例里面额外需要注意的东西。接着再贴出代码以及执行的结果,记录自己在学习的过程中学到的东西和遇到问题,最后做一个总结。这次学习还是挺不容易的,不断问搜索大神,连蒙带猜,总算有些收获。

问题描述

回归是对连续值的预测,本例中需要根据已知的数据训练出模型,并预测燃油效率。该值应该是对应数据项中的MPG,而用来作为参数的数据有气缸数目,排量,马力和重量等,具体可以看程序输出的normed_train_data.tail。
在本实例中有一个比较重要的地方是,让我们的模型在训练到一定程度后自动停止训练,正如前面文本分类里面学到的epoch的合理取值,其实想想也是,假设一个超级模型需要训练好多天,我们不能等这么久再画出loss,acc和评估比对的图,再定合理的epoch值,达到一定条件自动停止就行。
训练次数平均错误绝对值
上图模型训练了1000次,可以看出,val error 并没有随着训练次数的增加而下降。
训练次数平均错误绝对值
上图本来也指定训练1000次,不过增加了earlystop监控val_loss满足条件后就不再训练,图中显示本例训练了175次。

回归源码

(venv) xianfa@xianfa:~/machinelearning/study$ cat kerasregression.py 
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Display training progress by printing a single dot for each completed epoch
class PrintDot(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        print('.', end='')
        if (epoch+1) % 100 == 0:
            print(epoch+1)

def build_model():
    model = keras.Sequential([
    layers.Dense(64, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, activation=tf.nn.relu),
    layers.Dense(1)
    ])
    
    optimizer = tf.train.RMSPropOptimizer(0.001)
    model.compile(loss='mse',
            optimizer=optimizer,
            metrics=['mae', 'mse'])
    return model

def plot_history(hist, savefileprefix):
    plt.clf()
    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Abs Error [MPG]')
    plt.plot(hist['epoch'], hist['mean_absolute_error'],
            label='Train Error')
    plt.plot(hist['epoch'], hist['val_mean_absolute_error'],
            label = 'Val Error')
    plt.legend()
    plt.ylim([0,5])
    plt.savefig('./' + savefileprefix + '-epoch-meanabserror.jpg')
    
    plt.clf()
    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Square Error [$MPG^2$]')
    plt.plot(hist['epoch'], hist['mean_squared_error'],
            label='Train Error')
    plt.plot(hist['epoch'], hist['val_mean_squared_error'],
            label = 'Val Error')
    plt.legend()
    plt.ylim([0,20])
    plt.savefig('./' + savefileprefix + '-epoch-meansquarederror.jpg')

def get_dataset():
    #download first if not exist data
    dataset_path = keras.utils.get_file("auto-mpg.data", "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
    column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight', 'Acceleration', 'Model Year', 'Origin']
    
    #read data from csvfile
    raw_dataset = pd.read_csv(dataset_path, names=column_names, na_values = "?", comment='\t', sep=" ", skipinitialspace=True)
    
    #copy data and drop useless data
    dataset = raw_dataset.copy()
    dataset = dataset.dropna()
    
    #deal with origin
    origin = dataset.pop('Origin')
    dataset['USA'] = (origin == 1)*1.0
    dataset['Europe'] = (origin == 2)*1.0
    dataset['Japan'] = (origin == 3)*1.0
    return dataset

if __name__ == '__main__':
    #get data and prehandle the data
    dataset = get_dataset()
    
    #divid into train and test data
    train_dataset = dataset.sample(frac=0.8,random_state=0)
    test_dataset = dataset.drop(train_dataset.index)
    
    #draw the train data
    plt.clf()
    sns.pairplot(train_dataset[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")
    plt.savefig('./pairplot.jpg')
    
    #prehandle train_stats used to normalize train_data and test_data
    train_stats = train_dataset.describe()
    train_stats.pop("MPG")
    train_stats = train_stats.transpose()
    print('\ntrain_status:\n' + str(train_stats))
    
    #target values
    train_labels = train_dataset.pop('MPG')
    test_labels = test_dataset.pop('MPG')
    
    #normalize
    normed_train_data = (train_dataset - train_stats['mean']) / train_stats['std']
    normed_test_data = (test_dataset - train_stats['mean']) / train_stats['std']

    #farmiliar with the data, tail head inspect the last or ahead 5 data.
    print('\nnormed_train_data.tail:\n' + str(normed_train_data.tail()))
    
    #build model
    model = build_model()
    model.summary()
    
    #test predict.question:Can it predict without training? Is it right? Whether just the primitive parameters?
    example_batch = normed_train_data[:10]
    example_result = model.predict(example_batch)
    print('simple predict with out training:\n' + str(example_result))

    #training all model
    EPOCHS = 1000
    print('\ntrain all epoch start:')
    history = model.fit(
            normed_train_data, train_labels,
            epochs=EPOCHS, validation_split = 0.2, verbose=0,
            callbacks=[PrintDot()])
    print('\ntrain all epoch finish')

    #handle training history data
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    
    #draw train epoch info
    plot_history(hist, 'trainall')

    #training partial model
    model = build_model()
    # The patience parameter is the amount of epochs to check for improvement
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=50)

    print('\ntrain partial epoch start:')
    earlystophistory = model.fit(
            normed_train_data, train_labels,
            epochs=EPOCHS,validation_split = 0.2, verbose=0,
            callbacks=[early_stop, PrintDot()])
    print('\ntrain partial epoch finish')
    
    #handle training history data
    earlystophist = pd.DataFrame(earlystophistory.history)
    earlystophist['epoch'] = earlystophistory.epoch

    #draw train epoch info
    plot_history(earlystophist, 'trainpatience')
    
    loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=0)
    print("\nTesting set Mean Abs Error: {:5.2f} MPG".format(mae))
    
    #predict the test_data
    test_predictions = model.predict(normed_test_data).flatten()
    
    #draw truevalue and predictions
    plt.clf()
    plt.scatter(test_labels, test_predictions)
    plt.xlabel('True Values [MPG]')
    plt.ylabel('Predictions [MPG]')
    plt.axis('equal')
    plt.axis('square')
    plt.xlim([0,plt.xlim()[1]])
    plt.ylim([0,plt.ylim()[1]])
    plt.plot([-100, 100], [-100, 100])
    plt.savefig('./truevalue-predictions.jpg')
    
    #draw prediction error and its count
    plt.clf()
    error = test_predictions - test_labels
    plt.hist(error, bins = 25)
    plt.xlabel("Prediction Error [MPG]")
    plt.ylabel("Count")
    plt.savefig('./predictionerror-count.jpg')

源码分析

执行结果

(venv) xianfa@xianfa:~/machinelearning/study$ python kerasregression.py 

train_status:
              count         mean         std     min      25%     50%      75%     max
Cylinders     314.0     5.477707    1.699788     3.0     4.00     4.0     8.00     8.0
Displacement  314.0   195.318471  104.331589    68.0   105.50   151.0   265.75   455.0
Horsepower    314.0   104.869427   38.096214    46.0    76.25    94.5   128.00   225.0
Weight        314.0  2990.251592  843.898596  1649.0  2256.50  2822.5  3608.00  5140.0
Acceleration  314.0    15.559236    2.789230     8.0    13.80    15.5    17.20    24.8
Model Year    314.0    75.898089    3.675642    70.0    73.00    76.0    79.00    82.0
USA           314.0     0.624204    0.485101     0.0     0.00     1.0     1.00     1.0
Europe        314.0     0.178344    0.383413     0.0     0.00     0.0     0.00     1.0
Japan         314.0     0.197452    0.398712     0.0     0.00     0.0     0.00     1.0

normed_train_data.tail:
     Cylinders  Displacement  Horsepower    Weight  Acceleration  Model Year       USA    Europe     Japan
281   0.307270      0.044872   -0.521559 -0.000298      0.946772    0.843910  0.774676 -0.465148 -0.495225
229   1.483887      1.961837    1.972127  1.457223     -1.598734    0.299787  0.774676 -0.465148 -0.495225
150  -0.869348     -0.836932   -0.311564 -0.710099     -0.021237   -0.516397 -1.286751 -0.465148  2.012852
145  -0.869348     -1.076553   -1.151543 -1.169870      1.233589   -0.516397 -1.286751 -0.465148  2.012852
182  -0.869348     -0.846517   -0.495310 -0.623596     -0.021237    0.027726 -1.286751  2.143005 -0.495225
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 64)                640       
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
=================================================================
Total params: 4,865
Trainable params: 4,865
Non-trainable params: 0
_________________________________________________________________
2019-01-22 11:39:00.361674: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
simple predict with out training:
[[0.00553696]
 [0.11620329]
 [0.39349222]
 [0.07164159]
 [0.23324086]
 [0.33498934]
 [0.2732039 ]
 [0.96387064]
 [0.35066384]
 [0.03314599]]

train all epoch start:
....................................................................................................100
....................................................................................................200
....................................................................................................300
....................................................................................................400
....................................................................................................500
....................................................................................................600
....................................................................................................700
....................................................................................................800
....................................................................................................900
....................................................................................................1000

train all epoch finish

train partial epoch start:
....................................................................................................100
.......................................................................
train partial epoch finish

Testing set Mean Abs Error:  2.01 MPG

这是控制台输出的结果,另外还生成了几张jpg文件,下面两个是最后预测值与实际值,预测值与真实值差值的长条分布图。

预测实际值图
从上图可以看出,预测值与真实值基本分布在y=x上,可以认为可靠。
预测错误长条分布图
从上图可以看出,差值越小次数越多,很漂亮的分布。

知识问题

首先是直观的知识点,print(‘.’, end=’‘)不会打印出换行,print(“\nTesting set Mean Abs Error: {:5.2f} MPG”.format(mae))格式化字符串的方法,normed_train_data.tail(),head()获取前后5条数据等等
接着是pandas和seaborn算初见,知道了pandas可以用来读取csv和pd.DataFrame这个用法,至于seaborn,先看下程序中用它画的图
联合密度分布
这个图踩的坑比较多,开始我以为下面是横坐标,左边是纵坐标,然后画的点,对于不同的量很好理解,而对于相同的量百思不得其解,相同的咋画都应该是一条y=x的直线。通过不断提问搜索大神,连蒙带猜,估计应该是密度函数,尤其是看了汽缸数目的图,更加肯定,因为例子中的数据好像都是4,6,8.暂时我就这么理解。当然对于这种第三方库,最好是能全面的学习下它们的教程。对于我来说,那种大而全的学习不太适合,遇到实际问题,然后再学,可能更适合我,当然了等学到的点多了,就可以看大而全的教程。
开始处理自动停止训练部分时,出来的图总是全部训练的图,跟我想的不一样。后面经过整理代码,终于发现,plot_history我传递的参数没有使用,而是已知在内部使用hist,才导致的结果。
在执行过程中,中间训练的时候显示的信息与前面的文本分类不同,经过试验发现是verbose置为1就可以出来了,开始以为verbose是遍历的意思,查了字典发现是繁琐的意思,繁琐不就是详细么。疑问就是这次训练打出来数据与前面不同,没有acc,估计是和模型有关,每个模型有什么训练结果参数不清楚,也没有深究,不懂得太多。

学习总结

原以为自己最开始就是以一个简单得回归问题开始学习,这次应该简单好学,结果发现恰恰相反,本例是目前我觉得最难的。这里面需要学到东西太多,所以很多疑问并没有一一列出。从代码角度,还是要好好规整代码,代码越清晰,问题也越容易查。还要规整打印到控制台的信息,不仅要美观还要实用。关于代码注释,中文英文均可,并不强求,加油!