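I. Overview
Section 3.16 of the book, extended a little, can serve as a framework for Kaggle competitions. The competition used here is House Prices: Advanced Regression Techniques, which is a regression problem.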
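II. The general deep learning workflow
Following the summary of the machine learning workflow in Li Hang's "Statistical Learning Methods", the process is divided into data, model, strategy, algorithm, training, and prediction.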
1. Data
1.1 Read data
# imports needed by the code in this section
import pandas as pd
from mxnet import nd
# read data
train_data = pd.read_csv('./d2l-zh-1.1/data/kaggle_house_pred_train.csv')
test_data = pd.read_csv('./d2l-zh-1.1/data/kaggle_house_pred_test.csv')
# print(train_data.shape)
# print(train_data.iloc[0:4, [0, 1, 2, -1, -2, -3]])
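1.2 Preprocess data
# concatenate train and test features, dropping the Id column (and the label column of the training set)
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
# standardize the numeric features to zero mean and unit variance
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / x.std())
# after standardization every feature has mean 0, so missing values can simply be replaced with 0
all_features[numeric_features] = all_features[numeric_features].fillna(0)
# convert discrete values to dummy variables; dummy_na=True treats a missing value as its own category
all_features = pd.get_dummies(all_features, dummy_na=True)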
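To make the three preprocessing steps easier to follow, here is a minimal sketch on a toy table. The data frame `toy` and its values are made up for illustration and are not part of the Kaggle data. The sketch also selects numeric columns with `select_dtypes(include='number')` instead of the `dtypes != 'object'` test used above, on the assumption that this is equivalent for this kind of data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the house-price features.
toy = pd.DataFrame({
    'LotArea': [8000.0, 9600.0, np.nan, 11200.0],
    'MSZoning': ['RL', 'RM', np.nan, 'RL'],
})

# Step 1: standardize numeric columns; mean() and std() skip NaN by default.
numeric = toy.select_dtypes(include='number').columns
toy[numeric] = toy[numeric].apply(lambda x: (x - x.mean()) / x.std())

# Step 2: a standardized column has mean 0, so filling NaN with 0
# is the same as imputing the column mean.
toy[numeric] = toy[numeric].fillna(0)

# Step 3: one-hot encode the categorical column; dummy_na=True adds
# an extra indicator column for missing values.
toy = pd.get_dummies(toy, dummy_na=True)
print(toy)
```

The missing `LotArea` entry ends up as 0, and the missing `MSZoning` entry is flagged in its own indicator column rather than being dropped.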
# get train and test data
n_train = train_data.shape[0]
train_features = nd.array(all_features[