Kaggle in Practice: House Prices: Advanced Regression Techniques (Part 1)

Original article: www.qcloud.com

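Background

Supervised machine learning tasks mainly fall into two categories: classification and regression. In the previous article we worked through an example of classification with decision trees and random forests. This time we predict house prices and practice regression modeling in R.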

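Dataset

The competition chosen this time is at: www.kaggle.com/c/house-pri…

The competition provides 80 columns describing nearly 1,500 houses that have already been sold, and asks us to predict each house's sale price from them. The dataset has a large number of feature fields. Besides basics such as location, area, and number of floors, it includes features that buyers in China would hardly ever consider, such as the basement, the street frontage of the lot, and the exterior wall material. In China, where housing prices are so frenzied, location and floor area are basically enough to estimate a price.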

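Getting to Know the Data

Before building any model, we first look at how the data is distributed and where values are missing.

First download the training and test data into the directory D:/RData/House/, then combine the two. SalePrice is the house price field we want to predict.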

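# Read the training and test sets.
# stringsAsFactors = TRUE is required on R >= 4.0.0 so that text columns
# load as factors, which the str() output and the later code rely on.
train <- read.csv("D:/RData/House/train.csv", stringsAsFactors = TRUE)
test <- read.csv("D:/RData/House/test.csv", stringsAsFactors = TRUE)


# Combine the training and test sets; the test set has no SalePrice column, so add it as NA first
test$SalePrice <- NA
all <- rbind(train, test)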
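
The combine step can be tried on a toy example before touching the real files. This is only a minimal sketch: the data frames toy_train and toy_test and their values are made up for illustration and are not the Kaggle data.

```r
# Toy stand-ins for train.csv / test.csv (hypothetical values)
toy_train <- data.frame(Id = 1:3,
                        LotArea = c(8450, 9600, 11250),
                        SalePrice = c(208500, 181500, 223500))
toy_test <- data.frame(Id = 4:5,
                       LotArea = c(11622, 14267))

# rbind() on data frames needs matching column names,
# so the missing target column is added as NA first
toy_test$SalePrice <- NA
toy_all <- rbind(toy_train, toy_test)

dim(toy_all)                   # rows and columns of the combined frame
sum(is.na(toy_all$SalePrice))  # number of rows that came from the test set
```

Because only the test rows have an NA target, the same NA flag can later be used to split the combined frame again.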

First, take a look at each variable. There are many of them, so a detailed explanation of every variable is given in the attachment.

str(all)

Result:

'data.frame':   2919 obs. of  81 variables:
 $ Id   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
 $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
 $ LotArea  : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ Street   : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alley: Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
 $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
 $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Utilities: Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
 $ LotConfig: Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
 $ LandSlope: Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
 $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
 $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
 $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
 $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
 $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
 $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
 $ YearBuilt: int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
 $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
 $ RoofStyle: Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
 $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
 $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
 $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
 $ ExterQual: Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
 $ ExterCond: Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
 $ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
 $ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
 $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
 $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
 $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
 $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
 $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
 $ BsmtUnfSF: int  150 284 434 540 490 64 317 216 952 140 ...
 $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
 $ Heating  : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ HeatingQC: Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
 $ X1stFlrSF: int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
 $ X2ndFlrSF: int  854 0 866 756 1053 566 0 983 752 0 ...
 $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GrLivArea: int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
 $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
 $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
 $ FullBath : int  2 2 2 1 2 1 2 2 2 1 ...
 $ HalfBath : int  1 0 1 0 1 1 0 1 0 0 ...
 $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
 $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
 $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
 $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
 $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
 $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
 $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
 $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
 $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
 $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
 $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
 $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
 $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
 $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
 $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
 $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
 $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
 $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
 $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolArea : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC   : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
 $ Fence: Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
 $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
 $ MiscVal  : int  0 0 0 0 0 700 0 350 0 0 ...
 $ MoSold   : int  2 5 9 2 12 10 8 11 4 1 ...
 $ YrSold   : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
 $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
 $ SalePrice: int  208500 181500 223500 140000 250000 143000 307000 200000 1299

The variables fall into two main types: numeric (integer) and factor.

# Count the variables of each class (factor vs. integer)
res <- sapply(all, class )
table(res)

Result

 factor integer
 43      38

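Overall, the dataset has 81 variables and 2,919 records: 43 factor variables and 38 integer variables.

Feature Processing

The variable values above show that many variables in the dataset contain missing values, so the first step is to deal with them.

First, sort the variables by how many values each one is missing.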

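# Count the missing values in every variable
res <- sapply(all, function(x)  sum(is.na(x)) )

# Sort by missing count, largest first
miss <- sort(res, decreasing=TRUE)
miss[miss>0]

Output: only the variables with missing values are listed, and the share and meaning columns were added by hand.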

# Variable      Missing  Share  Meaning
PoolQC          2909     100%   # pool quality
MiscFeature     2814     96%    # miscellaneous features
Alley           2721     93%    # alley access to the property
Fence           2348     80%    # fence quality
FireplaceQu     1420     49%    # fireplace quality

LotFrontage     486      17%    # linear feet of street connected to the property

GarageYrBlt     159      5%     # garage
GarageFinish    159      5%
GarageQual      159      5%
GarageCond      159      5%
GarageType      157      5%

BsmtCond        82       3%     # basement
BsmtExposure    82       3%
BsmtQual        81       3%
BsmtFinType2    80       3%
BsmtFinType1    79       3%

MasVnrType      24       1%     # masonry veneer
MasVnrArea      23       1%

MSZoning        4        0%     # others
Utilities       2        0%
BsmtFullBath    2        0%
BsmtHalfBath    2        0%
Functional      2        0%
Exterior1st     1        0%
Exterior2nd     1        0%
BsmtFinSF1      1        0%
BsmtFinSF2      1        0%
BsmtUnfSF       1        0%
TotalBsmtSF     1        0%
Electrical      1        0%
KitchenQual     1        0%
GarageCars      1        0%
GarageArea      1        0%
SaleType        1        0%
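
The share column above was annotated by hand, but it can also be computed. The following is a minimal sketch on a made-up data frame (toy and its values are hypothetical); the same calls should apply to the combined data frame from this article.

```r
# Toy data frame with a few missing values (hypothetical values)
toy <- data.frame(
  PoolQC      = factor(c(NA, NA, NA, "Ex")),
  LotFrontage = c(65, NA, 68, 60),
  LotArea     = c(8450, 9600, 11250, 9550)
)

# Missing count per column, and the share of rows that are missing
na_count <- sapply(toy, function(x) sum(is.na(x)))
na_share <- colMeans(is.na(toy))

# Keep only columns with missing values, sorted by count (largest first)
report <- data.frame(missing = na_count, share_pct = round(100 * na_share))
report <- report[report$missing > 0, ]
report <- report[order(-report$missing), ]
print(report)
```

colMeans() works here because is.na() on a data frame returns a logical matrix, so each column mean is the fraction of NA entries.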

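Next, look at a summary of the variables that have missing values. SalePrice also appears here, because it is NA for every row from the test set.

# Summarize the variables that have missing data
summary(all[,names(miss)[miss>0]])

Result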

 PoolQC     MiscFeature  Alley        Fence        SalePrice      FireplaceQu
 Ex  :   4   Gar2:   5   Grvl: 120   GdPrv: 118   Min.   : 34900   Ex  :  43 
 Fa  :   2   Othr:   4   Pave:  78   GdWo : 112   1st Qu.:129975   Fa  :  74 
 Gd  :   4   Shed:  95   NA's:2721   MnPrv: 329   Median :163000   Gd  : 744 
 NA's:2909   TenC:   1               MnWw :  12   Mean   :180921   Po  :  46 
             NA's:2814               NA's :2348   3rd Qu.:214000   TA  : 592 
                                                  Max.   :755000   NA's:1420 
                                                  NA's   :1459   

  LotFrontage      GarageYrBlt   GarageFinish GarageQual  GarageCond
 Min.   : 21.00   Min.   :1895   Fin : 719    Ex  :   3   Ex  :   3 
 1st Qu.: 59.00   1st Qu.:1960   RFn : 811    Fa  : 124   Fa  :  74 
 Median : 68.00   Median :1979   Unf :1230    Gd  :  24   Gd  :  15 
 Mean   : 69.31   Mean   :1978   NA's: 159    Po  :   5   Po  :  14 
 3rd Qu.: 80.00   3rd Qu.:2002                TA  :2604   TA  :2654 
 Max.   :313.00   Max.   :2207                NA's: 159   NA's: 159 
 NA's   :486      NA's   :159                                       

   GarageType   BsmtCond    BsmtExposure BsmtQual    BsmtFinType2 BsmtFinType1
 2Types :  23   Fa  : 104   Av  : 418    Ex  : 258   ALQ :  52    ALQ :429   
 Attchd :1723   Gd  : 122   Gd  : 276    Fa  :  88   BLQ :  68    BLQ :269   
 Basment:  36   Po  :   5   Mn  : 239    Gd  :1209   GLQ :  34    GLQ :849   
 BuiltIn: 186   TA  :2606   No  :1904    TA  :1283   LwQ :  87    LwQ :154   
 CarPort:  15   NA's:  82   NA's:  82    NA's:  81   Rec : 105    Rec :288   
 Detchd : 779                                        Unf :2493    Unf :851   
 NA's   : 157                                        NA's:  80    NA's: 79   

   MasVnrType     MasVnrArea        MSZoning     Utilities     BsmtFullBath   
 BrkCmn :  25   Min.   :   0.0   C (all):  25   AllPub:2916   Min.   :0.0000 
 BrkFace: 879   1st Qu.:   0.0   FV     : 139   NoSeWa:   1   1st Qu.:0.0000 
 None   :1742   Median :   0.0   RH     :  26   NA's  :   2   Median :0.0000 
 Stone  : 249   Mean   : 102.2   RL     :2265                 Mean   :0.4299 
 NA's   :  24   3rd Qu.: 164.0   RM     : 460                 3rd Qu.:1.0000 
                Max.   :1600.0   NA's   :   4                 Max.   :3.0000 
                NA's   :23                                    NA's   :2     

  BsmtHalfBath       Functional    Exterior1st    Exterior2nd     BsmtFinSF1   
 Min.   :0.00000   Typ    :2717   VinylSd:1025   VinylSd:1014   Min.   :   0.0 
 1st Qu.:0.00000   Min2   :  70   MetalSd: 450   MetalSd: 447   1st Qu.:   0.0 
 Median :0.00000   Min1   :  65   HdBoard: 442   HdBoard: 406   Median : 368.5 
 Mean   :0.06136   Mod    :  35   Wd Sdng: 411   Wd Sdng: 391   Mean   : 441.4 
 3rd Qu.:0.00000   Maj1   :  19   Plywood: 221   Plywood: 270   3rd Qu.: 733.0 
 Max.   :2.00000   (Other):  11   (Other): 369   (Other): 390   Max.   :5644.0 
 NA's   :2         NA's   :   2   NA's   :   1   NA's   :   1   NA's   :1     

   BsmtFinSF2        BsmtUnfSF       TotalBsmtSF     Electrical   KitchenQual
 Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   FuseA: 188   Ex  : 205 
 1st Qu.:   0.00   1st Qu.: 220.0   1st Qu.: 793.0   FuseF:  50   Fa  :  70 
 Median :   0.00   Median : 467.0   Median : 989.5   FuseP:   8   Gd  :1151 
 Mean   :  49.58   Mean   : 560.8   Mean   :1051.8   Mix  :   1   TA  :1492 
 3rd Qu.:   0.00   3rd Qu.: 805.5   3rd Qu.:1302.0   SBrkr:2671   NA's:   1 
 Max.   :1526.00   Max.   :2336.0   Max.   :6110.0   NA's :   1             
 NA's   :1         NA's   :1        NA's   :1           

   GarageCars      GarageArea        SaleType   
 Min.   :0.000   Min.   :   0.0   WD     :2525 
 1st Qu.:1.000   1st Qu.: 320.0   New    : 239 
 Median :2.000   Median : 480.0   COD    :  87 
 Mean   :1.767   Mean   : 472.9   ConLD  :  26 
 3rd Qu.:2.000   3rd Qu.: 576.0   CWD    :  12 
 Max.   :5.000   Max.   :1488.0   (Other):  29 
 NA's   :1       NA's   :1        NA's   :   1

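Many variables have missing data. They can be handled in the following ways:

Drop variables with a large share of missing values

PoolQC, MiscFeature, Alley, Fence, and FireplaceQu are missing most often, because the house simply has no pool, miscellaneous feature, adjacent alley, fence, or fireplace. Since so many values are missing, we remove these variables outright.

# Drop the following variables
Drop <- names(all) %in% c("PoolQC","MiscFeature","Alley","Fence","FireplaceQu")
all <- all[!Drop]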
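
Instead of typing the column names by hand, the drop list can be derived from the missing share. This is only a sketch of that alternative, not the method used in this article: the data frame toy and the cutoff of 0.4 are assumptions chosen for illustration.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  PoolQC  = c(NA, NA, NA, "Ex"),
  Fence   = c(NA, "GdPrv", NA, NA),
  LotArea = c(8450, 9600, 11250, 9550)
)

threshold <- 0.4  # hypothetical cutoff: drop columns missing in more than 40% of rows

na_share <- colMeans(is.na(toy))
to_drop  <- names(na_share)[na_share > threshold]

# drop = FALSE keeps the result a data frame even if one column remains
toy_kept <- toy[, !(names(toy) %in% to_drop), drop = FALSE]
names(toy_kept)
```

A fixed cutoff is a blunt tool: a variable that is mostly NA may still carry signal (for example, whether a house has a pool at all), so the choice deserves a second look on real data.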

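Treat NA as a new factor level

The variable description file shows that the five garage variables GarageType, GarageYrBlt, GarageFinish, GarageQual, and GarageCond are likewise missing because the house has no garage.

Similarly, the five basement variables BsmtExposure, BsmtFinType2, BsmtQual, BsmtCond, and BsmtFinType1 are missing because the house has no basement.

Relatively few values are missing in these variables, so for the factor variables among them we simply replace the missing values with None.

# Fill the NA values of the following variables with None
Garage <- c("GarageType","GarageQual","GarageCond","GarageFinish")
Bsmt <- c("BsmtExposure","BsmtFinType2","BsmtQual","BsmtCond","BsmtFinType1")
for (x in c(Garage, Bsmt) )
{
  # Add "None" as a level first; assigning an unknown level to a factor only produces NA
  all[[x]] <- factor( all[[x]], levels= c(levels(all[[x]]),c('None')))
  all[[x]][is.na(all[[x]])] <- "None"
}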
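
Why the loop rebuilds each factor before assigning can be seen on a small example. This is a sketch on a made-up factor (garage and its values are hypothetical): assigning a value that is not an existing level leaves the entries as NA, while extending the levels first makes the assignment stick.

```r
# Toy factor with missing entries (hypothetical values)
garage <- factor(c("Attchd", NA, "Detchd", NA))

# Without the new level: R warns about an invalid factor level
# and the entries stay NA (the warning is silenced here)
bad <- garage
suppressWarnings(bad[is.na(bad)] <- "None")
sum(is.na(bad))   # the NAs are still there

# With the level added first, the replacement works
good <- factor(garage, levels = c(levels(garage), "None"))
good[is.na(good)] <- "None"
table(good)
```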

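GarageYrBlt is the year the garage was built; where it is missing, we substitute the year the house was built.

# Handle the garage year separately
all$GarageYrBlt[is.na(all$GarageYrBlt)] <- all$YearBuilt[is.na(all$GarageYrBlt)]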

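Fill in missing values manually

We go through the detailed data of the remaining variables one by one and handle each as follows.

Variable LotFrontage: the linear feet of street connected to the property

This is a numeric variable, so we fill it with the median.

# Fill with the median
all$LotFrontage[is.na(all$LotFrontage)] <- median(all$LotFrontage, na.rm = TRUE)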
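
A single global median ignores that lots in the same area tend to be similar. As a possible refinement, which is not what this article does, the median can be taken within each Neighborhood using ave(). The sketch below uses a made-up data frame (toy and its values are hypothetical) and compares both fills.

```r
# Toy data: two neighborhoods with different typical frontage (hypothetical values)
toy <- data.frame(
  Neighborhood = c("A", "A", "A", "B", "B", "B"),
  LotFrontage  = c(60, NA, 70, 100, 120, NA)
)

# Global median fill, as in the article
global_fill <- toy$LotFrontage
global_fill[is.na(global_fill)] <- median(toy$LotFrontage, na.rm = TRUE)

# Per-neighborhood median: ave() returns the group statistic aligned to each row
group_median <- ave(toy$LotFrontage, toy$Neighborhood,
                    FUN = function(v) median(v, na.rm = TRUE))
group_fill <- ifelse(is.na(toy$LotFrontage), group_median, toy$LotFrontage)

data.frame(toy, global_fill, group_fill)
```

If a group has no observed value at all, its median is NA, so a fallback to the global median would still be needed in that case.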

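Variable MasVnrType: masonry veneer type

This variable probably has little effect on price, so the NAs in MasVnrType are replaced with its own existing level None.

# Fill with None
all[["MasVnrType"]][is.na(all[["MasVnrType"]])] <- "None"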

Variable MasVnrArea: masonry veneer area

These missing values correspond to the None value of MasVnrType, so the NAs should be replaced with 0.

# Fill with 0
all[["MasVnrArea"]][is.na(all[["MasVnrArea"]])] <- 0

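Variable Utilities has almost no discriminating power (in the summary above, all but one of the non-missing records are AllPub), so we drop it.

# Drop the variable Utilities
all$Utilities <- NULL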

Variables BsmtFullBath, BsmtHalfBath, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, GarageCars, and GarageArea are missing because the corresponding facility does not exist. They are all numeric variables, so we fill them with 0.

# The facility does not exist, so the quantity is missing; fill with 0
Param0 <- c("BsmtFullBath","BsmtHalfBath","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","GarageCars","GarageArea")
for (x in Param0 )    all[[x]][is.na(all[[x]])] <- 0

Variables MSZoning, Functional, Exterior1st, Exterior2nd, KitchenQual, Electrical, and SaleType
These are all factor variables with only a few missing values each, so we fill them with the most frequent level.

# Fill with the most frequent level
Req <- c("MSZoning","Functional","Exterior1st","Exterior2nd","KitchenQual","Electrical","SaleType")
for (x in Req )    all[[x]][is.na(all[[x]])] <- levels(all[[x]])[which.max(table(all[[x]]))]
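
The one-line loop above packs several steps together. Unpacked into a small helper, the idea looks like this; fill_with_mode is a hypothetical name introduced only for this sketch, and the factor values are made up.

```r
# Hypothetical helper: replace NA in a factor with its most frequent level
fill_with_mode <- function(f) {
  stopifnot(is.factor(f))
  counts <- table(f)                        # table() ignores NA by default
  f[is.na(f)] <- names(counts)[which.max(counts)]
  f
}

# Toy factor (hypothetical values)
kitchen <- factor(c("TA", "Gd", "TA", NA, "Ex", "TA"))
filled <- fill_with_mode(kitchen)
table(filled)
```

Note that which.max() returns the first maximum, so ties are broken in favor of the level that comes first in level order.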

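Building the Training Set

After this series of fixes, 75 variables remain, and none of them has missing data except SalePrice, which is intentionally NA for the test rows.
We use whether SalePrice is NA to split the dataset back into a training set and a test set, ready for model training.

# Use whether SalePrice is NA to separate the training and test sets
train <- all[!is.na(all$SalePrice), ]
test <- all[is.na(all$SalePrice), ]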
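
Before moving on to modeling, a few cheap assertions can confirm that the split is complete and that no predictor still contains NA. This is a minimal sketch on a made-up frame (toy_all and its values are hypothetical); the same checks should carry over to the real data.

```r
# Toy combined frame: the last two rows play the role of the test set (hypothetical values)
toy_all <- data.frame(
  Id        = 1:5,
  LotArea   = c(8450, 9600, 11250, 11622, 14267),
  SalePrice = c(208500, 181500, 223500, NA, NA)
)

toy_train <- toy_all[!is.na(toy_all$SalePrice), ]
toy_test  <- toy_all[is.na(toy_all$SalePrice), ]

# Every column except the target is a predictor
predictors <- setdiff(names(toy_all), "SalePrice")

# The split must cover every row, and no predictor may contain NA
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy_all))
stopifnot(sum(is.na(toy_train)) == 0)
stopifnot(sum(is.na(toy_test[predictors])) == 0)
```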

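Regression Models

A central problem in linear regression is choosing the independent variables. Selecting the features that are most strongly related to the response variable is the first step toward a successful model. There are many ways to select variables; the most important, and also the most direct, is for the analyst to screen them by hand based on the business context.
We try this approach first, as the first step of our model.

Continued in the next part: "Kaggle in Practice - House Prices: Advanced Regression Techniques (2)"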