
# An MVP Approach

## 1. Introduction

### 1.2 Libraries and Their Roles

The Python version is 3.6.1; please import the libraries listed below (Anaconda is recommended). Note the `%matplotlib inline` directive, which makes plots render inline in the IPython Notebook.

The import code is as follows:

%matplotlib inline
import pandas as pd
from datetime import datetime
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.linear_model import LinearRegression, Ridge, BayesianRidge
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import mean_squared_error
from math import radians, cos, sin, asin, sqrt
import seaborn as sns
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [16, 10]

### 1.4 Data Exploration

#### 1.4.1 File Structure

Let's start off by exploring the data we just loaded, using Pandas' `head` and `describe` functions:
In [3]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
train.head()
id vendor_id pickup_datetime dropoff_datetime passenger_count ……
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 ……
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 ……
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 ……
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 ……
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 ……

The call `pd.set_option('display.float_format', lambda x: '%.3f' % x)` controls how floating-point values are displayed — here, with three decimal places. See the pandas `set_option` documentation for more details.
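As a quick illustration of this display option (toy frame, not from the dataset):

```python
import pandas as pd

# Set three-decimal float display, as in the notebook above
pd.set_option('display.float_format', lambda x: '%.3f' % x)
df = pd.DataFrame({'value': [3.14159265]})
print(df)  # the float is rendered as 3.142
```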

#### 1.4.2 Data Statistics

train.describe()
vendor_id passenger_count pickup_longitude pickup_latitude dropoff_longitude ……
count 1458644.000 1458644.000 1458644.000 1458644.000 1458644.000 ……
mean 1.535 1.665 -73.973 40.751 -73.973 ……
std 0.499 1.314 0.071 0.033 0.071 ……
min 1.000 0.000 -121.933 34.360 -121.933 ……
25% 1.000 1.000 -73.992 40.737 -73.991 ……
50% 2.000 1.000 -73.982 40.754 -73.980 ……
75% 2.000 2.000 -73.967 40.768 -73.963 ……
max 2.000 9.000 -61.336 51.881 -61.336 ……

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
id                    1458644 non-null object
vendor_id             1458644 non-null int64
pickup_datetime       1458644 non-null object
dropoff_datetime      1458644 non-null object
passenger_count       1458644 non-null int64
pickup_longitude      1458644 non-null float64
pickup_latitude       1458644 non-null float64
dropoff_longitude     1458644 non-null float64
dropoff_latitude      1458644 non-null float64
store_and_fwd_flag    1458644 non-null object
trip_duration         1458644 non-null int64
dtypes: float64(4), int64(3), object(4)
memory usage: 122.4+ MB

## 2. Data Preprocessing

vendor_id is similar to store_and_fwd_flag. We can see there are two taxi vendors, 1 and 2, and neither of them helps with finding "shortest route information", because expecting a vendor's drivers to find the optimal route across New York is next to impossible. Still, it can stay on the list of candidate features (worth at least a look before ruling it out). As for store_and_fwd_flag — the flag marking trips during which there was no connection to the server — it, too, may tell us a few things. For example, if we check and find a strong correlation between trip duration and losing the server connection, this feature could help the model predict the duration of such trips.
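The hypothesis in this paragraph can be checked with a simple groupby; here is a sketch on a toy stand-in for the real train frame (the flag values and durations are made up):

```python
import pandas as pd

# Toy stand-in for the real train frame; all values are hypothetical
toy = pd.DataFrame({
    'store_and_fwd_flag': ['N', 'N', 'Y', 'Y', 'N'],
    'trip_duration':      [600, 700, 1200, 1100, 650],
})
# Mean duration per flag value; a large gap would hint at a useful feature
print(toy.groupby('store_and_fwd_flag')['trip_duration'].mean())
```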

### 2.1 Cleaning the Trip Duration Data

m = np.mean(train['trip_duration'])
s = np.std(train['trip_duration'])
train = train[train['trip_duration'] <= m + 2*s]
train = train[train['trip_duration'] >= m - 2*s]
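The filter above keeps only trips whose duration lies within two standard deviations of the mean; the same rule applied to toy data (made-up values):

```python
import numpy as np
import pandas as pd

# Ten ordinary durations plus one extreme outlier (all values made up)
df = pd.DataFrame({'trip_duration': [600, 620, 640, 660, 680,
                                     700, 650, 630, 610, 670, 100000]})
m = np.mean(df['trip_duration'])
s = np.std(df['trip_duration'])
kept = df[(df['trip_duration'] <= m + 2*s) & (df['trip_duration'] >= m - 2*s)]
print(len(kept))  # 10 — the 100000-second trip is dropped
```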

### 2.2 Cleaning the Coordinate Data

city_long_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)

train = train[train['pickup_longitude'] <= -73.75]
train = train[train['pickup_longitude'] >= -74.03]
train = train[train['pickup_latitude'] <= 40.85]
train = train[train['pickup_latitude'] >= 40.63]
train = train[train['dropoff_longitude'] <= -73.75]
train = train[train['dropoff_longitude'] >= -74.03]
train = train[train['dropoff_latitude'] <= 40.85]
train = train[train['dropoff_latitude'] >= 40.63]
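The eight chained filters above can be collapsed into a single mask with `Series.between` (inclusive on both ends, matching the `<=`/`>=` comparisons). An equivalent rewrite — the helper name is mine, not from the original:

```python
import pandas as pd

city_long_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)

def within_nyc_borders(df):
    # True only where both trip endpoints fall inside the bounding box
    mask = (df['pickup_longitude'].between(*city_long_border)
            & df['pickup_latitude'].between(*city_lat_border)
            & df['dropoff_longitude'].between(*city_long_border)
            & df['dropoff_latitude'].between(*city_lat_border))
    return df[mask]

# Toy frame: first row inside the box, second row far outside
toy = pd.DataFrame({'pickup_longitude': [-73.98, -121.93],
                    'pickup_latitude': [40.75, 34.36],
                    'dropoff_longitude': [-73.97, -121.93],
                    'dropoff_latitude': [40.76, 34.36]})
print(len(within_nyc_borders(toy)))  # 1
```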

### 2.3 Cleaning the Date Data

train['pickup_datetime'] = pd.to_datetime(train.pickup_datetime)
test['pickup_datetime'] = pd.to_datetime(test.pickup_datetime)
train.loc[:, 'pickup_date'] = train['pickup_datetime'].dt.date
test.loc[:, 'pickup_date'] = test['pickup_datetime'].dt.date
train['dropoff_datetime'] = pd.to_datetime(train.dropoff_datetime) #Not in Test

## 3. Data Visualization and Analysis

### 3.1 Preliminary Analysis

plt.hist(train['trip_duration'].values, bins=100)
plt.xlabel('trip_duration')
plt.ylabel('number of train records')
plt.show()

train['log_trip_duration'] = np.log(train['trip_duration'].values + 1)
plt.hist(train['log_trip_duration'].values, bins=100)
plt.xlabel('log(trip_duration)')
plt.ylabel('number of train records')
plt.show()
sns.distplot(train["log_trip_duration"], bins =100)
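A side note on the transform above: `np.log(x + 1)` is the same function as `np.log1p(x)`, and the prediction step at the end of the notebook inverts it with `np.exp(pred) - 1` (equivalently `np.expm1`):

```python
import numpy as np

durations = np.array([0.0, 59.0, 3600.0])
log_d = np.log(durations + 1)
# log1p is the numerically stable spelling of the same transform
assert np.allclose(log_d, np.log1p(durations))
# expm1 undoes it exactly, recovering the original durations
assert np.allclose(np.expm1(log_d), durations)
```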

plt.plot(train.groupby('pickup_date').count()[['id']], 'o-', label='train')
plt.plot(test.groupby('pickup_date').count()[['id']], 'o-', label='test')
plt.title('Trips over Time.')
plt.legend(loc=0)
plt.ylabel('Trips')
plt.show()

import warnings
warnings.filterwarnings("ignore")
plot_vendor = train.groupby('vendor_id')['trip_duration'].mean()
plt.subplots(1,1,figsize=(17,10))
plt.ylim(800, 840)
sns.barplot(plot_vendor.index, plot_vendor.values)
plt.title('Time per Vendor')
plt.ylabel('Time in Seconds')

snwflag = train.groupby('store_and_fwd_flag')['trip_duration'].mean()

plt.subplots(1,1,figsize=(17,10))
plt.ylim(0, 1100)
plt.title('Time per store_and_fwd_flag')
plt.ylabel('Time in Seconds')
sns.barplot(snwflag.index, snwflag.values)

pc = train.groupby('passenger_count')['trip_duration'].mean()

plt.subplots(1,1,figsize=(17,10))
plt.ylim(0, 1100)
plt.title('Time per passenger_count')
plt.ylabel('Time in Seconds')
sns.barplot(pc.index, pc.values)

train.groupby('passenger_count').size()

Out[15]:

passenger_count
0         52
1    1018715
2     206864
3      58989
4      27957
5      76912
6      47639
dtype: int64

In [16]:

test.groupby('passenger_count').size()

Out[16]:

passenger_count
0        23
1    443447
2     90027
3     25686
4     12017
5     33411
6     20521
9         2
dtype: int64

### 3.2 Mapping the Coordinates

#### 3.2.1 Pickup Locations

In [17]:

city_long_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)
fig, ax = plt.subplots(ncols=2, sharex=True, sharey=True)
ax[0].scatter(train['pickup_longitude'].values[:100000], train['pickup_latitude'].values[:100000],
              color='blue', s=1, label='train', alpha=0.1)
ax[1].scatter(test['pickup_longitude'].values[:100000], test['pickup_latitude'].values[:100000],
              color='green', s=1, label='test', alpha=0.1)
fig.suptitle('Train and test area complete overlap.')
ax[0].legend(loc=0)
ax[0].set_ylabel('latitude')
ax[0].set_xlabel('longitude')
ax[1].set_xlabel('longitude')
ax[1].legend(loc=0)
plt.ylim(city_lat_border)
plt.xlim(city_long_border)
plt.show()

#### 3.2.2 Distance and Bearing

def haversine_array(lat1, lng1, lat2, lng2):
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    AVG_EARTH_RADIUS = 6371  # in km
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    return h

def dummy_manhattan_distance(lat1, lng1, lat2, lng2):
    # Approximate a Manhattan-style distance as a north-south leg plus an east-west leg
    a = haversine_array(lat1, lng1, lat1, lng2)
    b = haversine_array(lat1, lng1, lat2, lng1)
    return a + b

def bearing_array(lat1, lng1, lat2, lng2):
    lng_delta_rad = np.radians(lng2 - lng1)
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    y = np.sin(lng_delta_rad) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
    return np.degrees(np.arctan2(y, x))

train.loc[:, 'distance_haversine'] = haversine_array(train['pickup_latitude'].values, train['pickup_longitude'].values, train['dropoff_latitude'].values, train['dropoff_longitude'].values)
test.loc[:, 'distance_haversine'] = haversine_array(test['pickup_latitude'].values, test['pickup_longitude'].values, test['dropoff_latitude'].values, test['dropoff_longitude'].values)

train.loc[:, 'distance_dummy_manhattan'] =  dummy_manhattan_distance(train['pickup_latitude'].values, train['pickup_longitude'].values, train['dropoff_latitude'].values, train['dropoff_longitude'].values)
test.loc[:, 'distance_dummy_manhattan'] =  dummy_manhattan_distance(test['pickup_latitude'].values, test['pickup_longitude'].values, test['dropoff_latitude'].values, test['dropoff_longitude'].values)

train.loc[:, 'direction'] = bearing_array(train['pickup_latitude'].values, train['pickup_longitude'].values, train['dropoff_latitude'].values, train['dropoff_longitude'].values)
test.loc[:, 'direction'] = bearing_array(test['pickup_latitude'].values, test['pickup_longitude'].values, test['dropoff_latitude'].values, test['dropoff_longitude'].values)
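A quick sanity check of `haversine_array` (restated here so the snippet is self-contained): the distance from a point to itself is zero, and one degree of latitude is roughly 111 km.

```python
import numpy as np

AVG_EARTH_RADIUS = 6371  # in km

def haversine_array(lat1, lng1, lat2, lng2):
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    return 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))

# Distance from a point to itself is zero
print(haversine_array(40.75, -73.97, 40.75, -73.97))  # 0.0
# One degree of latitude, same longitude: about 111 km
print(haversine_array(40.0, -74.0, 41.0, -74.0))
```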

#### 3.2.3 Creating "Neighborhoods"

• Build the stack of coordinates
• Configure the KMeans clustering parameters
• Create the actual clusters

coords = np.vstack((train[['pickup_latitude', 'pickup_longitude']].values,
                    train[['dropoff_latitude', 'dropoff_longitude']].values))
sample_ind = np.random.permutation(len(coords))[:500000]
kmeans = MiniBatchKMeans(n_clusters=100, batch_size=10000).fit(coords[sample_ind])
train.loc[:, 'pickup_cluster'] = kmeans.predict(train[['pickup_latitude', 'pickup_longitude']])
train.loc[:, 'dropoff_cluster'] = kmeans.predict(train[['dropoff_latitude', 'dropoff_longitude']])
test.loc[:, 'pickup_cluster'] = kmeans.predict(test[['pickup_latitude', 'pickup_longitude']])
test.loc[:, 'dropoff_cluster'] = kmeans.predict(test[['dropoff_latitude', 'dropoff_longitude']])

fig, ax = plt.subplots(ncols=1, nrows=1)
ax.scatter(train.pickup_longitude.values[:500000], train.pickup_latitude.values[:500000], s=10, lw=0,
           c=train.pickup_cluster[:500000].values, cmap='autumn', alpha=0.2)
ax.set_xlim(city_long_border)
ax.set_ylim(city_lat_border)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()
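The clustering step above can be seen in miniature on synthetic data (toy blobs, reduced `n_clusters`; none of this is from the dataset):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
# Two artificial blobs standing in for pickup coordinates (toy data)
blob_a = rng.normal([40.75, -73.98], 0.01, (50, 2))
blob_b = rng.normal([40.65, -73.80], 0.01, (50, 2))
pts = np.vstack((blob_a, blob_b))

km = MiniBatchKMeans(n_clusters=2, batch_size=20, random_state=0).fit(pts)
labels = km.predict(pts)
# Each blob lands in its own "neighborhood"
print(labels[0] != labels[50])  # True
```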

### 3.3 Extracting Date Information

In [24]:

#Extracting Month
train['Month'] = train['pickup_datetime'].dt.month
test['Month'] = test['pickup_datetime'].dt.month
In [25]:
train.groupby('Month').size(),test.groupby('Month').size()

Out[25]:

(Month
1    226444
2    235054
3    252443
4    247855
5    244591
6    230741
dtype: int64, Month
1     97676
2    102314
3    109697
4    107432
5    107570
6    100445
dtype: int64)

train['DayofMonth'] = train['pickup_datetime'].dt.day
test['DayofMonth'] = test['pickup_datetime'].dt.day
len(train.groupby('DayofMonth').size()),len(test.groupby('DayofMonth').size())

Out[26]:

(31, 31)

In [27]:

train['Hour'] = train['pickup_datetime'].dt.hour
test['Hour'] = test['pickup_datetime'].dt.hour
len(train.groupby('Hour').size()),len(test.groupby('Hour').size())

Out[27]:

(24, 24)

train['dayofweek'] = train['pickup_datetime'].dt.dayofweek
test['dayofweek'] = test['pickup_datetime'].dt.dayofweek
len(train.groupby('dayofweek').size()),len(test.groupby('dayofweek').size())

Out[28]:

(7, 7)

In [29]:

train.loc[:, 'avg_speed_h'] = 1000 * train['distance_haversine'] / train['trip_duration']
train.loc[:, 'avg_speed_m'] = 1000 * train['distance_dummy_manhattan'] / train['trip_duration']
fig, ax = plt.subplots(ncols=3, sharey=True)
ax[0].plot(train.groupby('Hour').mean()['avg_speed_h'], 'bo-', lw=2, alpha=0.7)
ax[1].plot(train.groupby('dayofweek').mean()['avg_speed_h'], 'go-', lw=2, alpha=0.7)
ax[2].plot(train.groupby('Month').mean()['avg_speed_h'], 'ro-', lw=2, alpha=0.7)
ax[0].set_xlabel('Hour of Day')
ax[1].set_xlabel('Day of Week')
ax[2].set_xlabel('Month of Year')
ax[0].set_ylabel('Average Speed')
fig.suptitle('Average Traffic Speed by Date-part')
plt.show()
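A note on units in the speed features above: `distance_haversine` is in kilometers and `trip_duration` in seconds, so `1000 * km / s` yields meters per second. Standalone arithmetic with hypothetical values:

```python
# Hypothetical trip: 2 km haversine distance covered in 600 seconds
distance_km = 2.0
duration_s = 600.0
avg_speed = 1000 * distance_km / duration_s  # meters per second
print(avg_speed)  # ~3.33 m/s, i.e. about 12 km/h
```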

## 4. Data Augmentation and Dummy Variables

### 4.1 Data Augmentation

fr1 = pd.read_csv('../input/new-york-city-taxi-with-osrm/fastest_routes_train_part_1.csv', usecols=['id', 'total_distance', 'total_travel_time', 'number_of_steps'])
fr2 = pd.read_csv('../input/new-york-city-taxi-with-osrm/fastest_routes_train_part_2.csv', usecols=['id', 'total_distance', 'total_travel_time', 'number_of_steps'])
test_street_info = pd.read_csv('../input/new-york-city-taxi-with-osrm/fastest_routes_test.csv',
                               usecols=['id', 'total_distance', 'total_travel_time', 'number_of_steps'])
train_street_info = pd.concat((fr1, fr2))
train = train.merge(train_street_info, how='left', on='id')
test = test.merge(test_street_info, how='left', on='id')

In [32]:

train.shape, test.shape

Out[32]:

((1437128, 29), (625134, 22))

### 4.2 Creating Dummy Variables

In [33]:

vendor_train = pd.get_dummies(train['vendor_id'], prefix='vi', prefix_sep='_')
vendor_test = pd.get_dummies(test['vendor_id'], prefix='vi', prefix_sep='_')
passenger_count_train = pd.get_dummies(train['passenger_count'], prefix='pc', prefix_sep='_')
passenger_count_test = pd.get_dummies(test['passenger_count'], prefix='pc', prefix_sep='_')
store_and_fwd_flag_train = pd.get_dummies(train['store_and_fwd_flag'], prefix='sf', prefix_sep='_')
store_and_fwd_flag_test = pd.get_dummies(test['store_and_fwd_flag'], prefix='sf', prefix_sep='_')
cluster_pickup_train = pd.get_dummies(train['pickup_cluster'], prefix='p', prefix_sep='_')
cluster_pickup_test = pd.get_dummies(test['pickup_cluster'], prefix='p', prefix_sep='_')
cluster_dropoff_train = pd.get_dummies(train['dropoff_cluster'], prefix='d', prefix_sep='_')
cluster_dropoff_test = pd.get_dummies(test['dropoff_cluster'], prefix='d', prefix_sep='_')

month_train = pd.get_dummies(train['Month'], prefix='m', prefix_sep='_')
month_test = pd.get_dummies(test['Month'], prefix='m', prefix_sep='_')
dom_train = pd.get_dummies(train['DayofMonth'], prefix='dom', prefix_sep='_')
dom_test = pd.get_dummies(test['DayofMonth'], prefix='dom', prefix_sep='_')
hour_train = pd.get_dummies(train['Hour'], prefix='h', prefix_sep='_')
hour_test = pd.get_dummies(test['Hour'], prefix='h', prefix_sep='_')
dow_train = pd.get_dummies(train['dayofweek'], prefix='dow', prefix_sep='_')
dow_test = pd.get_dummies(test['dayofweek'], prefix='dow', prefix_sep='_')

In [34]:

vendor_train.shape,vendor_test.shape

Out[34]:

((1437128, 2), (625134, 2))

In [35]:

passenger_count_train.shape,passenger_count_test.shape

Out[35]:

((1437128, 7), (625134, 8))

In [36]:

store_and_fwd_flag_train.shape,store_and_fwd_flag_test.shape

Out[36]:

((1437128, 2), (625134, 2))

In [37]:

cluster_pickup_train.shape,cluster_pickup_test.shape

Out[37]:

((1437128, 100), (625134, 100))

In [38]:

cluster_dropoff_train.shape,cluster_dropoff_test.shape

Out[38]:

((1437128, 100), (625134, 100))

In [39]:

month_train.shape,month_test.shape

Out[39]:

((1437128, 6), (625134, 6))

In [40]:

dom_train.shape,dom_test.shape

Out[40]:

((1437128, 31), (625134, 31))

In [41]:

hour_train.shape,hour_test.shape

Out[41]:

((1437128, 24), (625134, 24))

In [42]:

dow_train.shape,dow_test.shape

Out[42]:

((1437128, 7), (625134, 7))

In [43]:

passenger_count_test = passenger_count_test.drop('pc_9', axis = 1)
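`pc_9` exists only in the test dummies because `passenger_count = 9` never occurs in train (see the value counts in section 3.1), so the column must be dropped to align the two feature sets. A minimal reproduction of the mismatch:

```python
import pandas as pd

train_pc = pd.Series([1, 2, 1])
test_pc = pd.Series([1, 2, 9])  # 9 appears only in test
d_train = pd.get_dummies(train_pc, prefix='pc', prefix_sep='_')
d_test = pd.get_dummies(test_pc, prefix='pc', prefix_sep='_')
print(sorted(set(d_test.columns) - set(d_train.columns)))  # ['pc_9']
d_test = d_test.drop('pc_9', axis=1)  # now both sides have identical columns
```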

In [44]:

train = train.drop(['id','vendor_id','passenger_count','store_and_fwd_flag','Month','DayofMonth','Hour','dayofweek',
'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude'],axis = 1)
Test_id = test['id']
test = test.drop(['id','vendor_id','passenger_count','store_and_fwd_flag','Month','DayofMonth','Hour','dayofweek',
'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude'], axis = 1)

train = train.drop(['dropoff_datetime','avg_speed_h','avg_speed_m','pickup_lat_bin','pickup_long_bin','trip_duration'], axis = 1)

train.shape,test.shape

Out[45]:

((1437128, 11), (625134, 10))

Train_Master = pd.concat([train,
vendor_train,
passenger_count_train,
store_and_fwd_flag_train,
cluster_pickup_train,
cluster_dropoff_train,
month_train,
dom_train,
hour_train,
dow_train
], axis=1)

In [47]:

Test_master = pd.concat([test,
vendor_test,
passenger_count_test,
store_and_fwd_flag_test,
cluster_pickup_test,
cluster_dropoff_test,
month_test,
dom_test,
hour_test,
dow_test], axis=1)

In [48]:

Train_Master.shape,Test_master.shape

Out[48]:

((1437128, 290), (625134, 289))

Train_Master = Train_Master.drop(['pickup_datetime','pickup_date'],axis = 1)
Test_master = Test_master.drop(['pickup_datetime','pickup_date'],axis = 1)

In [50]:

Train_Master.shape,Test_master.shape

Out[50]:

((1437128, 288), (625134, 287))

In [51]:

Train, Test = train_test_split(Train_Master[0:100000], test_size = 0.2)

In [52]:

X_train = Train.drop(['log_trip_duration'], axis=1)
Y_train = Train["log_trip_duration"]
X_test = Test.drop(['log_trip_duration'], axis=1)
Y_test = Test["log_trip_duration"]

Y_test = Y_test.reset_index().drop('index',axis = 1)
Y_train = Y_train.reset_index().drop('index',axis = 1)

In [53]:

dtrain = xgb.DMatrix(X_train, label=Y_train)
dvalid = xgb.DMatrix(X_test, label=Y_test)
dtest = xgb.DMatrix(Test_master)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

## 5. XGBoost: Model Training and Accuracy Testing

In [54]:

#md = [6]
#lr = [0.1,0.3]
#mcw = [20,25,30]
#for m in md:
#    for l in lr:
#        for n in mcw:
#            t0 = datetime.now()
#            xgb_pars = {'min_child_weight': n, 'eta': l, 'colsample_bytree': 0.9,
#                        'max_depth': m,
#            'subsample': 0.9, 'lambda': 1., 'nthread': -1, 'booster' : 'gbtree', 'silent': 1,
#            'eval_metric': 'rmse', 'objective': 'reg:linear'}
#            model = xgb.train(xgb_pars, dtrain, 50, watchlist, early_stopping_rounds=10,
#                  maximize=False, verbose_eval=1)
[0] train-rmse:5.70042 valid-rmse:5.69993
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.
Will train until valid-rmse hasn't improved in 10 rounds.
[11] train-rmse: 3.25699 valid-rmse: 3.25698
...
...
[89] train-rmse:0.335358 valid-rmse:0.345624
[90] train-rmse:0.334614 valid-rmse:0.344972
[91] train-rmse:0.333921 valid-rmse:0.344405

In [55]:

xgb_pars = {'min_child_weight': 1, 'eta': 0.5, 'colsample_bytree': 0.9,
'max_depth': 6,
'subsample': 0.9, 'lambda': 1., 'nthread': -1, 'booster' : 'gbtree', 'silent': 1,
'eval_metric': 'rmse', 'objective': 'reg:linear'}
model = xgb.train(xgb_pars, dtrain, 10, watchlist, early_stopping_rounds=2,
maximize=False, verbose_eval=1)
print('Modeling RMSLE %.5f' % model.best_score)
[0]    train-rmse:3.02558  valid-rmse:3.01507
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.
Will train until valid-rmse hasn't improved in 2 rounds.
[1]    train-rmse:1.55977  valid-rmse:1.55301
[2]    train-rmse:0.862101 valid-rmse:0.858751
[3]    train-rmse:0.56334  valid-rmse:0.564431
[4]    train-rmse:0.456502 valid-rmse:0.461474
[5]    train-rmse:0.421733 valid-rmse:0.430346
[6]    train-rmse:0.410094 valid-rmse:0.421733
[7]    train-rmse:0.404835 valid-rmse:0.418769
[8]    train-rmse:0.40078  valid-rmse:0.417905
[9]    train-rmse:0.398338 valid-rmse:0.417041
Modeling RMSLE 0.41704

To push the score further, one could try:

• More than 10 boosting iterations
• A lower eta value
• A larger maximum depth
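Those three knobs map onto `xgb_pars` roughly as follows (a sketch only; the values are illustrative, not tuned):

```python
# Illustrative tweaks, not tuned values: more rounds, lower eta, deeper trees
xgb_pars_tuned = {'min_child_weight': 1, 'eta': 0.1, 'colsample_bytree': 0.9,
                  'max_depth': 10, 'subsample': 0.9, 'lambda': 1.,
                  'nthread': -1, 'booster': 'gbtree', 'silent': 1,
                  'eval_metric': 'rmse', 'objective': 'reg:linear'}
# model = xgb.train(xgb_pars_tuned, dtrain, 100, watchlist,
#                   early_stopping_rounds=10, maximize=False, verbose_eval=1)
```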

In [56]:

xgb.plot_importance(model, max_num_features=28, height=0.7)

Out[56]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fe81a457ef0>

In [57]:

pred = model.predict(dtest)
pred = np.exp(pred) - 1  # invert log(x + 1): back to seconds

#### Submitting the Results

In [58]:

submission = pd.concat([Test_id, pd.DataFrame(pred)], axis=1)
submission.columns = ['id','trip_duration']
submission['trip_duration'] = submission.apply(lambda x : 1 if (x['trip_duration'] <= 0) else x['trip_duration'], axis = 1)
submission.to_csv("submission.csv", index=False)
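The row-wise `apply` that clamps non-positive predictions to 1 has a vectorized equivalent via `Series.where`; a sketch on a toy series (not the real predictions):

```python
import pandas as pd

s = pd.Series([-5.0, 0.0, 42.0])
# Row-wise clamp, as in the submission cell above
clamped = s.apply(lambda v: 1 if v <= 0 else v)
# Vectorized equivalent: keep values > 0, replace the rest with 1
vectorized = s.where(s > 0, 1)
print(clamped.tolist() == vectorized.tolist())  # True
```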