airbnb-challenge

Airbnb Algorithm Challenge Report

The challenge is to predict whether a listing will be booked (dim_is_requested) 30 days out. The dataset contains 125k records with 45 features, collected in 2015. Since the dataset is not that large, there is no need for distributed computing; I simply work through the general modeling process on my laptop with pandas, scikit-learn, and XGBoost.

Data Exploration

I explore the dataset before running any experiments, in order to remove abnormalities and clean the data. The abnormalities include duplicates, noisy labels, and outliers.

Dupes

First of all, let’s remove 131 duplicated records.

import pandas as pd
import numpy as np
import math

df = pd.read_csv('TH_data_challenge.tsv', sep='\t')
df.duplicated(keep='first').sum()
df = df.drop_duplicates(keep='first')

Noises

Noise in this case would be records with identical features but contradictory labels; there appear to be no such records in this dataset.

df_no_noises = df.drop(columns =['dim_is_requested'])
df_no_noises.duplicated(keep='first').sum()
0

Outliers

Most outliers here are likely due to system issues, such as an abnormally low price (<= 0):

p_check = [.0001,.10, .25, .5, .75, .9 , .9999]
df.describe(percentiles = p_check)
# I assume kdt_n exclude itself
df_no_outliers = df[(df.m_effective_daily_price > 0)]
df_no_outliers.describe(percentiles = p_check)
Selected columns of df_no_outliers.describe(percentiles=p_check); the full output is 12 rows × 38 columns:

|        | m_effective_daily_price | m_pricing_cleaning_fee | dim_person_capacity | m_checkouts | m_reviews | days_since_last_booking |
|--------|------------------------:|-----------------------:|--------------------:|------------:|----------:|------------------------:|
| count  | 184086 | 184086 | 184086 | 183899 | 183899 | 146333 |
| mean   | 149.497835 | 38.035943 | 3.265952 | 18.473613 | 10.979075 | 69.202381 |
| std    | 272.345131 | 50.003168 | 2.009910 | 32.318123 | 20.026387 | 123.562999 |
| min    | 0.043935 | 0 | 1 | 0 | 0 | 0 |
| 0.01%  | 0.045791 | 0 | 1 | 0 | 0 | 0 |
| 10%    | 51.156967 | 0 | 2 | 0 | 0 | 1 |
| 25%    | 71.386212 | 0 | 2 | 1 | 0 | 5 |
| 50%    | 100 | 25 | 2 | 6 | 3 | 20 |
| 75%    | 152.468910 | 53.947735 | 4 | 22 | 12 | 72 |
| 90%    | 250 | 100 | 6 | 53 | 31 | 197 |
| 99.99% | 9000 | 450 | 16 | 352.8306 | 234.6102 | 1012.1004 |
| max    | 12995 | 800 | 16 | 432 | 280 | 1041 |
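
To quantify how many rows the price filter removes, a quick check (a sketch using the frames defined above):

removed = len(df) - len(df_no_outliers)
print(removed)  # rows dropped by the m_effective_daily_price > 0 filter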

Preprocessing

In preprocessing, I need to encode categorical features, handle empty values, split the dataset into training and validation sets, and so on. However, since I use XGBoost, there is no need to impute missing values or normalize the dataset.
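
As a minimal sketch on toy data (illustrative values, not from the challenge set), XGBoost learns a default direction for missing values at each split, so NaNs can be passed straight through without imputation or scaling:

import numpy as np
from xgboost import XGBClassifier

# Toy matrix containing NaNs; no imputation or scaling is applied.
X_toy = np.array([[1.0, np.nan],
                  [2.0, 0.5],
                  [np.nan, 1.5],
                  [3.0, np.nan]])
y_toy = np.array([0, 1, 0, 1])

clf = XGBClassifier(n_estimators=10)
clf.fit(X_toy, y_toy)      # NaNs are accepted as-is
print(clf.predict(X_toy))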

Raw Features List

Since gradient-boosted trees (and neural nets) handle a large number of inputs well, I keep as many raw features as possible:

1. m_effective_daily_price [continuous]
2. m_pricing_cleaning_fee [continuous]
3. dim_market [categorical]
4. dim_room_type [categorical]
5. dim_person_capacity [discrete]
6. dim_is_instant_bookable [nominal]
7. m_checkouts [discrete]
8. m_reviews [discrete]
9. days_since_last_booking [discrete]
10. cancel_policy [categorical]
11. image_quality_score [continuous]
12. m_total_overall_rating [discrete]
13. m_professional_pictures [discrete]
14. dim_has_wireless_internet [nominal]
15. ds_night_day_of_week [discrete]
16. ds_night_day_of_year [discrete]
17. ds_checkin_gap [discrete]
18. ds_checkout_gap [discrete]
19. occ_occupancy_plus_minus_7_ds_night [continuous]
20. occ_occupancy_plus_minus_14_ds_night [continuous]
21. occ_occupancy_trailing_90_ds [continuous]
22. m_minimum_nights [discrete]
23. m_maximum_nights [discrete]
24. price_booked_most_recent [continuous]
25. p2_p3_click_through_score [continuous]
26. p3_inquiry_score [continuous]
27. listing_m_listing_views_2_6_ds_night_decay [discrete]
28. general_market_m_unique_searchers_0_6_ds_night [discrete]
29. general_market_m_contacts_0_6_ds_night [discrete]
30. general_market_m_reservation_requests_0_6_ds_night [discrete]
31. general_market_m_is_booked_0_6_ds_night [discrete]
32. m_available_listings_ds_night [discrete]
33. kdt_score [continuous]
34. r_kdt_listing_views_0_6_avg_n100 [discrete]
35. r_kdt_n_active_n100 [discrete]
36. r_kdt_n_available_n100 [discrete]
37. r_kdt_m_effective_daily_price_n100_p50 [continuous]
38. r_kdt_m_effective_daily_price_available_n100_p50 [continuous]
39. r_kdt_m_effective_daily_price_booked_n100_p50 [continuous]

NULL

The question here is whether to drop records containing nulls or to fill them with some value. In this case nothing needs to be done, since XGBoost handles missing values natively. :)
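
To see which columns actually contain nulls, a quick check (a sketch using df_no_outliers from above):

# Count missing values per column; only columns with at least one null are shown.
null_counts = df_no_outliers.isnull().sum()
print(null_counts[null_counts > 0].sort_values(ascending=False))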

# Drop identifiers, raw dates, and coordinates that are not used as model features.
df_select = df_no_outliers.drop(columns=['ds_night',
                                         'ds',
                                         'id_listing_anon',
                                         'id_user_anon',
                                         'dim_lat',
                                         'dim_lng'])

Categorical Features

Encoding the string- and integer-valued categorical features:

from sklearn.preprocessing import OrdinalEncoder

# CategoricalEncoder only ever existed in a scikit-learn pre-release;
# OrdinalEncoder (scikit-learn >= 0.20) provides the same ordinal encoding.
cat_cols = ['dim_is_requested', 'dim_market', 'dim_room_type', 'cancel_policy', 'dim_is_instant_bookable']
enc = OrdinalEncoder()
encoded_features = enc.fit_transform(df_select[cat_cols])
encoded_df = pd.DataFrame(encoded_features, index=df_select.index, columns=cat_cols)
encoded_df.head(2)
|   | dim_is_requested | dim_market | dim_room_type | cancel_policy | dim_is_instant_bookable |
|---|------------------|------------|---------------|---------------|-------------------------|
| 0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
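
The fitted encoder exposes the category-to-code mapping, which is worth a glance before modeling (a sketch; cat_cols and enc refer to the encoding step above):

# Each entry of enc.categories_ lists one column's categories in the order
# that maps to the ordinal codes 0, 1, 2, ...
for col, cats in zip(cat_cols, enc.categories_):
    print(col, '->', list(cats))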
# Select the remaining numeric feature columns by position, then stack them
# next to the encoded categorical columns into a single numeric frame.
col = df_select.columns.tolist()
col = col[1:3] + col[5:6] + col[7:10] + col[11:]   # skip the five columns that were just encoded
col_cat = encoded_df.columns.tolist()
col_full = col_cat + col
df_numeric = df_select[col]

stack_full = np.column_stack([encoded_df, df_numeric])
stack_df = pd.DataFrame(stack_full, index=df_select.index, columns=col_full)
stack_df.head(2)
stack_df.head(2), first ten columns shown; the full frame is 2 rows × 40 columns:

|   | dim_is_requested | dim_market | dim_room_type | cancel_policy | dim_is_instant_bookable | m_effective_daily_price | m_pricing_cleaning_fee | dim_person_capacity | m_checkouts | m_reviews |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 110.0 | 60.0 | 2.0 | 24.0 | 19.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 70.0 | 0.0 | 2.0 | 105.0 | 55.0 |

Split

After the above transformations, we can split the samples into training and test sets. (Standardization would matter if the matrix were fed to a neural net; a scaling sketch follows the split below. It is unnecessary for XGBoost.)

from sklearn.model_selection import train_test_split

y = stack_df['dim_is_requested']
X = stack_df.drop(columns='dim_is_requested')
train_inputs, test_inputs, train_output, test_output = train_test_split(X, y, test_size=0.2, random_state=42)

Modeling

Here I chose XGBoost as the baseline model, since it outperformed neural nets in my experiments, and grid search is used to tune its parameters. First, check the class balance:

# Ratio of positive (requested) to negative examples.
class_1 = len(df_select.loc[df_select['dim_is_requested'] == 1])
class_0 = len(df_select.loc[df_select['dim_is_requested'] == 0])
class_1 / class_0
0.48869444264734424

Baseline Model

The classes are only mildly imbalanced (roughly 1:2), so I keep min_child_weight and scale_pos_weight at their defaults of 1.
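
For reference, if the imbalance were severe, scale_pos_weight is commonly set to the negative-to-positive ratio; a sketch using the counts computed above:

# Candidate weight for the positive class; here it is only ~2, so the default of 1 is kept.
scale_pos_weight_candidate = class_0 / class_1
print(round(scale_pos_weight_candidate, 2))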

import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
# cross_validation and grid_search were removed from scikit-learn; use model_selection.
from sklearn.model_selection import GridSearchCV
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams

rcParams['figure.figsize'] = 12, 4

def create_feature_map(file_name, features):
    # Write an XGBoost feature map: one "<index>\t<name>\tq" line per feature,
    # where "q" marks a quantitative feature.
    outfile = open(file_name, 'w')
    for i, feat in enumerate(features):
        outfile.write('{0}\t{1}\tq\n'.format(i, feat))
    outfile.close()

def modelfit(alg, X_train, y_train, X_test, y_test, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(X_train, label=y_train)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data
    alg.fit(X_train, y_train,eval_metric='auc')
        
    #Predict training set:
    dtrain_predictions = alg.predict(X_train)
    dtrain_predprob = alg.predict_proba(X_train)[:,1]
    
    #Predict test set:
    dtest_predictions = alg.predict(X_test)
    dtest_predprob = alg.predict_proba(X_test)[:,1]
        
    #Print model report:
    print("\nModel Report")
    print("Accuracy (Train): %.4g" % metrics.accuracy_score(y_train, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(y_train, dtrain_predprob))
    print("Accuracy (Test): %.4g" % metrics.accuracy_score(y_test, dtest_predictions))
    print("AUC Score (Test): %f" % metrics.roc_auc_score(y_test, dtest_predprob))
                    
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
    
create_feature_map('listingreq.fmap', X.columns)  # feature names only, excluding the label

xgb1 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb1, train_inputs, train_output, test_inputs, test_output)
Model Report
Accuracy (Train): 1
AUC Score (Train): 1.000000
Accuracy (Test): 0.8948
AUC Score (Test): 0.953460

(Figure: feature importance scores from the baseline model.)
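
Accuracy and AUC can hide the error mix, so a quick confusion matrix on the test split is a useful sanity check (a sketch; assumes the fitted xgb1 above):

from sklearn.metrics import confusion_matrix, classification_report

# Rows are true classes, columns are predicted classes.
test_pred = xgb1.predict(test_inputs)
print(confusion_matrix(test_output, test_pred))
print(classification_report(test_output, test_pred))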

Parameter Tuning

Tune max_depth and min_child_weight

Tuning the tree-related parameters; this may take a while (~20 min):

param_test1 = {
 'max_depth': [4,7,10],
 'min_child_weight': [1,4,7]
}
gsearch1 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=4,
                            min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test1, scoring='roc_auc', n_jobs=4, cv=5)
gsearch1.fit(train_inputs, train_output)
# grid_scores_ was removed from scikit-learn; cv_results_ holds the per-candidate scores.
pd.DataFrame(gsearch1.cv_results_)[['params', 'mean_test_score', 'std_test_score']], gsearch1.best_params_, gsearch1.best_score_
The cross-validation scores from this run:
([mean: 0.87861, std: 0.00157, params: {'max_depth': 4, 'min_child_weight': 1},
 mean: 0.87862, std: 0.00156, params: {'max_depth': 4, 'min_child_weight': 4}, 
mean: 0.87883, std: 0.00135, params: {'max_depth': 4, 'min_child_weight': 7}, 
mean: 0.90129, std: 0.00111, params: {'max_depth': 7, 'min_child_weight': 1},
 mean: 0.90073, std: 0.00122, params: {'max_depth': 7, 'min_child_weight': 4}, 
mean: 0.90003, std: 0.00130, params: {'max_depth': 7, 'min_child_weight': 7}, 
mean: 0.91802, std: 0.00075, params: {'max_depth': 10, 'min_child_weight': 1}, 
mean: 0.91539, std: 0.00069, params: {'max_depth': 10, 'min_child_weight': 4}, 
mean: 0.91413, std: 0.00114, params: {'max_depth': 10, 'min_child_weight': 7}], 
{'max_depth': 10, 'min_child_weight': 1}, 0.9180213075672359)

Tune gamma

Tuning the minimum split loss (gamma) with the updated max_depth and min_child_weight:

param_test2 = {
 'gamma':[i/10.0 for i in range(0,10,2)]
}
gsearch3 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=10,
                            min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test2, scoring='roc_auc', n_jobs=4, cv=5)
gsearch3.fit(train_inputs, train_output)
pd.DataFrame(gsearch3.cv_results_)[['params', 'mean_test_score', 'std_test_score']], gsearch3.best_params_, gsearch3.best_score_
The cross-validation scores from this run:
[mean: 0.91802, std: 0.00075, params: {'gamma': 0.0}, 
mean: 0.91783, std: 0.00060, params: {'gamma': 0.2}, 
mean: 0.91827, std: 0.00081, params: {'gamma': 0.4}, 
mean: 0.91805, std: 0.00089, params: {'gamma': 0.6}, 
mean: 0.91786, std: 0.00086, params: {'gamma': 0.8}] 
{'gamma': 0.4} 0.9182724612977561

Tune subsample and colsample_bytree

Tuning the row and column subsampling parameters with the optimal settings found above:

param_test3 = {
 'subsample': [i/10.0 for i in range(6, 11, 2)],         # 0.6, 0.8, 1.0
 'colsample_bytree': [i/10.0 for i in range(6, 11, 2)]   # 0.6, 0.8, 1.0
}
gsearch4 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=10,
                            min_child_weight=1, gamma=0.4, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test3, scoring='roc_auc', n_jobs=4, cv=5)
gsearch4.fit(train_inputs, train_output)
pd.DataFrame(gsearch4.cv_results_)[['params', 'mean_test_score', 'std_test_score']], gsearch4.best_params_, gsearch4.best_score_
The cross-validation scores from this run:
[mean: 0.91585, std: 0.00114, params: {'colsample_bytree': 0.6, 'subsample': 0.6}, 
mean: 0.91825, std: 0.00058, params: {'colsample_bytree': 0.6, 'subsample': 0.8}, 
mean: 0.91992, std: 0.00087, params: {'colsample_bytree': 0.6, 'subsample': 1.0}, 
mean: 0.91589, std: 0.00125, params: {'colsample_bytree': 0.8, 'subsample': 0.6}, 
mean: 0.91827, std: 0.00081, params: {'colsample_bytree': 0.8, 'subsample': 0.8}, 
mean: 0.91914, std: 0.00085, params: {'colsample_bytree': 0.8, 'subsample': 1.0}, 
mean: 0.91548, std: 0.00140, params: {'colsample_bytree': 1.0, 'subsample': 0.6}, 
mean: 0.91703, std: 0.00062, params: {'colsample_bytree': 1.0, 'subsample': 0.8}, 
mean: 0.91725, std: 0.00088, params: {'colsample_bytree': 1.0, 'subsample': 1.0}] 
{'colsample_bytree': 0.6, 'subsample': 1.0} 0.9199152032945523

Tune Regularization Parameters

Tuning the L2 regularization term (reg_lambda) to curb overfitting:

param_test4 = {
 'reg_lambda':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4,
                            min_child_weight=6, gamma=0.4, subsample=1.0, colsample_bytree=0.6,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test4, scoring='roc_auc', n_jobs=4, cv=5)
gsearch6.fit(train_inputs, train_output)
pd.DataFrame(gsearch6.cv_results_)[['params', 'mean_test_score', 'std_test_score']], gsearch6.best_params_, gsearch6.best_score_
The cross-validation scores from this run:
[mean: 0.91595, std: 0.00120, params: {'reg_lambda': 1e-05}, 
 mean: 0.91595, std: 0.00095, params: {'reg_lambda': 0.01}, 
 mean: 0.91525, std: 0.00079, params: {'reg_lambda': 0.1}, 
 mean: 0.91548, std: 0.00140, params: {'reg_lambda': 1}, 
 mean: 0.89979, std: 0.00111, params: {'reg_lambda': 100}] 
{'reg_lambda': 1e-05} 0.9159459589963668

Tune Learning Rate

Finally, refit with the tuned parameters. modelfit again uses xgb.cv with early stopping to choose the number of trees; lowering the learning rate while raising n_estimators could be explored in the same way.

xgb2 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=5000,
 max_depth=10,
 min_child_weight=1,
 gamma=0.4,
 subsample=1.0,
 colsample_bytree=0.6,
 reg_lambda=1e-05,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb2, train_inputs, train_output, test_inputs, test_output)