Predict Zillow home price ranges in Austin, TX using XGBoost
In this post I break down a few strategies I used in my SLICED competitive submission and comment on what I would have done beyond the 2-hour limit.
In season 1, episode 11 of SLICED, a Kaggle-hosted competitive data science streaming show, contestants were given the task of predicting Zillow home price ranges.
I competed alongside the contestants in the 2-hour window and did pretty well: the final private leaderboard had me at 2nd place out of 23 contestants. It's amazing to see such an active and engaged data science community, and I'm excited to have had the opportunity to compete!
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import KFold, RandomizedSearchCV
import xgboost as xgb
from tqdm.notebook import tqdm
from sklearn.metrics import log_loss
import matplotlib.pyplot as plt
pal = sns.color_palette()
project_dir = "/content/drive/My Drive/ds_projects/sliced_s01_e11"
df_train = pd.read_csv(f"{project_dir}/train.csv")
df_test = pd.read_csv(f"{project_dir}/test.csv")
The training data has 10,000 homes with no missing values. The features are mostly numerical, with a few categorical features that may be useful.
df_train.info()
We will be predicting the 'priceRange' variable. This is a multi-class problem.
Thankfully, there does not appear to be a severe class imbalance. There are fewer very expensive homes in the \$650,000+ range and fewer lower-priced homes in the \$0-\$250,000 range.
df_train['priceRange'].value_counts()
What is an important predictor of price? Location, location, location
Austin prices are more expensive in the inner-most part of the city and are typically cheaper in the south and east sides of the city. Latitude and longitude are expected to be great predictors of price.
price_dict = {'650000+': 600000,
              '350000-450000': 400000,
              '0-250000': 200000,
              '450000-650000': 500000,
              '250000-350000': 300000}
df_train['price'] = df_train['priceRange'].map(price_dict)
sns.scatterplot(data=df_train, x="longitude", y="latitude", hue='price', palette='Blues')
plt.legend(bbox_to_anchor=(1.05,1), loc=2)
plt.title('Austin prices differ based on location', loc='left', fontdict = {'fontsize' : 16});
The number of bedrooms and bathrooms may be useful for predicting price range. The boxplots below show that the number of bathrooms may be more helpful than number of bedrooms.
plt.figure(figsize=(10,5))
sns.boxplot(data=df_train, x="priceRange", y="numOfBathrooms",
order=['0-250000','250000-350000', '350000-450000',
'450000-650000', '650000+'])
plt.title('Median # of bathrooms is 2 for home prices less than \$650K', loc='left', fontdict = {'fontsize' : 16});
plt.figure(figsize=(10,5))
sns.boxplot(data=df_train, x="priceRange", y="numOfBedrooms",
order=['0-250000','250000-350000', '350000-450000',
'450000-650000', '650000+'])
plt.title('Number of Bedrooms does not appear to differ by price range', loc='left', fontdict = {'fontsize' : 16});
There are 3 categorical features besides the target. The description feature offers great opportunities for engineered text features in the future.
df_train.columns[df_train.dtypes == 'object']
I've looked at home price data in the past and found that features similar to home type are significant predictors.
Let's quickly dummy-encode the homeType feature and prepare the features for modeling.
# mark test rows with a sentinel so we can split them back out after encoding
df_test['priceRange'] = -1
# join the train and test dataframes so the dummy columns match
df = pd.concat([df_train, df_test])
# dummy encoding
df = pd.get_dummies(df, columns=['homeType'])
# split back into train and test
df_train = df.loc[df['priceRange'] != -1, :]
df_test = df.loc[df['priceRange'] == -1, :]
num_feats = [
'latitude',
'longitude',
'yearBuilt',
'garageSpaces',
'numOfPatioAndPorchFeatures',
'lotSizeSqFt',
'avgSchoolRating',
'MedianStudentsPerTeacher',
'numOfBathrooms',
'numOfBedrooms',
'hasSpa'
]
# only considering homeType dummy features. I might try description later.
cat_feats = [f for f in df_train.columns if f.startswith('homeType')]
# define the target we are going to predict
target = 'priceRange'
First, let's use KFold cross validation to build a baseline XGBoost model using default parameters to measure our Multiclass log loss.
folds = 10
kf = KFold(folds)
df_preds = pd.DataFrame()
price_classes = ['0-250000', '250000-350000', '350000-450000',
                 '450000-650000', '650000+']
for train_idx, test_idx in tqdm(kf.split(df_train), total=folds):
    train_data = df_train.iloc[train_idx]
    test_data = df_train.iloc[test_idx].copy()
    xgb_mod = xgb.XGBClassifier(objective='multi:softprob',
                                n_jobs=-1,
                                num_class=5,
                                learning_rate=0.01,
                                n_estimators=1000,
                                random_state=20210810)
    xgb_mod.fit(train_data.loc[:, num_feats + cat_feats], train_data[target])
    # a single predict_proba call; columns follow sklearn's sorted class order
    probs = xgb_mod.predict_proba(test_data.loc[:, num_feats + cat_feats])
    for i, cls in enumerate(price_classes):
        test_data[cls] = probs[:, i]
    df_preds = pd.concat([df_preds, test_data])
log_loss(df_preds['priceRange'], df_preds.loc[:, price_classes])
That is not a great CV log loss but it provides a baseline for comparison after we make improvements.
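For intuition on the metric itself, here is a quick toy example (not drawn from the competition data) showing how multiclass log loss rewards confident correct predictions and penalizes uninformative ones:

```python
import math
from sklearn.metrics import log_loss

# two homes, two illustrative price range classes
y_true = ['0-250000', '650000+']
classes = ['0-250000', '650000+']

confident = [[0.9, 0.1], [0.1, 0.9]]  # mostly right, high confidence
uniform = [[0.5, 0.5], [0.5, 0.5]]    # no information at all

print(log_loss(y_true, confident, labels=classes))  # -ln(0.9) ≈ 0.105
print(log_loss(y_true, uniform, labels=classes))    # ln(2) ≈ 0.693
```

A model that always predicts the uniform distribution over five classes would score ln(5) ≈ 1.61, so our baseline is at least learning something.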
Let's get serious, tune parameters, and then measure the performance improvement (hopefully).
Tuning parameters can be done in a few ways:
- Manually
- Greedy Grid Search
- Randomized Grid Search
- Bayesian optimization
- ... many others
Analytics Vidhya has an excellent article on different parameter tuning strategies and implementations that I highly recommend.
folds = 10
kf = KFold(folds)
df_preds = pd.DataFrame()
price_classes = ['0-250000', '250000-350000', '350000-450000',
                 '450000-650000', '650000+']
for train_idx, test_idx in tqdm(kf.split(df_train), total=folds):
    train_data = df_train.iloc[train_idx]
    test_data = df_train.iloc[test_idx].copy()
    xgb_mod = xgb.XGBClassifier(objective='multi:softprob',
                                n_jobs=-1,
                                # hyperparameters found via randomized grid search
                                num_class=5,
                                subsample=0.8,
                                n_estimators=1200,
                                min_child_weight=5,
                                max_depth=7,
                                learning_rate=0.015,
                                gamma=1,
                                colsample_bytree=0.6,
                                random_state=20210810)
    xgb_mod.fit(train_data.loc[:, num_feats + cat_feats], train_data[target])
    probs = xgb_mod.predict_proba(test_data.loc[:, num_feats + cat_feats])
    for i, cls in enumerate(price_classes):
        test_data[cls] = probs[:, i]
    df_preds = pd.concat([df_preds, test_data])
log_loss(df_preds['priceRange'], df_preds.loc[:, price_classes])
A log loss of 0.89 is an 11% improvement over the baseline model's 1.00. It's great to see the value of tuning parameters and its impact on log loss.
XGBoost feature importance plots show which features the model leans on most heavily and which contribute relatively little.
importances = xgb_mod.feature_importances_
idxs = np.argsort(importances)
plt.figure(figsize=(10,6))
plt.title('Feature Importances')
plt.barh(range(len(idxs)), importances[idxs], align='center')
plt.yticks(range(len(idxs)), [train_data.loc[:, num_feats+cat_feats].columns[i] for i in idxs])
plt.xlabel('Feature Importance')
plt.show()
The 2 hour time limit was a great forcing function to focus on the most necessary activities to deliver a strong model.
In future iterations I would test:
- Text analysis on the description feature to extract additional information.
- The average school rating and median students per teacher features in more depth.
- External datasets to factor in additional socioeconomic variables.
- Additional location-based features.
- Additional models (LightGBM, CatBoost, etc.).
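As a taste of the first idea, a minimal sketch of simple text features on the description field; the sample descriptions and keyword list are purely hypothetical:

```python
import pandas as pd

# made-up listing descriptions standing in for the real column
df = pd.DataFrame({'description': [
    'Stunning remodeled home with pool and granite counters',
    'Fixer upper near downtown, great investment opportunity',
]})

# length-based features
df['desc_len'] = df['description'].str.len()
df['desc_words'] = df['description'].str.split().str.len()

# keyword indicator features (keywords chosen for illustration)
for kw in ['pool', 'remodeled', 'granite', 'fixer']:
    df[f'desc_has_{kw}'] = (df['description'].str.lower()
                              .str.contains(kw).astype(int))

print(df.filter(like='desc_'))
```

Beyond simple flags like these, TF-IDF features or embeddings of the description could be fed into the same XGBoost pipeline.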