
In season 1 episode 11 of SLICED, a Kaggle-hosted competitive data science streaming show, contestants were given the task of predicting Zillow home price ranges.

I competed alongside the contestants in the 2-hour window and actually did pretty well. The final private leaderboard had me in 2nd place out of 23 contestants. It's amazing to see such an active and engaged data science community, and I'm excited to have the opportunity to compete!

Import packages and set up the train/test data

import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import KFold, RandomizedSearchCV
import xgboost as xgb
from tqdm.notebook import tqdm 
from sklearn.metrics import log_loss
import matplotlib.pyplot as plt

pal = sns.color_palette()
project_dir = "/content/drive/My Drive/ds_projects/sliced_s01_e11"

df_train = pd.read_csv(f"{project_dir}/train.csv")
df_test = pd.read_csv(f"{project_dir}/test.csv")

The training data has 10,000 homes with no missing values. The features are mostly numerical, with a few categorical features that may be useful.

df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   uid                         10000 non-null  int64  
 1   city                        10000 non-null  object 
 2   description                 10000 non-null  object 
 3   homeType                    10000 non-null  object 
 4   latitude                    10000 non-null  float64
 5   longitude                   10000 non-null  float64
 6   garageSpaces                10000 non-null  int64  
 7   hasSpa                      10000 non-null  bool   
 8   yearBuilt                   10000 non-null  int64  
 9   numOfPatioAndPorchFeatures  10000 non-null  int64  
 10  lotSizeSqFt                 10000 non-null  float64
 11  avgSchoolRating             10000 non-null  float64
 12  MedianStudentsPerTeacher    10000 non-null  int64  
 13  numOfBathrooms              10000 non-null  float64
 14  numOfBedrooms               10000 non-null  int64  
 15  priceRange                  10000 non-null  object 
dtypes: bool(1), float64(5), int64(6), object(4)
memory usage: 1.2+ MB

We will be predicting the 'priceRange' variable, which makes this a multi-class classification problem.

Thankfully, there does not appear to be a severe class imbalance. There are fewer very expensive homes in the \$650,000+ range and fewer lower-priced homes in the \$0-\$250,000 range.

df_train['priceRange'].value_counts()
250000-350000    2356
350000-450000    2301
450000-650000    2275
650000+          1819
0-250000         1249
Name: priceRange, dtype: int64
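For context, a model that always predicted these class frequencies (a prior-only baseline) would score the entropy of the class distribution. This is a quick back-of-the-envelope check, not part of the original stream workflow:

```python
import numpy as np

# class counts from value_counts() above
counts = np.array([2356, 2301, 2275, 1819, 1249])
priors = counts / counts.sum()

# log loss of always predicting the class priors = entropy of the distribution
prior_log_loss = -np.sum(priors * np.log(priors))
print(round(prior_log_loss, 3))  # ~1.585
```

Any useful model should comfortably beat this ~1.59 prior-only score.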

Exploratory Analysis

What is an important predictor of price? Location, location, location.

Austin prices are more expensive in the inner-most part of the city and are typically cheaper in the south and east sides of the city. Latitude and longitude are expected to be great predictors of price.

price_dict = {'650000+': 600000,
              '350000-450000': 400000,
              '0-250000': 200000,
              '450000-650000': 500000,
              '250000-350000': 300000}

df_train['price'] = df_train['priceRange'].map(price_dict)
sns.scatterplot(data=df_train, x="longitude", y="latitude", hue='price', palette='Blues')
plt.legend(bbox_to_anchor=(1.05,1), loc=2)
plt.title('Austin prices differ based on location', loc='left', fontdict = {'fontsize' : 16});

The number of bedrooms and bathrooms may be useful for predicting price range. The boxplots below show that the number of bathrooms may be more helpful than number of bedrooms.

plt.figure(figsize=(10,5))
sns.boxplot(data=df_train, x="priceRange", y="numOfBathrooms",
            order=['0-250000','250000-350000', '350000-450000',
                   '450000-650000', '650000+'])
plt.title('Median # of bathrooms is 2 for home prices less than \$650K', loc='left', fontdict = {'fontsize' : 16});
plt.figure(figsize=(10,5))
sns.boxplot(data=df_train, x="priceRange", y="numOfBedrooms",
            order=['0-250000','250000-350000', '350000-450000',
                   '450000-650000', '650000+'])
plt.title('Number of Bedrooms does not appear to differ by price range', loc='left', fontdict = {'fontsize' : 16});

There are three categorical features besides the target. The description feature offers great opportunities for text-based feature engineering in the future.

df_train.columns[df_train.dtypes == 'object']
Index(['city', 'description', 'homeType', 'priceRange'], dtype='object')

I've looked at home price data in the past and found that features similar to home type are significant predictors.

Let's quickly dummy-encode the homeType feature and prepare the features for modeling.

# flag test rows with a sentinel so they can be split back out after encoding
df_test['priceRange'] = -1

# join the train and test dataframes
df = pd.concat([df_train, df_test])

# dummy encoding
df = pd.get_dummies(df, columns = ['homeType'])

df_train = df.loc[df['priceRange']!=-1,:]
df_test = df.loc[df['priceRange'] == -1,:]
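Concatenating train and test before calling get_dummies matters: encoded separately, the two frames can end up with different dummy columns whenever a category appears in only one of them. A small sketch with made-up homeType values:

```python
import pandas as pd

# hypothetical homeType values for illustration
tr = pd.DataFrame({'homeType': ['Single Family', 'Condo']})
te = pd.DataFrame({'homeType': ['Single Family', 'Townhouse']})

# encoded separately, the dummy columns disagree
sep_tr = pd.get_dummies(tr, columns=['homeType'])
sep_te = pd.get_dummies(te, columns=['homeType'])
print(sep_tr.columns.tolist())  # ['homeType_Condo', 'homeType_Single Family']
print(sep_te.columns.tolist())  # ['homeType_Single Family', 'homeType_Townhouse']

# encoded jointly, every row sees the same column set
joint = pd.get_dummies(pd.concat([tr, te]), columns=['homeType'])
print(joint.columns.tolist())
```

Encoding jointly guarantees the model sees an identical feature layout at train and prediction time.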

Modeling

num_feats = [
     'latitude',
     'longitude',
     'yearBuilt', 
     'garageSpaces',
     'numOfPatioAndPorchFeatures',
     'lotSizeSqFt', 
     'avgSchoolRating', 
     'MedianStudentsPerTeacher',
     'numOfBathrooms', 
     'numOfBedrooms',
     'hasSpa'
]

# only considering homeType dummy features. I might try description later.
cat_feats = [f for f in df_train.columns if 'homeType' in f]

# define the target we are going to predict
target = 'priceRange'

First, let's use KFold cross-validation to build a baseline XGBoost model with near-default parameters and measure our multiclass log loss.

folds = 10
kf = KFold(folds)

price_classes = ['0-250000', '250000-350000', '350000-450000',
                 '450000-650000', '650000+']

df_preds = pd.DataFrame()
for train_idx, test_idx in tqdm(kf.split(df_train), total=folds):
  train_data = df_train.iloc[train_idx]
  test_data = df_train.iloc[test_idx].copy()

  xgb_mod = xgb.XGBClassifier(objective = 'multi:softprob',
                              n_jobs=-1,
                              num_class = 5,
                              learning_rate=0.01,
                              n_estimators=1000,
                              random_state=20210810
                             )
  # note: recent XGBoost versions require integer-encoded class labels;
  # wrap the target with LabelEncoder if fit() rejects the strings
  xgb_mod.fit(train_data.loc[:, num_feats+cat_feats], train_data[target])

  # predict once per fold and assign all five class-probability columns
  # (predict_proba orders columns by sorted class label, matching price_classes)
  test_data[price_classes] = xgb_mod.predict_proba(test_data.loc[:, num_feats+cat_feats])

  # DataFrame.append was removed in pandas 2.0; use pd.concat instead
  df_preds = pd.concat([df_preds, test_data])
log_loss(df_preds['priceRange'], df_preds.loc[:, price_classes])
1.0048688145888969

That is not a great CV log loss, but it provides a baseline for comparison after we make improvements.
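One subtlety with sklearn's log_loss: when y_true holds string labels, the probability columns must follow sorted label order, which the price-range strings happen to satisfy here. Passing labels= explicitly makes the mapping unambiguous. A minimal sketch with made-up probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

classes = ['0-250000', '250000-350000', '350000-450000',
           '450000-650000', '650000+']
# lexicographic sort happens to match the intended price order here
assert sorted(classes) == classes

y_true = ['0-250000', '650000+']
probs = np.array([[0.90, 0.025, 0.025, 0.025, 0.025],
                  [0.05, 0.05, 0.05, 0.05, 0.80]])

# labels= pins each probability column to a class explicitly
ll = log_loss(y_true, probs, labels=classes)
print(round(ll, 3))  # -(ln 0.9 + ln 0.8) / 2 ≈ 0.164
```

If the class strings ever sorted differently than the column order, log_loss would silently score the wrong columns, so the explicit labels= is cheap insurance.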

Let's get serious, tune parameters, and then measure the performance improvement (hopefully).

Tuning parameters can be done in a few ways:

  1. Manually
  2. Greedy grid search (tuning one parameter at a time)
  3. Randomized grid search
  4. Bayesian optimization
  5. ...many others

Analytics Vidhya has an excellent article on different parameter tuning strategies and implementations that I highly recommend.

folds = 10
kf = KFold(folds)

price_classes = ['0-250000', '250000-350000', '350000-450000',
                 '450000-650000', '650000+']

df_preds = pd.DataFrame()
for train_idx, test_idx in tqdm(kf.split(df_train), total=folds):
  train_data = df_train.iloc[train_idx]
  test_data = df_train.iloc[test_idx].copy()

  xgb_mod = xgb.XGBClassifier(objective = 'multi:softprob',
                              n_jobs=-1,
                              # hyperparameters tuned using randomized grid search
                              num_class = 5,
                              subsample= 0.8, 
                              n_estimators = 1200,
                              min_child_weight = 5, 
                              max_depth = 7,
                              learning_rate = 0.015,
                              gamma = 1,
                              colsample_bytree = 0.6,
                              random_state=20210810
                             )

  xgb_mod.fit(train_data.loc[:, num_feats+cat_feats], train_data[target])

  # predict once per fold and assign all five class-probability columns
  test_data[price_classes] = xgb_mod.predict_proba(test_data.loc[:, num_feats+cat_feats])

  # DataFrame.append was removed in pandas 2.0; use pd.concat instead
  df_preds = pd.concat([df_preds, test_data])
log_loss(df_preds['priceRange'], df_preds.loc[:, price_classes])
0.8938031872542342

A log loss of 0.89 is an 11% improvement over the baseline model's 1.00. It's great to see the value of tuning parameters and its impact on log loss.

The XGBoost feature importance plot gives insight into which features carry the most weight in the model and which contribute relatively little.

importances = xgb_mod.feature_importances_
idxs = np.argsort(importances)
feat_names = num_feats + cat_feats

plt.figure(figsize=(10,6))
plt.title('Feature Importances')
plt.barh(range(len(idxs)), importances[idxs], align='center')
plt.yticks(range(len(idxs)), [feat_names[i] for i in idxs])
plt.xlabel('Feature Importance')
plt.show()

Wrap up

The 2-hour time limit was a great forcing function, focusing effort on the activities most necessary to deliver a strong model.

In future iterations I would test:

  1. Text analysis on the description feature to extract additional information.
  2. Deeper exploration of average school rating and median students per teacher.
  3. External datasets that capture additional socioeconomic factors.
  4. Additional location-based features.
  5. Additional models (LightGBM, CatBoost, etc.).