
For those of you who are not familiar with SLICED, it is one of the fastest-growing data science competitions. I've benefited from competitors sharing their work and from working through the datasets myself. Another benefit is that the datasets are provided and cleaned by the organizers, so aspiring data scientists can learn modeling without hunting for datasets.

The dataset I will be analyzing in this post is from Season 1 Episode 1 of SLICED, hosted on Kaggle.

This solution provides a top-5 result on the private leaderboard, but I encourage you to build on it with additional feature engineering and CatBoost hyperparameter tuning. Let's get started.

Import libraries and data

!pip install catboost

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from tqdm.notebook import tqdm
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor

pal = sns.color_palette()

df_train = pd.read_csv("/content/drive/MyDrive/ds_projects/sliced_s01_e01/train.csv")
df_test = pd.read_csv("/content/drive/MyDrive/ds_projects/sliced_s01_e01/test.csv")

The first thing I notice is that the train and test sets are small: 3,499 rows in train and 1,500 in test.

df_train.shape, df_test.shape
((3499, 26), (1500, 25))

A quick look at the data shows that missing values will not be an issue outside the category columns, where NaNs simply mean a game has fewer categories. Each game can have up to 12 categories, spread across the category1 through category12 columns; we'll need to stack them later.

df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3499 entries, 0 to 3498
Data columns (total 26 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   game_id      3499 non-null   int64  
 1   names        3499 non-null   object 
 2   min_players  3499 non-null   int64  
 3   max_players  3499 non-null   int64  
 4   avg_time     3499 non-null   int64  
 5   min_time     3499 non-null   int64  
 6   max_time     3499 non-null   int64  
 7   year         3499 non-null   int64  
 8   geek_rating  3499 non-null   float64
 9   num_votes    3499 non-null   int64  
 10  age          3499 non-null   int64  
 11  mechanic     3499 non-null   object 
 12  owned        3499 non-null   int64  
 13  category1    3499 non-null   object 
 14  category2    2888 non-null   object 
 15  category3    1726 non-null   object 
 16  category4    863 non-null    object 
 17  category5    401 non-null    object 
 18  category6    136 non-null    object 
 19  category7    46 non-null     object 
 20  category8    19 non-null     object 
 21  category9    5 non-null      object 
 22  category10   4 non-null      object 
 23  category11   1 non-null      object 
 24  category12   1 non-null      object 
 25  designer     3499 non-null   object 
dtypes: float64(1), int64(10), object(15)
memory usage: 710.9+ KB

Geek rating is the variable we are tasked with predicting. I'm surprised that the scale is so tight; the organizers must have truncated the data. The mean rating is 6.09, the minimum is 5.64, and the maximum is 8.50.

mean_rating = df_train['geek_rating'].mean()

sns.histplot(df_train['geek_rating'])
plt.axvline(mean_rating, color = 'red')
sns.despine()
plt.title(f'Geek Rating distribution with a mean of {round(mean_rating,2)}');
df_train['geek_rating'].describe().to_frame()
geek_rating
count 3499.000000
mean 6.087625
std 0.482988
min 5.640240
25% 5.726210
50% 5.904700
75% 6.302635
max 8.501630

Feature engineering using the numerical features

Quite a few variables have skewed distributions with long tails. This presents opportunities for transformations.

num_feats = ['min_players', 'max_players', 'avg_time',
             'min_time', 'max_time', 'year',
             'num_votes', 'age', 'owned']

# 3 x 3 grid of histograms, one per numeric feature
nrows = round(len(num_feats) / 3)
ncols = 3
c = 1

fig = plt.figure(figsize=(20, 20))

for i in num_feats:
  plt.subplot(nrows, ncols, c)
  sns.histplot(df_train[i])
  plt.xlabel(i)
  c += 1

Log transforming the owned and num_votes variables makes them approximately normal, which may be useful for modeling.

f, axes = plt.subplots(1, 2, figsize=[10, 5])

sns.histplot(df_train['owned'], ax=axes[0])
sns.histplot(np.log(df_train['owned']), ax=axes[1])
sns.despine()
axes[0].set_title('Owned variable has a skewed distribution')
axes[1].set_title('Log transform makes Owned approximately normal');

f, axes = plt.subplots(1, 2, figsize=[10, 5])

sns.histplot(df_train['num_votes'], ax=axes[0])
sns.histplot(np.log(df_train['num_votes']), ax=axes[1])
sns.despine()
axes[0].set_title('Num votes variable has a skewed distribution')
axes[1].set_title('Log transform makes Num votes approximately normal');
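One caveat, since I reuse this transform when preparing features later: np.log returns -inf for zero counts. Both columns appear to be strictly positive in this dataset, but np.log1p is the safer variant if that assumption ever breaks. A minimal sketch:

# np.log1p computes log(1 + x), so zero counts map to 0 instead of -inf
owned_log_safe = np.log1p(df_train['owned'])
num_votes_log_safe = np.log1p(df_train['num_votes'])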

The correlation plot shows relationships between some predictors. As expected, min, max, and average time are correlated with one another.

Owned and num_votes are the most correlated with geek rating. Even with limited feature engineering, I expect those two variables to rank highly in our first model's variable importance plot.

corr = df_train.loc[:, num_feats + ['geek_rating']].corr()

mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.title('Num votes and Owned are correlated with geek rating');

Feature engineering using the categorical columns

68.8% of games have one designer, but over 1,000 games have multiple designers.

However, the boxplot does not clearly show that multi-designer games earn higher ratings. Games with three or more designers (bucketed together below) rate slightly higher, but on a smaller sample size.

df_train['nbr_designers'] = df_train['designer'].apply(lambda x: x.count(',')+1)
# bucket 3+ designers
df_train['nbr_designers'] = df_train['nbr_designers'].apply(lambda x: 3 if x > 3 else x)
designer_df = df_train.loc[:, ['designer', 'nbr_designers','geek_rating']].reset_index()
df_train['nbr_designers'].value_counts(normalize=True)*100
1    68.762504
2    23.692484
3     7.545013
Name: nbr_designers, dtype: float64
sns.boxplot(x = 'nbr_designers', y = 'geek_rating', data = df_train)
sns.despine()
plt.title('The number of designers is not necessarily significant');

Reiner Knizia has the most credited games at 80. The next closest designer (excluding uncredited) is Martin Wallace with 35 games.

designer_stack = (
    df_train.loc[:, 'designer']
    .str.split(', ', expand=True)
    .stack()
    .to_frame()
    .reset_index()
    .loc[:, ['level_0', 0]]
    .rename(columns={'level_0': 'index', 0: 'designer'})
)
designer = designer_stack.merge(designer_df, on='index')
(
    designer.loc[designer['nbr_designers'] == 1, :]
    .groupby('designer_x')
    .agg(mean_geek_rating=('geek_rating', np.mean), nbr_games=('geek_rating', len))
    .reset_index()
    .sort_values(by='nbr_games', ascending=False)
    .reset_index()
    .head(20)
)
index designer_x mean_geek_rating nbr_games
0 860 Reiner Knizia 6.075032 80.0
1 0 (Uncredited) 6.031453 50.0
2 697 Martin Wallace 6.304218 35.0
3 374 Friedemann Friese 6.197290 27.0
4 1094 none 5.801660 24.0
5 609 Klaus Teuber 6.058212 24.0
6 740 Michael Schacht 6.112761 20.0
7 1052 Uwe Rosenberg 6.764786 19.0
8 920 Rüdiger Dorn 6.372285 17.0
9 209 Corné van Moorsel 5.926044 16.0
10 960 Stefan Feld 6.753361 16.0
11 644 Leo Colovini 5.901001 15.0
12 959 Stefan Dorra 6.105082 15.0
13 292 Dirk Henn 6.253423 15.0
14 16 Alan R. Moon 6.199346 15.0
15 873 Richard H. Berg 5.789186 14.0
16 279 Dean Essig 5.835333 14.0
17 611 Klaus-Jürgen Wrede 6.322541 13.0
18 978 Steve Jackson (I) 5.904472 13.0
19 1057 Vlaada Chvátil 6.648310 13.0

Nikki Valens is the most popular solo designer, followed by Adam Poots.

solo_designers = (
    designer.loc[designer['nbr_designers'] == 1, :]
    .groupby('designer_x')
    .agg(mean_geek_rating=('geek_rating', np.mean), nbr_games=('geek_rating', len))
    .reset_index()
)
top_solo_designers = solo_designers.sort_values(by='mean_geek_rating', ascending=False).head(20).reset_index()
bottom_solo_designers = solo_designers.sort_values(by='mean_geek_rating', ascending=True).head(20).reset_index()
plt.figure(figsize=[10,10])
sns.barplot(y = 'designer_x', x = 'mean_geek_rating', data = top_solo_designers)
sns.despine()
plt.ylabel('Designer')
plt.title('Most Popular Solo Game Designers');

The Card Game, Wargame, and Fantasy categories have the most games.

cat_col_names = [i for i in df_train.columns.values if i.startswith('cat')]

category_stacked = (
    df_train.loc[:, cat_col_names]
    .stack()
    .reset_index()
    .rename(columns={'level_0': 'index', 0: 'category'})
    .loc[:, ['index', 'category']]
)

category_stacked['category'] = category_stacked['category'].str.strip()

# merge with geek rating and aggregate
geek_rating_by_category = category_stacked.merge(
    df_train.loc[:, ['geek_rating']].reset_index(), on='index'
)

geek_rating_by_category = (
    geek_rating_by_category.groupby('category')
    .agg(mean_geek_rating=('geek_rating', np.mean), nbr_games=('geek_rating', len))
    .reset_index()
)
geek_rating_by_category.sort_values('nbr_games', ascending=False).head(20)
category mean_geek_rating nbr_games
14 Card Game 6.049298 969.0
78 Wargame 5.975838 635.0
29 Fantasy 6.172004 535.0
23 Economic 6.309115 395.0
31 Fighting 6.260619 380.0
67 Science Fiction 6.190278 327.0
22 Dice 6.067265 265.0
12 Bluffing 6.112396 229.0
2 Adventure 6.238299 223.0
44 Miniatures 6.205932 223.0
81 World War II 5.922496 221.0
42 Medieval 6.229393 217.0
28 Exploration 6.165319 206.0
21 Deduction 6.128765 203.0
55 Party Game 6.057245 191.0
0 Abstract Strategy 6.014958 186.0
9 Animals 6.038774 182.0
8 Ancient 6.168602 174.0
16 City Building 6.319009 171.0
34 Humor 6.035750 145.0

Mature / Adult, Farming, and Environmental are the categories with the highest-rated games, though some of these top categories contain only a handful of games (Mature / Adult covers just 11).

geek_rating_by_category.sort_values('mean_geek_rating', ascending=False).head(20)
category mean_geek_rating nbr_games
39 Mature / Adult 6.541271 11.0
30 Farming 6.443360 56.0
26 Environmental 6.396710 40.0
59 Post-Napoleonic 6.374407 18.0
35 Industry / Manufacturing 6.368233 90.0
66 Renaissance 6.366714 89.0
18 Civilization 6.356147 115.0
69 Spies/Secret Agents 6.355862 46.0
16 City Building 6.319009 171.0
23 Economic 6.309115 395.0
49 Mythology 6.300905 83.0
71 Territory Building 6.295276 135.0
53 Novel-based 6.265824 87.0
31 Fighting 6.260619 380.0
58 Political 6.259269 116.0
2 Adventure 6.238299 223.0
33 Horror 6.234493 122.0
68 Space Exploration 6.234424 61.0
42 Medieval 6.229393 217.0
47 Murder/Mystery 6.222762 57.0

The lowest-rated categories are Music, Math, and Game System, though each covers fewer than ten games. There are also quite a few war-related categories in the bottom 20: American Civil War, Korean War, World War I, and World War II. Given how small some of these samples are, I re-rank with a minimum game count after the table below.

geek_rating_by_category.sort_values('mean_geek_rating', ascending=True).head(20)
category mean_geek_rating nbr_games
48 Music 5.759700 5.0
38 Math 5.807560 8.0
32 Game System 5.820436 9.0
13 Book 5.865290 20.0
4 American Civil War 5.872653 42.0
36 Korean War 5.874700 3.0
61 Print & Play 5.879822 58.0
50 Napoleonic 5.886461 47.0
15 Children's Game 5.890584 105.0
77 Vietnam War 5.915079 12.0
45 Modern Warfare 5.917399 61.0
75 Trivia 5.922166 24.0
81 World War II 5.922496 221.0
80 World War I 5.923018 45.0
43 Memory 5.932905 38.0
1 Action / Dexterity 5.934145 134.0
78 Wargame 5.975838 635.0
79 Word Game 5.977960 44.0
54 Number 5.983398 20.0
11 Aviation / Flight 5.983438 72.0
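As a sanity check on those extremes, here is a quick re-ranking that requires a minimum number of games per category. The cutoff of 30 is an arbitrary illustrative choice, not anything from the competition:

# require a minimum sample size before ranking categories by mean rating
min_games = 30
(
    geek_rating_by_category
    .loc[geek_rating_by_category['nbr_games'] >= min_games, :]
    .sort_values('mean_geek_rating', ascending=False)
    .head(10)
)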

Prepare features for modeling

df_train['owned_log'] = np.log(df_train['owned'])
df_test['owned_log'] = np.log(df_test['owned'])

df_train['num_votes_log'] = np.log(df_train['num_votes'])
df_test['num_votes_log'] = np.log(df_test['num_votes'])
# sentinel so we can split train and test back apart after concatenating
df_test['geek_rating'] = -1

df = pd.concat([df_train, df_test])

cat_col_names = [i for i in df.columns.values if i.startswith('cat')]
df['category_list'] = df[cat_col_names].apply(lambda x: ', '.join(x[x.notnull()]), axis=1)
category_dummies = df['category_list'].str.get_dummies(sep=', ')

df = pd.concat([df, category_dummies], axis=1)
# .copy() avoids SettingWithCopyWarning when these frames are modified below
df_train = df.loc[df['geek_rating'] != -1, :].copy()
df_test = df.loc[df['geek_rating'] == -1, :].copy()
num_feats = [
               'min_players', 
               'max_players', 
               'avg_time', 
               'min_time',
               'max_time', 
               'year', 
               'num_votes_log', 
               'age',
               'owned_log'
]

target = 'geek_rating'
df_train['category2'] = df_train['category2'].fillna('none')
df_test['category2'] = df_test['category2'].fillna('none')

# list cat vars
categorical_variables  = ['category1','category2', 'mechanic', 'designer']
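Filling category2 matters because CatBoost raises an error when a categorical feature contains missing values (missing numeric values it handles natively). A quick check that the categorical features are clean:

# CatBoost rejects NaN in categorical features, so confirm nothing is missing
df_train.loc[:, categorical_variables].isnull().sum()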

I use a CatBoostRegressor to model geek rating with the defined predictor variables, evaluated with 5-fold cross-validation. CatBoost handles the categorical columns natively via cat_features, so the one-hot category dummies built earlier are left as an extension for the reader. I also tested XGBoost, but CatBoost performs better on this dataset.

folds = 5
kf = KFold(folds)

df_preds = pd.DataFrame()

for train_idx, test_idx in tqdm(kf.split(df_train), total=folds):
  train_data = df_train.iloc[train_idx]
  test_data = df_train.iloc[test_idx].copy()

  proportion_clf = CatBoostRegressor(cat_features=categorical_variables,
                                     logging_level="Silent",
                                     loss_function="MAE")

  proportion_clf.fit(train_data.loc[:, num_feats + categorical_variables], train_data[target])

  # collect out-of-fold predictions for an honest CV estimate
  test_data['preds'] = proportion_clf.predict(test_data.loc[:, num_feats + categorical_variables])
  # DataFrame.append was removed in pandas 2.0; use pd.concat instead
  df_preds = pd.concat([df_preds, test_data])

The out-of-fold RMSE (root mean squared error) is 0.1713, which translates to a private leaderboard score of 0.16855 on Kaggle. This places us in 5th place with minimal feature engineering and hyperparameter tuning.

rmse = np.sqrt(mean_squared_error(df_preds[target], df_preds['preds']))
rmse
0.17132000754051915
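To get a leaderboard score, the test set needs predictions. A minimal sketch of the submission step, retraining on the full training set and assuming the sample submission uses game_id and geek_rating columns (check the competition's sample_submission.csv):

# retrain on all training rows with the same features and settings
final_model = CatBoostRegressor(cat_features=categorical_variables,
                                logging_level="Silent",
                                loss_function="MAE")
final_model.fit(df_train.loc[:, num_feats + categorical_variables], df_train[target])

# assumed submission format: game_id plus the predicted geek_rating
submission = pd.DataFrame({
    'game_id': df_test['game_id'],
    'geek_rating': final_model.predict(df_test.loc[:, num_feats + categorical_variables]),
})
submission.to_csv('submission.csv', index=False)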

The variable importance plot (taken from the final fold's model) shows that num_votes_log and owned_log are the most helpful features. Year is another important feature.

importances = proportion_clf.feature_importances_
feature_names = np.array(num_feats + categorical_variables)
idxs = np.argsort(importances)

plt.figure(figsize=(10, 6))
plt.title('Feature Importances')
plt.barh(range(len(idxs)), importances[idxs], align='center')
plt.yticks(range(len(idxs)), feature_names[idxs])
plt.xlabel('Feature Importance')
plt.show()

This proved to be a fun and interesting dataset. Smaller datasets are sometimes less interesting, but this one is an exception. In the future I would spend time on:

  1. More in-depth data exploration
  2. CatBoost hyperparameter tuning (a minimal sketch follows this list)
  3. Modeling more of the category columns
  4. Possibly scraping the board game website for more predictors
  5. Learning more about board games to see what new insights I can find in the data
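For item 2, here is a minimal sketch using CatBoost's built-in grid_search helper; the grid values below are illustrative starting points, not tuned choices:

tuning_model = CatBoostRegressor(cat_features=categorical_variables,
                                 logging_level="Silent",
                                 loss_function="MAE")

# illustrative grid; widen or narrow based on the CV results
param_grid = {
    'depth': [4, 6, 8],
    'learning_rate': [0.03, 0.1],
    'l2_leaf_reg': [1, 3, 9],
}

# grid_search cross-validates each combination and refits the best one
results = tuning_model.grid_search(
    param_grid,
    X=df_train.loc[:, num_feats + categorical_variables],
    y=df_train[target],
    cv=3,
    verbose=False,
)
print(results['params'])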

Thanks for reading and reach out if you want to chat about the analysis.