King County Homes Challenge. Exploratory Data Analysis

The King County home price prediction challenge is an excellent dataset for trying out and experimenting with various regression models. As we’ll see in the following post on Moscow flats, the modeler faces similar challenges: skewed data and outliers, highly correlated predictors, heteroskedasticity, and a geographical correlation structure. Ignoring any one of these may lead to underperforming models, so in this post we’re going to carefully explore the dataset, which should inform the choice of modeling strategy.

Getting the data

First, we’ll use a Python-based command line tool (pip install kaggle) to automatically download the data from the Kaggle competition. An access token is needed in the .kaggle directory, whose location differs slightly between Windows and Linux. For now these commands will do; we don’t need anything fancier.

  • kaggle datasets download -d harlfoxem/housesalesprediction
  • cp .kaggle\datasets\harlfoxem\housesalesprediction\kc_house_data.csv .
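Depending on the CLI version, the download may land as a zip archive rather than an extracted .csv; if so, a small R snippet takes care of it (the archive name below is assumed from the dataset slug, adjust if the CLI saved it elsewhere):

# extract the csv into the data/ directory used later by read_csv()
# (archive name assumed from the dataset slug)
unzip("housesalesprediction.zip", files = "kc_house_data.csv", exdir = "data")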
library(tidyverse) # data processing and visualization
library(ggthemes)  # themes for ggplot2
library(ggmap)     # ggplot for geographical data
library(GGally)    # ggpairs()
library(RPostgreSQL) # if you want to play around with a database
library(knitr)     # tables and interactive documents
library(corrplot)  # visualization of correlation matrices
library(OutliersO3) # multidimensional outlier detection
data <- read_csv("data/kc_house_data.csv")
knitr::kable(head(data %>% select(
    price, bedrooms, bathrooms, sqft_living, sqft_lot, grade, zipcode, floors
)))
price bedrooms bathrooms sqft_living sqft_lot grade zipcode floors
221900 3 1.00 1180 5650 7 98178 1
538000 3 2.25 2570 7242 7 98125 2
180000 2 1.00 770 10000 6 98028 1
604000 4 3.00 1960 5000 7 98136 1
510000 3 2.00 1680 8080 8 98074 1
1225000 4 4.50 5420 101930 11 98053 1

A side note on working with databases like MySQL and PostgreSQL: I haven’t explained much in previous blog posts about how to interact with databases, which is odd, since I haven’t read data from a .csv at work in almost half a year. I have mostly used MySQL and BigQuery, but as you’ll see, transitioning to PostgreSQL takes almost no time once you get past a few setup hiccups.

Now, to store the data we need to set up PostgreSQL and create a database. We’ll use the R package RPostgreSQL to avoid, as much as possible, writing insert statements by hand. Also, don’t forget to add ../PostgreSQL/9.5/bin and ../PostgreSQL/9.5/lib to the PATH system variable. Then we make sure the PostgreSQL server is running and create a technical user; the credentials will live in an .Renviron file. The following line uses the createuser.exe utility from PostgreSQL’s /bin directory to create a new user (openpg) with the permission to create databases.

createuser.exe --createdb --username postgres --no-createrole --pwprompt openpg

The .Renviron file is of course hidden, but here is how it looks:

DB_PASS="kingcounty"
DB_USER="openpg"
DB_NAME="postgres"

# load the environment variables
readRenviron(".Renviron")

# establish connection to the database
drv <- dbDriver("PostgreSQL")
conn <- dbConnect(drv, 
  dbname = Sys.getenv("DB_NAME"), 
  host = "localhost", port = 5432, 
  user = Sys.getenv("DB_USER"), 
  password = Sys.getenv("DB_PASS")
  )

# Load step: write the data into a table
dbWriteTable(conn, name = "homes", value = data, append = FALSE, row.names = FALSE)

# retrieve the data to check if it worked
data_pg <- dbGetQuery(conn, "select * from homes")
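While we are at it, a couple of optional housekeeping lines: an example of pushing a small aggregation down to PostgreSQL, and closing the connection once we are done (standard RPostgreSQL/DBI calls):

# example of an aggregation computed inside PostgreSQL instead of R
dbGetQuery(conn, "select zipcode, avg(price) as avg_price from homes group by zipcode limit 5")

# close the connection and unload the driver when finished
dbDisconnect(conn)
dbUnloadDriver(drv)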

Exploratory Data Analysis

As always, before jumping into machine learning it is extremely useful to get a feel for the data through visualization and simple models. Some people prefer to glance at paired scatterplots and histograms and then stack 13 models through scikit-learn. I’m an advocate of a more careful and thoughtful approach, with good reasons behind each of the many modeling decisions we have to make.

A thoughtful problem formulation and data representation take us quite far toward solving the problem.

That caveat aside, ensembles are powerful. Even though I’m more of a data miner than a statistician, I have to recognize that large amounts of data do not automatically solve the "econometric nightmares" we have to deal with in practice. This is where statistical inference is key, acting as a bridge between machine learning and statistics. It is becoming increasingly important: see Efron and Hastie’s brilliant Computer Age Statistical Inference.

A variant of this challenge asked whether it makes sense to talk about price per bedroom or per bathroom, and whether we could “segment” the price by other features. I don’t think segmenting is the right term, or even the right problem formulation; “segmenting the price” doesn’t make sense without more context. What it really amounts to is a simple way of getting comparable prices by “controlling” for obvious factors. The first thing to consider is price per living area, then price per total area; after that, the number of floors, bedrooms and bathrooms becomes interesting. Note that such per-feature pricing only really acquires meaning inside a model (e.g. linear regression), where the parameters act as constants reconciling the different units of measure.

df <- data %>% 
  mutate(price_bed  = price / bedrooms, 
         price_bath = price / bathrooms) %>%
  # of course, we eliminate cases with 0 bedrooms or bathrooms
  filter(price_bed != Inf & price_bath != Inf)
df %>%
  select(price_bed, price_bath) %>%
  gather(key = variable, value = value) %>%
  ggplot(aes(x = value)) + 
  geom_histogram(aes(fill = variable), bins = 70, alpha = 0.5, 
                 color = NA, position = "identity") + 
  # the following is very quick and dirty, can be easily generalized
  geom_point(data = NULL, y = Inf, x = mean(df$price_bed), 
             color = "darkred", pch = 25, size = 2) +
  geom_vline(xintercept = mean(df$price_bed), 
             color = "darkred", lty = 3) + 
  geom_point(data = NULL, y = Inf, x = mean(df$price_bath), 
             color = "steelblue", pch = 25, size = 2) +
  geom_vline(xintercept = mean(df$price_bath), 
             color = "steelblue", lty = 3) + 
  geom_rug(color = "grey70") +
  theme_tufte() + 
  labs(x = "Price ($)", y = "Frequency", 
       title = "Distribution of House Prices in King County", 
       subtitle = "Per bedroom and bathroom > Doesn't make much sense") + 
  scale_color_ptol() + 
  scale_fill_ptol() + 
  theme(
    legend.position = c(0.7, 0.5), 
    axis.text.x = element_text(angle = 30, hjust = 1)) + 
  scale_x_continuous(breaks = c(mean(df$price_bath), 
                                mean(df$price_bed), 
                                5e5, 1e6, 1.5e6)) 

In the context of the problem, i.e. predicting house prices, using these ratios for feature engineering is not advisable, because they leak target information and thus contaminate the test dataset. It makes more sense to look at correlations and group differences between the target and various attributes (features), which we will do next in the EDA. Note that libraries like CatBoost have a very smart bag of tricks for dealing with categorical variables while avoiding leakage: in a nutshell, they don’t use the label of a particular observation to derive its new features, but instead leverage multiple splits of the data during model training.
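To make that idea concrete, here is a minimal sketch of out-of-fold mean (target) encoding for the ZIP code. This is not CatBoost’s exact ordered-statistics scheme, just an illustration of the principle: an observation’s own label never enters its own encoded feature.

# out-of-fold target encoding (illustrative sketch, not CatBoost's algorithm):
# each observation's zipcode is encoded with the mean price of the OTHER folds
set.seed(42)
df_te <- df %>% mutate(fold = sample(1:5, n(), replace = TRUE))

fold_stats <- df_te %>%
  group_by(fold, zipcode) %>%
  summarise(sum_price = sum(price), n_obs = n()) %>%
  ungroup()

encode_fold <- function(k) {
  fold_stats %>%
    filter(fold != k) %>%                 # leave the current fold out
    group_by(zipcode) %>%
    summarise(zip_price_enc = sum(sum_price) / sum(n_obs)) %>%
    ungroup() %>%
    mutate(fold = k)
}

zip_encoding <- purrr::map_dfr(1:5, encode_fold)
df_te <- df_te %>% left_join(zip_encoding, by = c("fold", "zipcode"))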

df <- df %>%
  mutate(log_price = log(price)) 

df %>%
  ggplot(aes(x = log_price)) + 
  geom_histogram(aes(y = ..density..), bins = 70, 
                 color = "grey70", fill = "grey80") + 
  geom_point(data = NULL, y = 0, x = mean(df$log_price), 
             color = "darkred", pch = 17, size = 2) +
  geom_vline(xintercept = mean(df$log_price), 
             color = "darkred", lty = 3) + 
  geom_rug(color = "grey70") +
  stat_function(fun = dnorm, args = list(mean = mean(df$log_price), 
                                         sd = sd(df$log_price))) + 
  theme_tufte() + 
  labs(x = "Price ($)", y = "Frequency", 
       title = "Distribution of House Prices in King County at Log-Scale")

A good piece of advice for such datasets is to take the log of the price, which makes the data better behaved for a large class of models. One of the main problems in this dataset is clearly heteroskedasticity (i.e. the variance of price increases with area). Taking the logarithm alone is not enough, but it will help further into the analysis.
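A quick visual check of that claim (a rough sketch, nothing formal): plot price against living area on the raw scale and on the log-log scale and compare the spread.

# heteroskedasticity check: the spread of price grows with living area on the
# raw scale, while the log-log scale looks much more homogeneous
p <- data %>% 
  sample_n(3000) %>%
  ggplot(aes(x = sqft_living, y = price)) +
  geom_point(alpha = 0.3, color = "grey40") +
  geom_smooth(method = "lm", color = "darkred") +
  theme_tufte()

p + labs(title = "Price vs living area, raw scale")
p + scale_x_log10() + scale_y_log10() + 
  labs(title = "Price vs living area, log-log scale")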

data[, 3:21] %>% 
  dplyr::select(-zipcode) %>% cor() %>% 
  corrplot::corrplot(type = "lower", order = "hclust", tl.col = "grey30", tl.cex = 0.9)

Since the data covers twelve months spread across two calendar years, studying the confounding induced by seasonality is a bit outside our scope here, although it often messes things up in practice.
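Still, a quick glance at the monthly pattern costs nothing (this assumes read_csv parsed the ISO-8601 date column as a datetime; if it came in as character, lubridate::ymd_hms() will fix that):

# median sale price by month of sale (date assumed to be a parsed datetime)
data %>%
  mutate(sale_month = format(date, "%Y-%m")) %>%
  group_by(sale_month) %>%
  summarise(n_sales = n(), median_price = median(price)) %>%
  ggplot(aes(x = sale_month, y = median_price, group = 1)) +
  geom_line(color = "steelblue") +
  geom_point(color = "steelblue") +
  theme_tufte() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Month of sale", y = "Median price ($)",
       title = "Median sale price by month")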

# Don't use cut() here, because of outliers and the skewed distribution!
df <- data %>% 
  mutate(price_bucket = ifelse(price < 1e5, "[0, 1e5)", 
   ifelse(price >= 1e5 & price < 3e5, "[1e5, 3e5)", 
    ifelse(price >= 3e5 & price < 8e5, "[3e5, 8e5)", "[8e5, +)"))))

set.seed(456) # for reproducibility of the sample displayed
# note that qmplot needs an internet connection to fetch the map tiles
qmplot(long, lat, maptype = "toner-lite", 
       data = df %>% sample_n(3000)) +
  geom_point(aes(color = price_bucket), alpha = 0.5) + 
  scale_color_ptol() + 
  labs(title = "Sample of 3000 homes by price sold") + 
  guides(color = guide_legend(title = "Price Range"))

In this case you don’t need anything more original than scatterplots on a map to get a sense of a good modeling strategy. They show that the ZIP code is a very promising predictor, which absolutely makes sense, but if we simply dummy-encode it or model each ZIP code separately, we won’t have many degrees of freedom left once the roughly 20,000 data points are split across all the areas. The good news is that this leads us to Bayesian hierarchical regression, which looks like the perfect model for such a dataset.

Alternative approaches like CatBoost can also take advantage of such data because of the way they treat categorical features. Bayesian Additive Regression Trees (BART), if computationally feasible, are an interesting option that is more robust than Random Forests. And of course we shouldn’t forget XGBoost plus Bayesian optimization as the industry standard, so checking how it performs is a no-brainer.

We have moved quite far toward advanced models, but one should first fit the most naive linear regression to establish a baseline. Its assumptions will of course be violated, but being aware of that, we can build the model up incrementally until we reach the hierarchical Bayesian representation.
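For reference, such a naive baseline could be as simple as the following sketch (fit here on the full data for illustration; a proper baseline would use only the training split):

# naive baseline: OLS on log price with a handful of obvious predictors
baseline <- lm(log(price) ~ log(sqft_living) + bedrooms + bathrooms + grade + waterfront,
               data = data)
summary(baseline)$r.squared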

qmplot(long, lat, data = df, maptype = "toner-lite") +
 geom_point(aes(color = as.factor(zipcode)), 
            alpha = 0.5, show.legend = FALSE) + 
 labs(title = "ZIP Codes")
## Using zoom = 10...

qmplot(long, lat, data = data, geom = "blank",
       maptype = "toner-lite") +
  stat_density_2d(aes(fill = ..level..), 
                  geom = "polygon", alpha = .3, color = NA) +
  scale_fill_gradient2(low = "steelblue", mid = "seagreen4", 
                        high = "indianred", midpoint = 7) +
  labs(title = "Estimated home density")
## Using zoom = 10...

It looks like there is enough data at the ZIP-code level for a model with weakly informative priors to be reasonable. Moreover, thanks to partial pooling, the model will “borrow” information from higher levels of the hierarchy in those cases where there is not enough data to estimate the parameters.
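As a preview of what that looks like, here is a minimal varying-intercept sketch using lme4 (not loaded above) as a quick frequentist stand-in for the fully Bayesian model:

# varying intercept per ZIP code: partial pooling shrinks the estimates for
# sparsely represented ZIP codes towards the overall mean
fit_zip <- lme4::lmer(log(price) ~ log(sqft_living) + (1 | zipcode),
                      data = data %>% mutate(zipcode = as.factor(zipcode)))
summary(fit_zip)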

data %>% group_by(zipcode) %>%
  summarise(nr_houses = n()) %>% 
  arrange(-nr_houses) %>% 
  head() %>%
  knitr::kable()
zipcode nr_houses
98103 602
98038 590
98115 583
98052 574
98117 553
98042 548

Notice that the log-transform of the area variables looks beneficial; it only remains to check this formally via model comparison and selection, and to see whether it improves the (validation) error metrics.
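A quick and informal version of that check (a sketch; the response is the log price in both models, so AIC is directly comparable, but a cross-validated comparison would be more convincing):

# does log-transforming the living area help, keeping the same response?
m_raw_area <- lm(log(price) ~ sqft_living, data = data)
m_log_area <- lm(log(price) ~ log(sqft_living), data = data)
AIC(m_raw_area, m_log_area)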

areas <- data %>%
  mutate(sqft_diff = sqft_living - sqft_living15) %>%
  mutate(price = log(price), sqft_living = log(sqft_living), 
         sqft_lot = log(sqft_lot), sqft_above = log(sqft_above)) %>%
  mutate(sqft_basement = ifelse(sqft_basement == 0, 0, log(sqft_basement))) %>%
  dplyr::select(price, sqft_living, sqft_above, 
                sqft_basement, sqft_lot, sqft_diff, floors) %>%
  mutate(more_floors = as.factor(floors > 1)) %>%
    sample_n(4000)
areas %>% ggpairs(columns = 1:6, aes(color = more_floors)) + 
  theme_tufte()

data %>% 
  mutate(log_price = log(price)) %>%
  mutate(floors = ifelse(floors > 2, 2, floors)) %>%
  group_by(zipcode, floors) %>% 
  summarise(mean = mean(log_price), 
            lower = mean(log_price) - 2 * sd(log_price) / sqrt(n()), 
            upper = mean(log_price) + 2 * sd(log_price) / sqrt(n())) %>%
  ggplot(aes(x = reorder(as.factor(zipcode), mean), y = mean, 
             color = as.factor(floors))) + 
  geom_point() + 
  coord_flip() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0) +
  theme_minimal() + 
  labs(x = "ZIP Code", y = "Average Log-Price", 
       title = "Confidence intervals for prices by ZIP code", 
       subtitle = "As a motivation for hierarchical approach") + 
  scale_color_fivethirtyeight() + 
  guides(color = guide_legend(title = "Nr. Floors"))

It is also interesting to look at price per living square foot.

data %>% 
  mutate(log_price = log(price / sqft_living)) %>%
  mutate(floors = ifelse(floors > 2, 2, floors)) %>%
  group_by(zipcode, floors) %>% 
  summarise(mean = mean(log_price), 
            lower = mean(log_price) - 2 * sd(log_price) / sqrt(n()), 
            upper = mean(log_price) + 2 * sd(log_price) / sqrt(n())) %>%
  ggplot(aes(x = reorder(as.factor(zipcode), mean), y = mean, 
             color = as.factor(floors))) + 
  geom_point() + 
  coord_flip() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0) +
  theme_minimal() + 
  labs(x = "ZIP Code", y = "Average Log-Price") + 
  scale_color_fivethirtyeight()

Similarly, one can look at configurations involving the number of bathrooms and bedrooms. There are quite a lot of insights to be uncovered with these exploratory methods, from gaining an intuition for the data and the domain’s particularities to suggesting potential modeling strategies. The last thing to do before modeling is to think carefully about outliers and what to do with them. The answer is that it depends: some methods are robust to them, while most are not. Another aspect is what kind of information the outliers encode. This looks like a classical case where it is better to remove the most extreme ones.
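Before the bathroom and grade heatmaps below, a tiny aside on that robustness point: compare plain OLS with a robust M-estimator (MASS::rlm, not loaded above), which down-weights extreme observations instead of letting them dominate the squared-error loss.

# OLS vs a robust M-estimator on the same simple model
m_ols <- lm(log(price) ~ log(sqft_living), data = data)
m_rob <- MASS::rlm(log(price) ~ log(sqft_living), data = data)
cbind(ols = coef(m_ols), robust = coef(m_rob))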

table(data$bathrooms, data$bedrooms) %>% 
  as.data.frame() %>% 
  filter(Freq != 0) %>%
  ggplot(aes(x = Var1, y = Var2)) + 
  geom_tile(aes(fill = Freq)) + 
  theme_tufte() + 
  coord_flip()
table(data$bathrooms, data$grade) %>% 
  as.data.frame() %>% 
  filter(Freq != 0) %>%
  ggplot(aes(x = Var1, y = Var2)) + 
  geom_tile(aes(fill = Freq)) + 
  theme_tufte() + 
  coord_flip() + 
  scale_fill_gradient(low = "steelblue", high = "gold") + 
  labs(x = "Nr. of Bathrooms", y = "Grade", 
       title = "How a home with a given number of bathrooms score?", 
       subtitle = "According to official methodology")

Outlier detection and Modeling

One could of course use a classical criterion like the IQR rule on the log price for outlier selection (which, outliers aside, is pretty close to normally distributed), but that seems rather ad hoc. It is much more interesting to consider multivariate outliers across combinations of dimensions, not just single dimensions.
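For comparison, the classical univariate IQR rule on the log price would look like this (the ad hoc route mentioned above):

# univariate IQR rule: count points outside the 1.5 * IQR fences on log price
lp <- log(data$price)
fences <- quantile(lp, c(0.25, 0.75)) + c(-1.5, 1.5) * IQR(lp)
sum(lp < fences[1] | lp > fences[2])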

set.seed(123)
idx_train <- sample(x = 1:nrow(data), 
                    size = 0.85 * nrow(data), 
                    replace = FALSE)
idx_test  <- setdiff(1:nrow(data), idx_train)

data_train <- data[idx_train, ] %>% 
  select(-c(id, date, lat, long)) %>% 
  mutate(zipcode = as.factor(zipcode))

data_test  <- data[idx_test, ] %>% 
  select(-c(id, date, lat, long)) %>% 
  mutate(zipcode = as.factor(zipcode))
df_outlier_features <- data_train %>% select(price, sqft_living, bedrooms) %>% 
  mutate(price = log(price), sqft_living = log(sqft_living))
# it turns out that, just as in Python, there is an amazing R package for almost everything
s  <- O3prep(df_outlier_features, tols = c(0.01, 0.05), boxplotLimits = c(6, 12))
s1 <- O3plotT(s)
s1$gO3

# TODO: find a programmatic way to extract the outlier indices from the O3 output,
# as hardcoding them like this is not acceptable :)
idx_outlier <- c(3918, 10953, 12994, 14662, 14789, 14887, 15155, 
  2563, 5749, 6109, 6186, 9641, 5544, 10200, 16125)

data_train <- data_train[-idx_outlier, ] %>% 
  filter(price < 4e6) # remove the most extreme outliers

# notice that even in the tail, these data points contain quite a bit of information,
# so I would not be too quick to remove them
df <- data_train %>%
  mutate(log_price = log(price))
df %>%  ggplot(aes(x = log_price)) + 
  geom_histogram(aes(y = ..density..), bins = 70, 
                 color = "grey70", fill = "grey80") + 
  geom_point(data = NULL, y = 0, x = mean(df$log_price), 
             color = "darkred", pch = 17, size = 2) +
  geom_vline(xintercept = mean(df$log_price), 
             color = "darkred", lty = 3) + 
  geom_rug(color = "grey70") +
  stat_function(fun = dnorm, args = list(mean = mean(df$log_price), 
                                         sd = sd(df$log_price))) + 
  theme_tufte() + 
  labs(x = "Price ($)", y = "Frequency", 
       title = "Distribution of House Prices in King County at Log-Scale", 
       subtitle = "With removed outliers")

To wrap up the discussion: we investigated the potential pitfalls of predicting house prices and the particularities of such data, like skewed distributions, heteroskedasticity, strongly correlated predictors and outliers. Moreover, there is very interesting research that tries to exploit the spatial structure directly. Multilevel modeling is a very natural choice for this problem, and the exploratory analysis gives us a good justification for why it can work well.
