There are over 1.8 million parcels in Cook County, Illinois [1]. Each of these parcels is assessed every three years using three to five years of prior sales information [1]. For this analysis, we will look at single-family home assessments within three of the 36 political townships: Lemont (19), Palos (30), and Orland (28). Using data provided on the Cook County data portal, we will create a model predicting the sale price of a property.
Predicting property sale prices poses several technical challenges. 1) The dataset exhibits a high degree of multicollinearity, because housing attributes tend to be associated with one another. 2) Some aspects of the home-buying process may not be used by the assessor as part of an assessment. Given these challenges, we will do our best to evaluate the applicability of several models.
Inclusion Criteria
We have 4 inclusion criteria:
Year - 2021, 2022, 2023
Class - 202, 203, 204, 205, 206, 207, 208, 209, 210, 234, 278. These classes correspond to single-family homes.
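# Sales analysis table: join each sale to its assessment, characteristics,
# and parcel-universe records for the same PIN and year.
sfh2123_sales_AT <- sfh2123_sales_dt %>%
  lazy_dt() %>%
  left_join(sfh2123_assessment_dt,
            by = c("pin", "township_code", "class", "year" = "tax_year")) %>%
  left_join(sfh2123_characteristics_dt,
            by = c("pin", "township_code", "class", "year" = "tax_year")) %>%
  left_join(sfh2123_universe_dt,
            by = c("pin", "neighborhood_code", "township_code",
                   "township_name", "class", "year" = "tax_year")) %>%
  # Scale the certified value by 10 before comparing it to the sale price
  # (this assumes certified_tot is an assessed value at roughly 10% of market value).
  mutate(sale_price_ratio = 10 * certified_tot / sale_price) %>%
  collect()

# Assessment analysis table: all assessed parcels, whether or not they sold.
sfh2123_assessment_AT <- sfh2123_assessment_dt %>%
  lazy_dt() %>%
  left_join(sfh2123_characteristics_dt,
            by = c("pin", "township_code", "class", "tax_year")) %>%
  left_join(sfh2123_universe_dt,
            by = c("pin", "township_code", "township_name", "class", "tax_year")) %>%
  collect()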
Check For Missingness
library(ggplot2)

sfh2123_sales_AT %>%
  mutate(id = row_number()) %>%
  tidyr::gather(-id, key = "key", value = "val") %>%
  mutate(isna = is.na(val)) %>%
  ggplot(aes(key, id, fill = isna)) +
  geom_raster(alpha = 0.8) +
  scale_fill_manual(
    name = "",
    values = c('steelblue', 'tomato3'),
    labels = c("Present", "Missing")
  ) +
  labs(
    x = "Variable",
    y = "Row Number",
    title = "Missing Values for Housing Sales"
  ) +
  coord_flip()
Quite a few measures are either entirely missing or have many missing values. This is acceptable: to reduce complexity and improve interpretability, our model will use only a subset of the variables.
There seems to be a high level of correlation between many of the housing characteristics. Perhaps using a model that takes multicollinearity into account would be a good idea.
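As a numeric companion to the missingness plot, the share of missing values per column can be tabulated directly. This is a minimal sketch on a made-up stand-in for `sfh2123_sales_AT`; the toy values and the 30% cutoff are assumptions for illustration, not choices taken from the analysis.

```r
# Toy stand-in for sfh2123_sales_AT; the values are invented for illustration.
toy_sales <- data.frame(
  sale_price    = c(250000, 310000, NA, 420000),
  year_built    = c(1975, NA, NA, 2004),
  building_sqft = c(1500, 1800, 2100, 2600)
)

# Share of missing values in each column, sorted from most to least missing.
missing_share <- sort(colMeans(is.na(toy_sales)), decreasing = TRUE)
print(missing_share)

# Hypothetical rule: keep only columns with at most 30% missing values.
keep_cols <- names(missing_share)[missing_share <= 0.30]
print(keep_cols)
```

The same two lines could be pointed at the real table to decide which measures are complete enough to consider as predictors.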
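One way to quantify this is a correlation matrix together with variance inflation factors (VIFs). The sketch below uses simulated data, not the Cook County data; the column names mirror the characteristics used in this analysis, but the values and the relationship between them are assumptions made up for the example.

```r
set.seed(1)
n <- 200

# Simulated characteristics: room count is built to track building size closely,
# while land size is generated independently.
building_sqft <- rnorm(n, mean = 2000, sd = 500)
num_rooms     <- round(building_sqft / 300 + rnorm(n, sd = 0.5))
land_sqft     <- rnorm(n, mean = 9000, sd = 2000)
toy <- data.frame(building_sqft, num_rooms, land_sqft)

# Pairwise Pearson correlations between the predictors.
cor_mat <- cor(toy)
print(round(cor_mat, 2))

# VIF for each predictor: regress it on the other predictors and
# compute 1 / (1 - R^2). Larger values indicate stronger collinearity.
vif <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  fit <- lm(reformulate(others, response = v), data = toy)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif, 2))
```

A VIF near 1 suggests a predictor is nearly independent of the others; common rules of thumb treat values above roughly 5 to 10 as a sign of problematic collinearity.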
library(scales)

sfh2123_sales_AT %>%
  ggplot(aes(x = year_built, y = sale_price)) +
  geom_point(size = 0.1) +
  geom_smooth(method = "gam", color = "springgreen4") +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Year Built \nand Sale Price in Cook County, IL",
    x = "Year Built",
    y = "Sale Price"
  ) +
  theme(plot.title = element_text(hjust = 0.5))
More recently built houses seem to sell for more.
sfh2123_sales_AT %>%
  ggplot(aes(x = building_sqft, y = sale_price)) +
  geom_point(size = 0.1) +
  geom_smooth(method = "gam", color = "springgreen4") +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Building Square Footage \nand Sale Price in Cook County, IL",
    x = "Building Square Footage",
    y = "Sale Price"
  ) +
  theme(plot.title = element_text(hjust = 0.5))
Houses with more building square footage tend to sell for more.
sfh2123_sales_AT %>%
  ggplot(aes(x = land_sqft, y = sale_price)) +
  geom_point(size = 0.1) +
  geom_smooth(method = "gam", color = "springgreen4") +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Land Square Footage \nand Sale Price in Cook County, IL",
    x = "Land Square Footage",
    y = "Sale Price"
  ) +
  theme(plot.title = element_text(hjust = 0.5))
The relationship here isn’t entirely clear, but it seems that more land sells for more, up to a point.
sfh2123_sales_AT %>%
  ggplot() +
  geom_boxplot(aes(x = num_bedrooms, y = sale_price, fill = factor(num_bedrooms))) +
  scale_x_continuous(n.breaks = 9) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Number of Bedrooms \nand Sale Price in Cook County, IL",
    x = "Number of Bedrooms",
    y = "Sale Price"
  ) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Houses with more bedrooms tend to sell for more.
sfh2123_sales_AT %>%
  ggplot() +
  geom_boxplot(aes(x = num_rooms, y = sale_price, fill = factor(num_rooms))) +
  geom_smooth(aes(x = num_rooms, y = sale_price), method = "gam", color = "black") +
  scale_x_continuous(n.breaks = 20) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Number of Rooms \nand Sale Price in Cook County, IL",
    x = "Number of Rooms",
    y = "Sale Price"
  ) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Houses with more rooms tend to sell for more.
sfh2123_sales_AT %>%
  ggplot() +
  geom_boxplot(aes(
    x = num_full_baths + num_half_baths * 0.5,
    y = sale_price,
    fill = factor(num_full_baths + num_half_baths * 0.5)
  )) +
  geom_smooth(
    aes(x = num_full_baths + num_half_baths * 0.5, y = sale_price),
    method = "gam", color = "black"
  ) +
  scale_x_continuous(n.breaks = 20) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Total Number of Baths \nand Sale Price in Cook County, IL",
    x = "Total Number of Baths",
    y = "Sale Price"
  ) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Houses with more bathrooms tend to sell for more.
sfh2123_sales_AT %>%
  ggplot() +
  geom_boxplot(aes(
    x = cmap_walkability_total_score,
    y = sale_price,
    fill = factor(cmap_walkability_total_score)
  )) +
  geom_smooth(
    aes(x = cmap_walkability_total_score, y = sale_price),
    method = "gam", color = "black"
  ) +
  scale_x_continuous(n.breaks = 20) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home CMAP Walkability and \nSale Price in Cook County, IL",
    x = "Total Walkability Score",
    y = "Sale Price"
  ) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Houses in less walkable communities tend to sell for more. Although individuals might want to live in more walkable communities, the kind of large single-family homes that tend to sell for more actually contribute to reducing walkability, which is most likely why we see a downward trend.
Mapping Our Outcome Variable
Let us look at how our outcome variable is distributed spatially. Clustering might indicate that a spatial modeling technique is required.
It seems that similarly priced homes cluster within our three townships. Lemont has the highest median sale price, while Palos and Orland have similar median sale prices. Perhaps a spatial modeling technique such as a spatial Bayesian or conditional autoregressive (CAR) model is warranted.
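The visual impression of clustering can be checked with a global Moran's I statistic. The sketch below hand-rolls the statistic in base R on simulated coordinates and prices; the data, the 0.2 distance band, and the helper name `morans_i` are all assumptions for illustration and are not part of the original analysis. In practice a tested implementation such as the one in the `spdep` package would be preferable.

```r
set.seed(42)
n <- 100

# Simulated home locations on a unit square (not real parcel coordinates).
x_coord <- runif(n)
y_coord <- runif(n)

# Simulated prices that rise from west to east, so neighbors have similar prices.
price <- 300000 + 200000 * x_coord + rnorm(n, sd = 20000)

# Binary spatial weights: two homes are neighbors if they lie within 0.2 units.
d <- as.matrix(dist(cbind(x_coord, y_coord)))
w <- (d < 0.2) * 1
diag(w) <- 0  # a home is not its own neighbor

# Global Moran's I: (n / sum of weights) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2),
# where z is the mean-centered variable.
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

i_obs <- morans_i(price, w)

# Shuffling prices across locations breaks the spatial pattern,
# giving a rough baseline for comparison.
i_shuffled <- morans_i(sample(price), w)

print(c(observed = i_obs, shuffled = i_shuffled))
```

Values well above the shuffled baseline suggest positive spatial autocorrelation, which is the situation in which a spatial model is worth considering.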
Over the course of the Cook County assessment project, I have tested several different models. All of these models exhibited RMSEs in excess of $90,000. Roughly speaking, a typical prediction missed the true sale price by $90,000 or more, with large misses weighted more heavily than small ones. Although the predictions seem quite poor, I think they point to the fact that house value prediction is a difficult problem. There are many factors that the assessor is barred from using to predict house values, and many others that are not easily translated into machine learning features (e.g., housing aesthetics, human psychology).
Compared to the assessor’s model, my model seems to have a similar RMSE; however, the assessor’s median assessment tends to be less than 85% of the sale price, so the assessor’s model consistently underassesses the true value of homes. Both my model and the assessor’s are variants of gradient boosted machines. LightGBM and XGBoost often yield similar predictions; however, LightGBM is often considerably faster to train.
There are quite a few ethical concerns that come with assessing home values through the use of machine learning. As we discussed in Part 1, housing assessments, even for 2023, exhibit regressivity (i.e., lower-value homes are assessed at a higher proportion of their true price). This can be hugely detrimental to the residents of these homes because it has the effect of imposing the highest effective tax rate on those with the least. We can see from the chart below that my model also exhibits a high degree of regressivity.
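To make the RMSE figure concrete, here is a small sketch with made-up sale prices and predictions (none of these numbers come from the models discussed here). It also computes the mean absolute error, which shows how a single large miss pulls RMSE above the average absolute deviation.

```r
# Invented sale prices and predictions for four homes.
actual    <- c(250000, 310000, 420000, 515000)
predicted <- c(240000, 330000, 400000, 715000)

errors <- predicted - actual

# Root mean squared error: squaring gives large misses extra weight.
rmse <- sqrt(mean(errors^2))

# Mean absolute error: the plain average size of a miss.
mae <- mean(abs(errors))

print(c(rmse = rmse, mae = mae))
```

Here three predictions are within $20,000 of the sale price, yet the one $200,000 miss is enough to push RMSE well above the mean absolute error.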
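The underassessment claim rests on the median ratio of estimated value to sale price, the same quantity as the `sale_price_ratio` column computed earlier. A minimal sketch with invented numbers (the variable names below are hypothetical):

```r
# Invented sale prices and market-value estimates for five homes.
sale_price       <- c(200000, 300000, 400000, 500000, 600000)
market_value_est <- c(170000, 240000, 350000, 410000, 480000)

# Ratio of estimated value to sale price; 1.0 means the estimate matched the sale.
ratio <- market_value_est / sale_price

# A median ratio below 1 means more than half of homes are valued below their sale price.
median_ratio <- median(ratio)

# Share of homes whose estimate falls below the sale price.
share_under <- mean(ratio < 1)

print(c(median_ratio = median_ratio, share_under = share_under))
```

Two models can share a similar RMSE while differing in this statistic, since RMSE measures the size of errors and the median ratio measures their typical direction.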
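One common summary of regressivity is the price-related differential (PRD): the mean sales ratio divided by the value-weighted mean ratio. The sketch below uses invented numbers in which cheaper homes are valued at a higher share of their sale price; it is an illustration of the statistic, not a reproduction of the results for my model or the assessor's.

```r
# Invented data: the ratio falls as the sale price rises (a regressive pattern).
sale_price <- c(100000, 200000, 400000, 800000)
estimate   <- c(110000, 200000, 360000, 640000)

ratio <- estimate / sale_price

# Unweighted mean ratio treats every home equally.
mean_ratio <- mean(ratio)

# Weighted mean ratio gives expensive homes more influence.
weighted_ratio <- sum(estimate) / sum(sale_price)

# PRD above 1 means low-value homes carry higher ratios than high-value homes.
prd <- mean_ratio / weighted_ratio

print(round(c(mean_ratio = mean_ratio,
              weighted_ratio = weighted_ratio,
              prd = prd), 3))
```

The IAAO ratio-study standard treats a PRD between 0.98 and 1.03 as acceptable, with values above 1.03 indicating regressivity.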
Additionally, many of the human aspects that affect a home’s price are difficult to distill into features usable by a machine learning algorithm, so the model cannot fully capture them.
[1] "Filtered out non-arm's length transactions"
[1] "Inflation adjusted to 2021"
Overall, I do not think that my model is quite as effective as the assessor’s, and it exhibits a high degree of regressivity. I think the model could be improved by providing it with more years of data and by refining some additional measures not currently captured by the model. I also wonder whether a more comprehensive model including data from all of the Cook County townships would produce better predictions because of the larger amount of data. At this point, I would not recommend that the assessor use my model.