There are over 1.8 million parcels in Cook County, Illinois [1]. Each parcel is assessed every three years using three to five years of prior sales information [1]. For this analysis, we will examine single family home assessments within three of the county's 36 political townships: Lemont (19), Palos (30), and Orland (28). Using data provided on the Cook County data portal, we will create two models: the first will predict the likelihood of overassessment for 2022, and the second will predict home value for 2023.
Inclusion Criteria
We restrict the data using the following inclusion criteria:
Year - 2021, 2022, 2023
Class - 202, 203, 204, 205, 206, 207, 208, 209, 210, 234, 278. These classes correspond to single family homes (a filtering sketch follows the list).
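As a minimal sketch of how these criteria might be applied, assuming a raw parcel table `universe_dt` with `year`, `class`, and `township_code` columns (the table and column names are assumptions based on the joins used later):

```r
library(dplyr)

# Hypothetical filter applying the inclusion criteria; `universe_dt` and the
# column names are assumed. Township codes come from the introduction.
sfh_classes <- c("202", "203", "204", "205", "206", "207", "208", "209",
                 "210", "234", "278")

sfh2123_universe_dt <- universe_dt %>%
  filter(
    year %in% 2021:2023,                    # assessment years of interest
    class %in% sfh_classes,                 # single family home classes
    township_code %in% c("19", "28", "30")  # Lemont, Orland, Palos
  )
```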
# Join sales to assessments, characteristics, and the parcel universe,
# then compute the assessment-to-sale-price ratio.
sfh2123_sales_AT <- sfh2123_sales_dt %>%
  lazy_dt() %>%
  left_join(sfh2123_assessment_dt,
            by = c("pin", "township_code", "class", "year" = "tax_year")) %>%
  left_join(sfh2123_characteristics_dt,
            by = c("pin", "township_code", "class", "year" = "tax_year")) %>%
  left_join(sfh2123_universe_dt,
            by = c("pin", "neighborhood_code", "township_code",
                   "township_name", "class", "year" = "tax_year")) %>%
  mutate(sale_price_ratio = 10 * certified_tot / sale_price) %>%
  collect()

# Join assessments to characteristics and the parcel universe.
sfh2123_assessment_AT <- sfh2123_assessment_dt %>%
  lazy_dt() %>%
  left_join(sfh2123_characteristics_dt,
            by = c("pin", "township_code", "class", "tax_year")) %>%
  left_join(sfh2123_universe_dt,
            by = c("pin", "township_code", "township_name", "class", "tax_year")) %>%
  collect()
Check For Missingness
library(ggplot2)

# Plot a missingness map: one tile per observation-variable pair,
# colored by whether the value is NA.
sfh2123_sales_AT %>%
  mutate(id = row_number()) %>%
  tidyr::gather(-id, key = "key", value = "val") %>%
  mutate(isna = is.na(val)) %>%
  ggplot(aes(key, id, fill = isna)) +
  geom_raster(alpha = 0.8) +
  scale_fill_manual(
    name = "",
    values = c("steelblue", "tomato3"),
    labels = c("Present", "Missing")
  ) +
  labs(x = "Variable", y = "Row Number",
       title = "Missing Values for Housing Sales") +
  coord_flip()
Quite a few measures are either entirely missing or have many missing values. This is acceptable because we will not use every variable; limiting the predictor set reduces the complexity and increases the interpretability of our model.
There seems to be a high level of correlation between many of the housing characteristics, so a model that takes multicollinearity into account may be a good idea.
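As a minimal sketch of how this correlation could be checked, assuming the numeric characteristic columns selected below exist in `sfh2123_sales_AT` (the column selection is an assumption for illustration):

```r
library(dplyr)

# Pairwise correlation matrix of selected numeric housing characteristics;
# the column selection is an assumption for illustration.
sfh2123_sales_AT %>%
  select(sale_price, building_sqft, land_sqft, num_bedrooms,
         num_rooms, num_full_baths, year_built) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)
```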
library(scales)

sfh2123_sales_AT %>%
  ggplot(aes(x = year_built, y = sale_price)) +
  geom_point(size = 0.1) +
  geom_smooth(method = "gam", color = "springgreen4") +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Year Built \nand Sale Price in Cook County, IL",
    x = "Year Built",
    y = "Sale Price"
  ) +
  theme(plot.title = element_text(hjust = 0.5))
More recently built houses seem to sell for more.
sfh2123_sales_AT %>%
  ggplot(aes(x = building_sqft, y = sale_price)) +
  geom_point(size = 0.1) +
  geom_smooth(method = "gam", color = "springgreen4") +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Building Square Footage \nand Sale Price in Cook County, IL",
    x = "Building Square Footage",
    y = "Sale Price"
  ) +
  theme(plot.title = element_text(hjust = 0.5))
Houses with more building square footage tend to sell for more.
sfh2123_sales_AT %>%
  ggplot(aes(x = land_sqft, y = sale_price)) +
  geom_point(size = 0.1) +
  geom_smooth(method = "gam", color = "springgreen4") +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Land Square Footage \nand Sale Price in Cook County, IL",
    x = "Land Square Footage",
    y = "Sale Price"
  ) +
  theme(plot.title = element_text(hjust = 0.5))
The relationship here isn't entirely clear, but it seems that more land sells for more, up to a point.
sfh2123_sales_AT %>%
  ggplot() +
  geom_boxplot(aes(x = num_bedrooms, y = sale_price,
                   fill = factor(num_bedrooms))) +
  scale_x_continuous(n.breaks = 9) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Number of Bedrooms \nand Sale Price in Cook County, IL",
    x = "Number of Bedrooms",
    y = "Sale Price"
  ) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Houses with more bedrooms tend to sell for more.
sfh2123_sales_AT %>%
  ggplot() +
  geom_boxplot(aes(x = num_rooms, y = sale_price, fill = factor(num_rooms))) +
  geom_smooth(aes(x = num_rooms, y = sale_price),
              method = "gam", color = "black") +
  scale_x_continuous(n.breaks = 20) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Number of Rooms \nand Sale Price in Cook County, IL",
    x = "Number of Rooms",
    y = "Sale Price"
  ) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Houses with more rooms tend to sell for more.
sfh2123_sales_AT %>%
  ggplot() +
  geom_boxplot(aes(x = num_full_baths + num_half_baths * 0.5, y = sale_price,
                   fill = factor(num_full_baths + num_half_baths * 0.5))) +
  geom_smooth(aes(x = num_full_baths + num_half_baths * 0.5, y = sale_price),
              method = "gam", color = "black") +
  scale_x_continuous(n.breaks = 20) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Total Number of Baths \nand Sale Price in Cook County, IL",
    x = "Total Number of Baths",
    y = "Sale Price"
  ) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Houses with more bathrooms tend to sell for more.
sfh2123_sales_AT %>%
  ggplot() +
  geom_boxplot(aes(x = cmap_walkability_total_score, y = sale_price,
                   fill = factor(cmap_walkability_total_score))) +
  geom_smooth(aes(x = cmap_walkability_total_score, y = sale_price),
              method = "gam", color = "black") +
  scale_x_continuous(n.breaks = 20) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home CMAP Walkability and \nSale Price in Cook County, IL",
    x = "Total Walkability Score",
    y = "Sale Price"
  ) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Houses in less walkable communities tend to sell for more. Although it seems like individuals might want to live in more walkable communities, the kind of large single family homes that tend to sell for more actually contribute toward reducing walkability, which is most likely why we see a downward trend.
It seems that there is clustering of similarly priced homes within our three townships. Lemont has the highest median sale price, while Palos and Orland have similar median sale prices. Perhaps a spatial modeling technique, such as a spatial Bayesian or conditional autoregressive model, is warranted.
# Quantile color palette over the assessment-to-sale-price ratio.
qpal2 <- colorQuantile("RdBu", sfh2123_sales_sf$sale_price_ratio, n = 5)

# Map township boundaries and individual sales, colored by sale price ratio.
sfh2123_sales_sf %>%
  leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(
    data = townships,
    fillColor = "black",
    fillOpacity = 0.1,
    color = "black",
    weight = 2,
    opacity = 0.5,
    highlightOptions = highlightOptions(fillOpacity = 0.2, weight = 3),
    label = sprintf(
      "<strong>%s</strong><br>Median Sale Price Ratio: %s<br/>",
      townships$NAME,
      label_number(0.001)(townships$median_spr)
    ) %>% lapply(htmltools::HTML),
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "12px",
      direction = "auto"
    )
  ) %>%
  addCircleMarkers(
    radius = 3,
    fillColor = ~qpal2(sale_price_ratio),
    fillOpacity = 0.7,
    stroke = FALSE,
    label = sprintf(
      "<strong>Sale Price Ratio:</strong> %s",
      label_number(0.001)(sfh2123_sales_sf$sale_price_ratio)
    ) %>% lapply(htmltools::HTML),
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "12px",
      direction = "auto"
    )
  ) %>%
  addLegend(
    pal = qpal2,
    values = sfh2123_sales_sf$sale_price_ratio,
    opacity = 0.7,
    title = "Assessment to Sale Price <br>Ratio of Single Family<br> Homes in Cook County",
    position = "topright",
    na.label = "Insufficient Data",
    labFormat = function(type, cuts, p) {
      # Label each bin with its ratio range; show percentile range on hover.
      n <- length(cuts)
      p <- str_c(round(p * 100), "%")
      cuts <- str_c(label_number(0.001)(cuts[-n]), " - ",
                    label_number(0.001)(cuts[-1]))
      str_c('<span title="', p[-n], " - ", p[-1], '">', cuts, '</span>')
    }
  )
The over/under assessment of properties does not appear to exhibit much spatial clustering, so perhaps there are other variables we should consider. Some values also look like they may be large outliers.
Processing Our Data
For our over/under assessment model, I will define overassessment as a sale_price_ratio greater than the median. Theoretically, this threshold should be 1, but parcels are so routinely underassessed relative to their true sale value that a threshold of 1 would place almost all houses in the underassessed category. I chose a two-class outcome to avoid the complexities of multiclass classification models. I also considered building a regression model that predicts assessment ratios and discretizing the predictions afterward, but opted against it.
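As a minimal sketch of this labeling step, assuming the 2022 sales are filtered from `sfh2123_sales_AT` (the `overassessed` column name is hypothetical):

```r
library(dplyr)

# Label each 2022 sale as over- or under-assessed relative to the median
# sale_price_ratio; `overassessed` is a hypothetical column name.
sfh22_model_data <- sfh2123_sales_AT %>%
  filter(year == 2022) %>%
  mutate(
    overassessed = factor(
      if_else(sale_price_ratio > median(sale_price_ratio, na.rm = TRUE),
              "over", "under")
    )
  )
```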
It is also important to note that our assessment model will only include data from 2022, while our sale price model will include data from 2021 and 2022. Although prices may differ slightly between these years, pooling them gives the sale price model more observations to learn from.
I chose to use variables that I know are typically considered when buying a house, but we could definitely construct a workflow evaluating models with various predictors.
In order to reduce model tuning time, we will use racing with a repeated-measures ANOVA model to eliminate tuning parameter combinations that are unlikely to yield the best results.
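A minimal sketch of this racing approach using the finetune package, assuming a tunable workflow `assess_wf` and resamples `assess_folds` defined elsewhere (both names are assumptions):

```r
library(tidymodels)
library(finetune)

# Race the tuning candidates: after each resample, a repeated-measures ANOVA
# eliminates parameter combinations unlikely to be the best.
# `assess_wf` and `assess_folds` are assumed to be defined elsewhere.
race_res <- tune_race_anova(
  assess_wf,
  resamples = assess_folds,
  grid = 20,
  metrics = metric_set(roc_auc, accuracy),
  control = control_race(verbose_elim = TRUE)
)

show_best(race_res, metric = "roc_auc")
```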
Our accuracy and AUC are not much higher than what would be expected by random chance. This is not especially encouraging for our model, but perhaps it indicates that there are no obvious trends in overassessment/underassessment, which could reflect well on the assessor's predictions.
Building square footage, land square footage, and year built seem to be the most important variables in this model, but it's difficult to truly interpret what that means in the context of this model.
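A minimal sketch of how variable importance could be extracted with the vip package, assuming a fitted tidymodels workflow `assess_fit` (the name is an assumption):

```r
library(workflows)
library(vip)

# Pull the underlying parsnip model from the fitted workflow
# (`assess_fit` is an assumed name) and plot top importance scores.
assess_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)
```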
As with the assessment model, I chose variables that are typically considered when buying a house, though we could construct a workflow that evaluates models with different predictor sets.
Create Testing/Training Data and Evaluate Your Model
We are going to utilize spatial resampling to hopefully adjust for the clustering that we saw when mapping. Instead of the typical random split, we create resamples that hold out nearby properties together, as sketched below.
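A minimal sketch using the spatialsample package, assuming `sfh2123_sales_sf` is an sf object with point geometries (the fold count is an arbitrary choice):

```r
library(spatialsample)

# Spatially blocked cross-validation: the study area is divided into a grid
# and entire blocks are held out together, so nearby (spatially correlated)
# sales do not leak between the analysis and assessment sets.
set.seed(2023)
sp_folds <- spatial_block_cv(sfh2123_sales_sf, v = 10)

autoplot(sp_folds)
```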
As before, we will use racing with a repeated-measures ANOVA model to eliminate tuning parameter combinations that are unlikely to yield the best results.
The average deviation of our predicted prices from the true sale prices is about $97,500, which is quite high all things considered. We definitely have a lot of refining left to do.
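A minimal sketch of how this error could be computed with yardstick, assuming a predictions data frame `price_preds` with columns `sale_price` and `.pred` (both names are assumptions):

```r
library(dplyr)
library(yardstick)

# Mean absolute error of predicted vs. actual sale prices;
# `price_preds`, `sale_price`, and `.pred` are assumed names.
price_preds %>%
  mae(truth = sale_price, estimate = .pred)
```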
Building square footage is by far the most important variable in our model. This makes a lot of sense given our prior EDA and simple intuition; the results line up fairly well with what one would expect.