There are over 1.8 million parcels in Cook County, Illinois [1]. Each of these parcels is assessed every three years using three to five years of prior sales information [1]. For this analysis, we will look at single-family home assessments within three of the 36 political townships: Lemont (19), Palos (30), and Orland (28). Using data provided on the Cook County data portal, we will create two models: the first will predict the likelihood of overassessment for 2022, and the second will predict home value for 2023.
Inclusion Criteria
We have 4 inclusion criteria:
Year - 2021, 2022, 2023
Class - 202, 203, 204, 205, 206, 207, 208, 209, 210, 234, 278. These classes correspond to single-family homes (a filtering sketch follows below).
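As a rough illustration, the inclusion criteria could be applied with a simple dplyr filter. This is only a sketch: the raw table name parcels_raw is a placeholder for whichever table holds the unfiltered parcels.
Code
library(dplyr)

# Hypothetical sketch: apply the year and class inclusion criteria
# to a raw parcel table (`parcels_raw` is a placeholder name).
sfh_classes <- c("202", "203", "204", "205", "206",
                 "207", "208", "209", "210", "234", "278")

sfh2123 <- parcels_raw %>%
  filter(
    year %in% 2021:2023,    # tax/sale years of interest
    class %in% sfh_classes  # single-family home classes
  )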
sfh2123_sales_AT <- sfh2123_sales_dt %>%
  lazy_dt() %>%
  left_join(sfh2123_assessment_dt, by = c("pin", "township_code", "class", "year" = "tax_year")) %>%
  left_join(sfh2123_characteristics_dt, by = c("pin", "township_code", "class", "year" = "tax_year")) %>%
  left_join(sfh2123_universe_dt, by = c("pin", "neighborhood_code", "township_code", "township_name", "class", "year" = "tax_year")) %>%
  mutate(sale_price_ratio = 10 * certified_tot / sale_price) %>%
  collect()

sfh2123_assessment_AT <- sfh2123_assessment_dt %>%
  lazy_dt() %>%
  left_join(sfh2123_characteristics_dt, by = c("pin", "township_code", "class", "tax_year")) %>%
  left_join(sfh2123_universe_dt, by = c("pin", "township_code", "township_name", "class", "tax_year")) %>%
  collect()
Check For Missingness
Code
library(ggplot2)

sfh2123_sales_AT %>%
  mutate(id = row_number()) %>%
  tidyr::gather(-id, key = "key", value = "val") %>%
  mutate(isna = is.na(val)) %>%
  ggplot(aes(key, id, fill = isna)) +
  geom_raster(alpha = 0.8) +
  scale_fill_manual(
    name = "",
    values = c("steelblue", "tomato3"),
    labels = c("Present", "Missing")
  ) +
  labs(x = "Variable", y = "Row Number", title = "Missing Values for Housing Sales") +
  coord_flip()
It seems we have several measures that are either entirely missing or have many missing values. This is acceptable because we won't be using all of the variables; limiting the predictor set reduces the complexity and increases the interpretability of our model.
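As a quick numeric complement to the plot, the count of missing values per column can be tallied directly; this sketch just uses dplyr and tidyr on the joined sales table.
Code
# Count missing values per column and sort from most to least missing.
sfh2123_sales_AT %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  arrange(desc(n_missing))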
There seems to be a high level of correlation between many of the housing characteristics, so a model that takes multicollinearity into account may be a good idea.
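One quick way to check this is a correlation matrix of the numeric housing characteristics; below is a sketch using base cor() on a handful of the columns explored in the plots that follow.
Code
# Rough multicollinearity check among housing characteristics.
sfh2123_sales_AT %>%
  select(building_sqft, land_sqft, num_rooms, num_bedrooms,
         num_full_baths, num_half_baths, year_built) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)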
library(scales)

sfh2123_sales_AT %>%
  ggplot(aes(x = year_built, y = sale_price)) +
  geom_point(size = 0.1) +
  geom_smooth(method = "gam", color = "springgreen4") +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Year Built \nand Sale Price in Cook County, IL",
    x = "Year Built",
    y = "Sale Price"
  ) +
  theme(plot.title = element_text(hjust = 0.5))
More recently built houses seem to sell for more.
Code
sfh2123_sales_AT %>%
  ggplot(aes(x = building_sqft, y = sale_price)) +
  geom_point(size = 0.1) +
  geom_smooth(method = "gam", color = "springgreen4") +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Building Square Footage \nand Sale Price in Cook County, IL",
    x = "Building Square Footage",
    y = "Sale Price"
  ) +
  theme(plot.title = element_text(hjust = 0.5))
Houses with larger buildings tend to sell for more.
Code
sfh2123_sales_AT %>%
  ggplot(aes(x = land_sqft, y = sale_price)) +
  geom_point(size = 0.1) +
  geom_smooth(method = "gam", color = "springgreen4") +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Land Square Footage \nand Sale Price in Cook County, IL",
    x = "Land Square Footage",
    y = "Sale Price"
  ) +
  theme(plot.title = element_text(hjust = 0.5))
The relationship here isn’t entirely clear, but it seems that more land sells for more, up to a point.
Code
sfh2123_sales_AT %>%
  ggplot() +
  geom_boxplot(aes(x = num_bedrooms, y = sale_price, fill = factor(num_bedrooms))) +
  scale_x_continuous(n.breaks = 9) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Number of Bedrooms \nand Sale Price in Cook County, IL",
    x = "Number of Bedrooms",
    y = "Sale Price"
  ) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Houses with more bedrooms tend to sell for more.
Code
sfh2123_sales_AT %>%
  ggplot() +
  geom_boxplot(aes(x = num_rooms, y = sale_price, fill = factor(num_rooms))) +
  geom_smooth(aes(x = num_rooms, y = sale_price), method = "gam", color = "black") +
  scale_x_continuous(n.breaks = 20) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Number of Rooms \nand Sale Price in Cook County, IL",
    x = "Number of Rooms",
    y = "Sale Price"
  ) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Houses with more rooms tend to sell for more.
Code
sfh2123_sales_AT %>%
  ggplot() +
  geom_boxplot(aes(x = num_full_baths + num_half_baths * 0.5, y = sale_price, fill = factor(num_full_baths + num_half_baths * 0.5))) +
  geom_smooth(aes(x = num_full_baths + num_half_baths * 0.5, y = sale_price), method = "gam", color = "black") +
  scale_x_continuous(n.breaks = 20) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home Total Number of Baths \nand Sale Price in Cook County, IL",
    x = "Total Number of Baths",
    y = "Sale Price"
  ) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Houses with more bathrooms tend to sell for more.
Code
sfh2123_sales_AT %>%
  ggplot() +
  geom_boxplot(aes(x = cmap_walkability_total_score, y = sale_price, fill = factor(cmap_walkability_total_score))) +
  geom_smooth(aes(x = cmap_walkability_total_score, y = sale_price), method = "gam", color = "black") +
  scale_x_continuous(n.breaks = 20) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Single Family Home CMAP Walkability and \nSale Price in Cook County, IL",
    x = "Total Walkability Score",
    y = "Sale Price"
  ) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Houses in less walkable communities tend to sell for more. Although individuals might prefer to live in more walkable communities, the kind of large single-family homes that sell for more actually contribute toward reducing walkability, which is most likely why we see a downward trend.
Mapping Our Outcome Variables
Let us take a look at how our outcome variables are distributed spatially. Clustering might indicate that a spatial modeling technique is required.
It seems that there is clustering of similarly priced homes within our three townships. Lemont has the highest median sale price, while Palos and Orland have similar median sale prices. Perhaps a spatial modeling technique like a spatial Bayesian or conditional autoregressive model is warranted.
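Before committing to such a model, the clustering could be quantified with a global Moran's I test from spdep. The sketch below assumes the sales table has coordinate columns named lon and lat, which may differ from the actual column names in the parcel universe data.
Code
library(spdep)

# Sketch: global Moran's I for sale price over a k-nearest-neighbor graph.
# Coordinate column names `lon` and `lat` are assumptions.
sales_pts <- sfh2123_sales_AT %>%
  filter(!is.na(sale_price), !is.na(lon), !is.na(lat))

coords <- cbind(sales_pts$lon, sales_pts$lat)
nb     <- knn2nb(knearneigh(coords, k = 5))  # 5 nearest neighbors
lw     <- nb2listw(nb, style = "W")          # row-standardized weights

moran.test(sales_pts$sale_price, lw)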
For our over/under assessment model, I will define overassessment as a sale_price_ratio greater than the median. Theoretically, this threshold should be 1, but parcels are so routinely underassessed relative to their true sale value that a threshold of 1 would place almost all houses in the underassessed category. I also chose a two-class outcome to avoid the complexities of multiclass classification models. I considered building a regression model that predicts assessment ratios and simply discretizing the predictions later, but opted against it.
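A minimal sketch of how that binary outcome could be constructed; the cutoff and the factor labels ("over"/"under") are my own choices here.
Code
# Sketch: flag parcels whose assessment ratio exceeds the median ratio.
median_ratio <- median(sfh2123_sales_AT$sale_price_ratio, na.rm = TRUE)

sfh2123_sales_AT <- sfh2123_sales_AT %>%
  mutate(
    overassessed = factor(
      if_else(sale_price_ratio > median_ratio, "over", "under"),
      levels = c("over", "under")
    )
  )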
It is also important to note that our assessment model will only include data from 2022, while our sale price model will include data from 2021 and 2022. Although prices may differ slightly between these years, the relationships between home characteristics and sale price should be broadly similar.
I chose to use variables that are typically considered when buying a house, but we could certainly construct a workflow that evaluates models with different sets of predictors.
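If we wanted to compare predictor sets more systematically, the workflowsets package from tidymodels can pair several recipes with one model specification. The recipes below are illustrative only, not the exact predictor sets used in this analysis.
Code
library(tidymodels)
library(workflowsets)

# Illustrative recipes with different predictor sets (assumed, not the ones used here).
rec_basic <- recipe(sale_price ~ building_sqft + num_bedrooms + num_full_baths,
                    data = sfh2123_sales_AT)
rec_full  <- recipe(sale_price ~ building_sqft + land_sqft + num_rooms +
                      num_bedrooms + num_full_baths + year_built,
                    data = sfh2123_sales_AT)

wf_set <- workflow_set(
  preproc = list(basic = rec_basic, full = rec_full),
  models  = list(lm = linear_reg())
)
workflow_map() could then fit each recipe/model combination over the same resamples for comparison.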
To reduce model tuning time, we will use a repeated-measures ANOVA racing approach to eliminate tuning parameter combinations that are unlikely to yield the best results.
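This racing approach is implemented in the finetune package. A sketch of how it might be wired up, assuming a tunable workflow (wf) and cross-validation folds (folds) defined elsewhere; the grid size and metrics are placeholders.
Code
library(finetune)

# Sketch: racing with repeated-measures ANOVA drops poor tuning candidates early.
# `wf` and `folds` are assumed objects (workflow and CV folds) defined elsewhere.
race_results <- tune_race_anova(
  wf,
  resamples = folds,
  grid      = 25,
  metrics   = metric_set(rmse, mae),
  control   = control_race(verbose_elim = TRUE)
)

show_best(race_results, metric = "mae")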
It seems the average deviation of our predicted prices from the true sale price is about $97,500. That is pretty high, all things considered; we definitely have quite a lot of refining to do.
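If this figure is the mean absolute error, it can be computed with yardstick roughly as follows; the prediction tibble name (test_aug) and the metric choice are assumptions on my part.
Code
library(yardstick)

# Sketch: mean absolute error of predicted vs. actual sale price.
# `test_aug` is a placeholder for a tibble of test-set predictions with a `.pred` column.
test_aug %>%
  mae(truth = sale_price, estimate = .pred)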
Building square footage is by far the most important variable in our model. This makes sense given our prior EDA and basic intuition, and the results line up well with what one would expect.
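Variable importance like this can be pulled from a fitted tidymodels workflow with the vip package; a quick sketch, where final_fit is assumed to be the final fitted workflow object.
Code
library(vip)

# Sketch: plot the top variable importances from the fitted model.
# `final_fit` is a placeholder for the final fitted workflow.
final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)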
Part B
We are going to conduct “out-of-sample” predictions for the entire year of 2023, but first, let’s try to create some new features that may allow us to improve on our assessments.
Code
sfh_assessment_fe <- sfh2123_assessment_dt %>%
  lazy_dt() %>%
  left_join(sfh2123_characteristics_dt, by = c("pin", "township_code", "class", "tax_year")) %>%
  left_join(sfh2123_universe_dt, by = c("pin", "tax_year", "class", "township_code", "township_name")) %>%
  left_join(sfh2123_sales_dt, by = c("pin", "township_code", "class", "neighborhood_code", "tax_year" = "year")) %>%
  collect()
We are going to get some age and education features from the ACS.
Code
library(tidycensus)

acs2022vars <- load_variables(2022, "acs5")
tables <- list("B01001", "B15003")

acs2022vars <- acs2022vars %>%
  mutate(table = str_sub(name, 1, 7)) %>%
  filter(table %in% str_c(tables, "_")) %>%
  mutate(table = str_sub(table, 1, 6)) %>%
  mutate(name = str_c("estimate_", name)) %>%
  mutate(label = str_sub(label, 11)) %>%
  separate_wider_delim(label, "!!", names = c("1st", "2nd", "3rd", "4th", "5th", "6th"), too_few = "align_start")

ACS_data <- map_df(tables, \(table)
  get_acs(geography = "tract", state = "IL", county = "Cook", table = table, year = 2022)
) %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe))

# gets the sum of ACS estimates for a single tract
get_var_rowsums <- function(ACS_data, var_df) {
  ACS_data %>%
    select(., var_df %>% select(name) %>% unlist() %>% matches()) %>%
    rowSums()
}

processed_ACS_data <- ACS_data %>%
  mutate(
    TotalPop = get_var_rowsums(., acs2022vars %>% filter(table == "B01001", `1st` == "Total:", is.na(`2nd`))),
    Age00_17 = get_var_rowsums(., acs2022vars %>% filter(table == "B01001", `3rd` %in% c("Under 5 years", "5 to 9 years", "10 to 14 years", "15 to 17 years"), is.na(`4th`))),
    Age65p = get_var_rowsums(., acs2022vars %>% filter(table == "B01001", `3rd` %in% c("65 and 66 years", "67 to 69 years", "70 to 74 years", "75 to 79 years", "80 to 84 years", "85 years and over"), is.na(`4th`))),
    Pct00_17 = Age00_17 / TotalPop,
    Pct65p = Age65p / TotalPop,
    EduHS = get_var_rowsums(., acs2022vars %>% filter(table == "B15003", `2nd` %in% c("Regular high school diploma", "GED or alternative credential", "Some college, less than 1 year", "Some college, 1 or more years, no degree"), is.na(`3rd`))),
    TotalEdu = get_var_rowsums(., acs2022vars %>% filter(table == "B15003", `1st` == "Total:", is.na(`2nd`))),
    EduAS = get_var_rowsums(., acs2022vars %>% filter(table == "B15003", `2nd` == "Associate's degree", is.na(`3rd`))),
    EduBA = get_var_rowsums(., acs2022vars %>% filter(table == "B15003", `2nd` == "Bachelor's degree", is.na(`3rd`))),
    EduGA = get_var_rowsums(., acs2022vars %>% filter(table == "B15003", `2nd` %in% c("Master's degree", "Professional school degree", "Doctorate degree"), is.na(`3rd`))),
    PctEdHS = EduHS / TotalEdu,
    PctEduAS = EduAS / TotalEdu,
    PctEduBA = EduBA / TotalEdu,
    PctEduGA = EduGA / TotalEdu
  ) %>%
  select(GEOID, starts_with("Pct"))
We will combine these variables with some of our geographic variables from the geouniverse dataset to try to improve our model.
Code
sfh_assessment_fe <- sfh_assessment_fe %>%
  left_join(processed_ACS_data, by = c("census_tract_geoid" = "GEOID"))

sfh_assessment_fe_known <- sfh_assessment_fe %>%
  filter(!is.na(sale_price))
For some reason, our model has gotten slightly worse. I wonder if the added features are not being incorporated into the model in a meaningful way; they may simply be adding noise.
It’s pretty close, but this is not as good as it might seem: we saw in Part 1 that assessments tend to come in below 85% of the sale price, so this is not a high bar.
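To put a number on this, we could compare the median ratio of assessed and modeled prices to the actual sale price. A rough sketch using the augmented prediction table (s_aug) from the chunk below, with the same certified_tot and .pred columns it relies on.
Code
# Sketch: median assessment and model ratios relative to the actual sale price.
s_aug %>%
  summarise(
    median_assessed_ratio = median(10 * certified_tot / sale_price, na.rm = TRUE),
    median_modeled_ratio  = median(.pred / sale_price, na.rm = TRUE)
  )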
Nevertheless, let’s see how the distributions of the Cook County assessed prices, our modeled assessment prices, and the actual sale prices compare.
Code
s_aug %>%
  mutate(assessed_price = 10 * certified_tot) %>%
  mutate(modeled_price = .pred) %>%
  select(pin, ends_with("price")) %>%
  pivot_longer(ends_with("price"), names_to = "metric", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_density(position = "stack", na.rm = TRUE) +
  facet_wrap(vars(metric), nrow = 3) +
  scale_x_continuous(labels = label_currency()) +
  labs(title = "Sale Price Distribution of Assessed Prices, Modeled Prices, and Sale Prices")
The distributions of assessed and modeled prices seem to shrink toward the mean. This is most likely because of the unbalanced nature of our data: there are far more homes in the $200,000 to $600,000 range than outside it, so the assessed and modeled prices tend toward that range.