Cook County Property Assessment - Part 4

Cook County Assessments

Author: Ryan Zomorrodi

Published: April 7, 2024

Final Submission

There are over 1.8 million parcels in Cook County, Illinois [1]. Each of these parcels is assessed every three years using three to five years of prior sales information [1]. For this analysis, we will take a look at single family home assessments within three of the 36 political townships: Lemont (19), Palos (30), and Orland (28). Using data provided on the Cook County data portal, we will create a model predicting the sale price of a property.

Predicting the sale prices of properties poses several technical challenges. 1) There is a great degree of multicollinearity in the dataset because housing attributes tend to be associated with one another. 2) There are aspects of the home buying process that the assessor is not allowed to use as part of their assessment. Given these challenges, we will do our best to evaluate the applicability of several models.

Inclusion Criteria

We have 4 inclusion criteria:

  • Year - 2021, 2022, 2023

  • Class - 202, 203, 204, 205, 206, 207, 208, 209, 210, 234, 278. These classes correspond to single family homes

  • Township Code - 19 (Lemont), 30 (Palos), 28 (Orland)

  • Is Multisale - No

Code
library(arrow)
library(data.table)
library(dtplyr)
library(tidyverse)

sfh2123_assessment <-
  open_csv_dataset("../data/20240217_Assessed_Values.csv",
    skip = 1,
    na = c("", "[]", "NA"),
    schema(
      field("pin", string()),
      field("tax_year", int32()),
      field("class", string()),
      field("township_code", string()),
      field("township_name", string()),
      field("mailed_bldg", int32()),
      field("mailed_land", int32()),
      field("mailed_tot", int32()),
      field("certified_bldg", int32()),
      field("certified_land", int32()),
      field("certified_tot", int32()),
      field("board_bldg", int32()),
      field("board_land", int32()),
      field("board_tot", int32()),
    )
  ) %>%
  filter(tax_year %in% c(2021, 2022, 2023)) %>%
  filter(class %in% c("202", "203", "204", "205", "206", "207", "208", "209", "210", "234", "278")) %>% 
  filter(township_code %in% c("19", "30", "28"))

sfh2123_sales <- 
  open_csv_dataset("../data/20240217_Parcel_Sales.csv",
    skip = 1,
    na = c("", "[]", "NA"),
    schema(
      field("pin", string()),
      field("year", int32()),
      field("township_code", string()),
      field("neighborhood_code", string()),
      field("class", string()),
      field("sale_date", string()),
      field("is_mydec_date", bool()),
      field("sale_price", int32()),
      field("sale_document_num", string()),
      field("sale_deed_type", string()),
      field("mydec_deed_type", string()),
      field("sale_seller_name", string()),
      field("is_multisale", bool()),
      field("num_parcels_sale", int32()),
      field("sale_buyer_name", string()),
      field("sale_type", string()),
      field("sale_filter_same_sale_within_365", bool()),
      field("sale_filter_less_than_10k", bool()),
      field("sale_filter_deed_type", bool())
    )
  ) %>%
  filter(year %in% c(2021, 2022, 2023)) %>%
  filter(class %in% c("202", "203", "204", "205", "206", "207", "208", "209", "210", "234", "278")) %>% 
  filter(township_code %in% c("19", "30", "28")) %>%
  filter(!is_multisale) 

sfh2123_characteristics <- 
  open_csv_dataset("../data/20240217_Improvement_Characteristics.csv",
    skip = 1,
    na = c("", "[]", "NA"),
    schema(
      field("pin", string()),
      field("tax_year", int32()),
      field("card_num", string()),
      field("class", string()),
      field("township_code", string()),
      field("proration_key_pin", string()),
      field("pin_proration_rate", double()),
      field("card_proration_rate", double()),
      field("cdu", string()),
      field("pin_is_multicard", bool()),
      field("pin_num_cards", int32()),
      field("pin_is_multiland", bool()),
      field("pin_num_landlines", int32()),
      field("year_built", int32()),
      field("building_sqft", int32()),
      field("land_sqft", int32()),
      field("num_bedrooms", int32()),
      field("num_rooms", int32()),
      field("num_full_baths", int32()),
      field("num_half_baths", int32()),
      field("num_fireplaces", int32()),
      field("type_of_residence", string()),
      field("construction_quality", string()),
      field("num_apartments", string()),
      field("attic_finish", string()),
      field("garage_attached", string()),
      field("garage_area_included", string()),
      field("garage_size", string()),
      field("garage_ext_wall_material", string()),
      field("attic_type", string()),
      field("basement_type", string()),
      field("ext_wall_material", string()),
      field("central_heating", string()),
      field("repair_condition", string()),
      field("basement_finish", string()),
      field("roof_material", string()),
      field("single_v_multi_family", string()),
      field("site_desirability", string()),
      field("num_commercial_units", string()),
      field("renovation", string()),
      field("recent_renovation", bool()),
      field("porch", string()),
      field("central_air", string()),
      field("design_plan", string())
    )
  ) %>%
  filter(tax_year %in% c(2021, 2022, 2023)) %>%
  filter(class %in% c("202", "203", "204", "205", "206", "207", "208", "209", "210", "234", "278")) %>% 
  filter(township_code %in% c("19", "30", "28"))

sfh2123_universe <- 
  open_csv_dataset("../data/20240217_Parcel_Universe.csv",
    skip = 1,
    na = c("", "[]", "NA"),
    schema(
      field("pin", string()),
      field("pin10", string()),
      field("tax_year", int32()),
      field("class", string()),
      field("triad_name", string()),
      field("triad_code", string()),
      field("township_name", string()),
      field("township_code", string()),
      field("neighborhood_code", string()),
      field("tax_district_code", string()),
      field("zip_code", string()),
      field("longitude", double()),
      field("latitude", double()),
      field("centroid_x_crs_3435", double()),
      field("centroid_y_crs_3435", double()),
      field("census_block_group_geoid", string()),
      field("census_block_geoid", string()),
      field("census_congressional_district_geoid", string()),
      field("census_county_subdivision_geoid", string()),
      field("census_place_geoid", string()),
      field("census_puma_geoid", string()),
      field("census_school_district_elementary_geoid", string()),
      field("census_school_district_secondary_geoid", string()),
      field("census_school_district_unified_geoid", string()),
      field("census_state_representative_geoid", string()),
      field("census_state_senate_geoid", string()),
      field("census_tract_geoid", string()),
      field("census_zcta_geoid", string()),
      field("census_data_year", int32()),
      field("census_acs5_congressional_district_geoid", string()),
      field("census_acs5_county_subdivision_geoid", string()),
      field("census_acs5_place_geoid", string()),
      field("census_acs5_puma_geoid", string()),
      field("census_acs5_school_district_elementary_geoid", string()),
      field("census_acs5_school_district_secondary_geoid", string()),
      field("census_acs5_school_district_unified_geoid", string()),
      field("census_acs5_state_representative_geoid", string()),
      field("census_acs5_state_senate_geoid", string()),
      field("census_acs5_tract_geoid", string()),
      field("census_acs5_data_year", int32()),
      field("board_of_review_district_num", string()),
      field("board_of_review_district_data_year", int32()),
      field("commissioner_district_num", string()),
      field("commissioner_district_data_year", int32()),
      field("judicial_district_num", string()),
      field("judicial_district_data_year", int32()),
      field("ward_num", string()),
      field("ward_chicago_data_year", int32()),
      field("ward_evanston_data_year", int32()),
      field("chicago_community_area_num", string()),
      field("chicago_community_area_name", string()),
      field("chicago_community_area_data_year", int32()),
      field("chicago_industrial_corridor_num", string()),
      field("chicago_industrial_corridor_name", string()),
      field("chicago_industrial_corridor_data_year", int32()),
      field("chicago_police_district_num", string()),
      field("chicago_police_district_data_year", int32()),
      field("coordinated_care_area_num", string()),
      field("coordinated_care_area_data_year", int32()),
      field("enterprise_zone_num", string()),
      field("enterprise_zone_data_year", int32()),
      field("industrial_growth_zone_num", string()),
      field("industrial_growth_zone_data_year", int32()),
      field("qualified_opportunity_zone_num", string()),
      field("qualified_opportunity_zone_data_year", int32()),
      field("flood_fema_sfha", bool()),
      field("flood_fema_data_year", int32()),
      field("flood_fs_factor", int32()),
      field("flood_fs_risk_direction", int32()),
      field("flood_fs_data_year", int32()),
      field("ohare_noise_contour_no_buffer_bool", bool()),
      field("ohare_noise_contour_half_mile_buffer_bool", bool()),
      field("ohare_noise_contour_data_year", int32()),
      field("airport_noise_dnl", double()),
      field("airport_noise_data_year", string()),
      field("school_elementary_district_geoid", string()),
      field("school_elementary_district_name", string()),
      field("school_secondary_district_geoid", string()),
      field("school_secondary_district_name", string()),
      field("school_unified_district_geoid", string()),
      field("school_unified_district_name", string()),
      field("school_year", string()),
      field("school_data_year", int32()),
      field("tax_municipality_num", string()),
      field("tax_municipality_name", string()),
      field("tax_school_elementary_district_num", string()),
      field("tax_school_elementary_district_name", string()),
      field("tax_school_secondary_district_num", string()),
      field("tax_school_secondary_district_name", string()),
      field("tax_school_unified_district_num", string()),
      field("tax_school_unified_district_name", string()),
      field("tax_community_college_district_num", string()),
      field("tax_community_college_district_name", string()),
      field("tax_fire_protection_district_num", string()),
      field("tax_fire_protection_district_name", string()),
      field("tax_library_district_num", string()),
      field("tax_library_district_name", string()),
      field("tax_park_district_num", string()),
      field("tax_park_district_name", string()),
      field("tax_sanitation_district_num", string()),
      field("tax_sanitation_district_name", string()),
      field("tax_special_service_area_num", string()),
      field("tax_special_service_area_name", string()),
      field("tax_tif_district_num", string()),
      field("tax_tif_district_name", string()),
      field("tax_districts_data_year", int32()),
      field("cmap_walkability_grid_id", string()),
      field("cmap_walkability_no_transit_score", double()),
      field("cmap_walkability_total_score", double()),
      field("cmap_walkability_data_year", int32()),
      field("subdivision_id", string()),
      field("subdivision_data_year", int32())
    )
  ) %>%
  filter(tax_year %in% c(2021, 2022, 2023)) %>%
  filter(class %in% c("202", "203", "204", "205", "206", "207", "208", "209", "210", "234", "278")) %>% 
  filter(township_code %in% c("19", "30", "28"))

sfh2123_assessment_dt <- sfh2123_assessment %>% as.data.table()
sfh2123_sales_dt <- sfh2123_sales %>% as.data.table()
sfh2123_characteristics_dt <- sfh2123_characteristics %>% as.data.table()
sfh2123_universe_dt <- sfh2123_universe %>% as.data.table()

Data Joining

Code
sfh2123_sales_AT <- sfh2123_sales_dt %>%
  lazy_dt() %>%
  left_join(sfh2123_assessment_dt, by = c("pin", "township_code", "class", "year" = "tax_year")) %>% 
  left_join(sfh2123_characteristics_dt, by = c("pin", "township_code", "class", "year" = "tax_year")) %>% 
  left_join(sfh2123_universe_dt, by = c("pin", "neighborhood_code", "township_code", "township_name", "class", "year" = "tax_year")) %>%
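  # Cook County assesses residential parcels at 10% of market value,
  # so 10 * certified_tot approximates the implied market value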
  mutate(sale_price_ratio = 10 * certified_tot / sale_price) %>%
  collect()

sfh2123_assessment_AT <- sfh2123_assessment_dt %>% 
  lazy_dt() %>%
  left_join(sfh2123_characteristics_dt, by = c("pin", "township_code", "class", "tax_year")) %>% 
  left_join(sfh2123_universe_dt, by = c("pin", "township_code", "township_name", "class", "tax_year")) %>%
  collect()
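
One caveat worth flagging (not checked in the original analysis): the characteristics file contains one row per card (card_num), so multicard parcels could in principle fan out sale rows during the join. A minimal sketch to check:

Code
# Sketch: if these counts differ, multicard parcels (pin_is_multicard)
# likely duplicated some sales rows during the characteristics join
nrow(sfh2123_sales_dt)
nrow(sfh2123_sales_AT)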

Check For Missingness

Code
library(ggplot2)

sfh2123_sales_AT %>%
  mutate(id = row_number()) %>%
  tidyr::gather(-id, key = "key", value = "val") %>%
  mutate(isna = is.na(val)) %>%
  ggplot(aes(key, id, fill = isna)) +
    geom_raster(alpha=0.8) +
    scale_fill_manual(name = "",
        values = c('steelblue', 'tomato3'),
        labels = c("Present", "Missing")) +
    labs(x = "Variable",
           y = "Row Number", title = "Missing Values for Housing Sales") +
    coord_flip()

It seems we have quite a few measures that are either entirely missing or have many missing values. This is acceptable because we won't be using all of our variables; limiting the feature set reduces the complexity and increases the interpretability of our model.
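
To make the raster plot concrete, we can also tabulate the share of missing values per column. A minimal sketch using only packages already loaded:

Code
# Share of missing values per column, highest first
sfh2123_sales_AT %>%
  summarize(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "pct_missing") %>%
  arrange(desc(pct_missing))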

Check For Correlation

Code
library(corrr)

sfh2123_sales_AT %>%
  select(c(sale_price, certified_tot, year_built, building_sqft, land_sqft, num_bedrooms, num_rooms, num_full_baths, num_half_baths, num_fireplaces, cmap_walkability_no_transit_score, cmap_walkability_total_score)) %>%
  select(where(is.numeric)) %>%
  select(where(function(x) sum(!is.na(x)) != 0)) %>%
  select(where(function(x) var(x, na.rm = TRUE) != 0)) %>% 
  correlate(method = "spearman") %>%
  autoplot()

There seems to be a high level of correlation between many of the housing characteristics. Perhaps using a model that takes multicollinearity into account would be a good idea.
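
To pin down which pairs drive this, we can flatten the correlation matrix into pairs and filter. A minimal sketch using corrr::stretch(); the 0.6 cutoff is an arbitrary choice:

Code
# List the most correlated predictor pairs (Spearman), strongest first
sfh2123_sales_AT %>%
  select(year_built, building_sqft, land_sqft, num_bedrooms, num_rooms,
         num_full_baths, num_half_baths, num_fireplaces) %>%
  correlate(method = "spearman") %>%
  stretch(remove.dups = TRUE) %>%
  filter(abs(r) > 0.6) %>%
  arrange(desc(abs(r)))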

Investigating Our Variables

Code
library(scales)

sfh2123_sales_AT %>%
  ggplot(aes(x = year_built, y = sale_price)) +
    geom_point(size = 0.1) +
    geom_smooth(method = "gam", color = "springgreen4") +
    scale_y_continuous(labels = label_currency()) + 
    labs(
      title = "Single Family Home Year Built \nand Sale Price in Cook County, IL", 
      x = "Year Built", 
      y = "Sale Price"
    ) +
    theme(plot.title = element_text(hjust = 0.5))

More recently built houses seem to sell for more.

Code
sfh2123_sales_AT %>%
  ggplot(aes(x = building_sqft, y = sale_price)) +
    geom_point(size = 0.1) +
    geom_smooth(method = "gam", color = "springgreen4") +
    scale_y_continuous(labels = label_currency()) + 
    labs(
      title = "Single Family Home Building Square Footage \nand Sale Price in Cook County, IL", 
      x = "Building Square Footage", 
      y = "Sale Price"
    ) +
    theme(plot.title = element_text(hjust = 0.5))

Houses with larger buildings tend to sell for more.

Code
sfh2123_sales_AT %>%
  ggplot(aes(x = land_sqft, y = sale_price)) +
    geom_point(size = 0.1) +
    geom_smooth(method = "gam", color = "springgreen4") +
    scale_y_continuous(labels = label_currency()) + 
    labs(
      title = "Single Family Home Land Square Footage \nand Sale Price in Cook County, IL", 
      x = "Land Square Footage", 
      y = "Sale Price"
    ) +
    theme(plot.title = element_text(hjust = 0.5))

The relationship here isn't entirely clear, but it seems that homes with more land sell for more, up to a point.

Code
sfh2123_sales_AT %>%
  ggplot() +
    geom_boxplot(aes(x = num_bedrooms, y = sale_price, fill = factor(num_bedrooms))) +
    scale_x_continuous(n.breaks = 9) +
    scale_y_continuous(labels = label_currency()) +
    labs(
      title = "Single Family Home Number of Bedrooms \nand Sale Price in Cook County, IL", 
      x = "Number of Bedrooms", 
      y = "Sale Price"
    ) + 
    theme(legend.position = "none", plot.title = element_text(hjust = 0.5))

Houses with more bedrooms tend to sell for more.

Code
sfh2123_sales_AT %>%
  ggplot() +
    geom_boxplot(aes(x = num_rooms, y = sale_price, fill = factor(num_rooms))) +
    geom_smooth(aes(x = num_rooms, y = sale_price), method = "gam", color = "black") +
    scale_x_continuous(n.breaks = 20) +
    scale_y_continuous(labels = label_currency()) +
    labs(
      title = "Single Family Home Number of Rooms \nand Sale Price in Cook County, IL", 
      x = "Number of Rooms", 
      y = "Sale Price"
    ) + 
    theme(legend.position = "none", plot.title = element_text(hjust = 0.5))

Houses with more rooms tend to sell for more.

Code
sfh2123_sales_AT %>%
  ggplot() +
    geom_boxplot(aes(x = num_full_baths + num_half_baths * 0.5, y = sale_price, fill = factor(num_full_baths + num_half_baths * 0.5))) +
    geom_smooth(aes(x = num_full_baths + num_half_baths * 0.5, y = sale_price), method = "gam", color = "black") +
    scale_x_continuous(n.breaks = 20) +
    scale_y_continuous(labels = label_currency()) +
    labs(
      title = "Single Family Home Total Number of Baths \nand Sale Price in Cook County, IL", 
      x = "Total Number of Baths", 
      y = "Sale Price"
    ) + 
    theme(legend.position = "none", plot.title = element_text(hjust = 0.5))

Houses with more bathrooms tend to sell for more.

Code
sfh2123_sales_AT %>%
  ggplot() +
    geom_boxplot(aes(x = cmap_walkability_total_score, y = sale_price, fill = factor(cmap_walkability_total_score))) +
    geom_smooth(aes(x = cmap_walkability_total_score, y = sale_price), method = "gam", color = "black") +
    scale_x_continuous(n.breaks = 20) +
    scale_y_continuous(labels = label_currency()) +
    labs(
      title = "Single Family Home CMAP Walkability and \nSale Price in Cook County, IL", 
      x = "Total Walkability Score", 
      y = "Sale Price"
    ) + 
    theme(legend.position = "none", plot.title = element_text(hjust = 0.5))

Houses in less walkable communities tend to sell for more, even though one might expect individuals to prefer more walkable communities. The kind of large single family homes that tend to sell for more actually contribute toward reducing walkability, which is most likely why we see a downward trend.

Mapping Our Outcome Variable

Let us take a look at how our outcome variables are distributed spatially. Clustering might indicate that a spatial modeling technique is required.

Code
library(sf)

townships <- read_sf("https://gis.cookcountyil.gov/traditional/rest/services/politicalBoundary/MapServer/3/query?outFields=*&where=1%3D1&f=geojson") %>%
  filter(NAME %in% c("LEMONT", "PALOS", "ORLAND")) %>%
  left_join(
    sfh2123_sales_AT %>%
      group_by(township_code) %>%
      summarize(
        median_sp = median(sale_price), 
        median_spr = median(sale_price_ratio))  %>%
      transmute(median_sp, median_spr, NAME = case_when(
        township_code == "19" ~ "LEMONT",
        township_code == "30" ~ "PALOS",
        township_code == "28" ~ "ORLAND",
      )
    )
  )

sfh2123_sales_sf <- sfh2123_sales_AT %>% 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) # WGS84 lon/lat, as leaflet expects
Code
library(leaflet)

qpal1 <- colorQuantile("Greens", sfh2123_sales_sf$sale_price, n = 5)

sfh2123_sales_sf %>%
  leaflet() %>%
    addProviderTiles(providers$CartoDB.Positron) %>% 
    addPolygons(
      data = townships,
      fillColor = "black",
      fillOpacity = 0.1,
      color = "black",
      weight = 2,
      opacity = 0.5,
      highlightOptions = highlightOptions(
        fillOpacity = 0.2,
        weight = 3),
      label = sprintf(
        "<strong>%s</strong><br>Median Sale: %s<br/>",
        townships$NAME,
        label_currency()(townships$median_sp)) %>% lapply(htmltools::HTML),
      labelOptions = labelOptions(
          style = list("font-weight" = "normal", padding = "3px 8px"),
          textsize = "12px",
          direction = "auto")
    ) %>%
    addCircleMarkers(
      radius = 3,
      fillColor = ~ qpal1(sale_price),
      fillOpacity = 0.7,
      stroke = FALSE,
      label = sprintf(
        "<strong>Sale Price:</strong> %s",
        label_currency()(sfh2123_sales_sf$sale_price)) %>% lapply(htmltools::HTML),
      labelOptions = labelOptions(
          style = list("font-weight" = "normal", padding = "3px 8px"),
          textsize = "12px",
          direction = "auto")
    ) %>%
    addLegend(
      pal = qpal1,
      values = sfh2123_sales_sf$sale_price,
      opacity = 0.7,
      title = "Sale Price of Single <br>Family Homes in Cook <br>County",
      position = "topright",
      na.label = "Insufficient Data",
      labFormat = function(type, cuts, p) {
        n = length(cuts)
        p = str_c(round(p * 100), '%')
        cuts = str_c(label_currency()(cuts[-n]), " - ", label_currency()(cuts[-1]))
        str_c('<span title="', p[-n], " - ", p[-1], '">', cuts, '</span>')
      })

It seems that there is clustering of similarly priced homes within our three townships. Lemont has the highest median sale price, while Palos and Orland have similar median sale prices. Perhaps a spatial modeling technique like a spatial Bayesian or conditional autoregressive model is warranted.

Code
library(spdep)

knn <- knearneigh(sfh2123_sales_sf, k = 20)
nb <- knn2nb(knn)
weights <- nb2listw(nb, style = "B")

moran.test(x = sfh2123_sales_sf$sale_price, listw = weights, zero.policy = TRUE) %>% 
  broom::tidy() %>% 
  select(estimate1, p.value, method) %>%
  rename(`Moran I statistic` = estimate1) %>%
  knitr::kable()
Moran I statistic  p.value  method
        0.4998853        0  Moran I test under randomisation

A Moran's I of roughly 0.50 with a p-value of effectively zero indicates strong positive spatial autocorrelation: similarly priced homes do indeed cluster together.

Processing Our Data

Code
sfh2123_sales_data <- sfh2123_sales_AT %>%
  group_by(pin) %>%
  slice_tail() %>%          # keep one row (the last) per parcel
  filter(year != 2023) %>%  # drop 2023 sales
  mutate(month = str_extract(sale_date, "^([:alpha:]*)(?= )")) # month name from sale_date

Getting ACS Data

Code
library(tidycensus)

acs2022vars <- load_variables(2022, "acs5")
tables <- list("B01001", "B15003")

acs2022vars <- acs2022vars %>%
  mutate(table = str_sub(name, 1, 7)) %>%
  filter(table %in% str_c(tables, "_")) %>% 
  mutate(table = str_sub(table, 1, 6)) %>% 
  mutate(name = str_c("estimate_", name)) %>%
  mutate(label = str_sub(label, 11)) %>%
  separate_wider_delim(label, "!!", names = c("1st", "2nd", "3rd", "4th", "5th", "6th"), too_few = "align_start")

ACS_data <- map_df(tables,
    \(table) get_acs(geography = "tract", state ="IL", county = "Cook", table = table, year = 2022)
  ) %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe))

# row-wise sum of the selected ACS estimate columns (one total per tract)
get_var_rowsums <- function (ACS_data, var_df) {
  ACS_data %>%
    select(.,
      var_df %>%
        select(name) %>%
        unlist() %>%
        matches()
    ) %>% rowSums()
}

processed_ACS_data <- ACS_data %>%
  mutate(
    TotalPop = get_var_rowsums(.,
      acs2022vars %>%
        filter(
          table == "B01001",
          `1st` == "Total:",
          is.na(`2nd`)
        )
      ),
    Age00_17 = get_var_rowsums(.,
      acs2022vars %>%
        filter(
          table == "B01001",
          `3rd` %in% c("Under 5 years", "5 to 9 years", "10 to 14 years", "15 to 17 years"),
          is.na(`4th`)
        )
      ),
    Age65p = get_var_rowsums(.,
      acs2022vars %>%
        filter(
          table == "B01001",
          `3rd` %in% c("65 and 66 years", "67 to 69 years", "70 to 74 years", "75 to 79 years", "80 to 84 years", "85 years and over"),
          is.na(`4th`)
        )
      ),
    Pct00_17 = Age00_17 / TotalPop,
    Pct65p = Age65p / TotalPop,
    EduHS = get_var_rowsums(.,
      acs2022vars %>%
        filter(
          table == "B15003",
          `2nd` %in% c("Regular high school diploma", "GED or alternative credential", "Some college, less than 1 year", "Some college, 1 or more years, no degree"),
          is.na(`3rd`)
        )
      ),
    TotalEdu = get_var_rowsums(.,
      acs2022vars %>%
        filter(
          table == "B15003",
          `1st` == "Total:",
          is.na(`2nd`)
        )
      ),
    EduAS = get_var_rowsums(.,
      acs2022vars %>%
        filter(
          table == "B15003",
          `2nd` == "Associate's degree",
          is.na(`3rd`)
        )
      ),
    EduBA = get_var_rowsums(.,
      acs2022vars %>%
        filter(
          table == "B15003",
          `2nd` == "Bachelor's degree",
          is.na(`3rd`)
        )
      ),
    EduGA = get_var_rowsums(.,
      acs2022vars %>%
        filter(
          table == "B15003",
          `2nd` %in% c("Master's degree", "Professional school degree", "Doctorate degree"),
          is.na(`3rd`)
        )
      ),
    PctEdHS = EduHS / TotalEdu,
    PctEduAS = EduAS / TotalEdu,
    PctEduBA = EduBA / TotalEdu,
    PctEduGA = EduGA / TotalEdu
  ) %>%
  select(GEOID, starts_with("Pct"))
Code
sales_data <- sfh2123_sales_data %>%
  left_join(processed_ACS_data, by = c("census_tract_geoid" = "GEOID"))
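
Since this join keys on census tract GEOIDs, it is worth confirming the sales actually matched a tract. A minimal sketch (ungroup() is needed because sfh2123_sales_data is still grouped by pin):

Code
# Count sales that failed to match an ACS tract
sales_data %>%
  ungroup() %>%
  summarize(n_unmatched = sum(is.na(Pct00_17)), n_total = n())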

Evaluating Different Models

Code
library(tidymodels)
tidymodels_prefer()

Creating Test/Training Data

Code
set.seed(123) 
sales_split <- initial_split(sales_data, strata = sale_price)
sales_train <- training(sales_split)
sales_test  <- testing(sales_split)

set.seed(234) 
sales_resamples <- vfold_cv(sales_train, v = 5, strata = sale_price)

Creating Our Recipes

Code
library(rules)

default_recipe <- sales_data %>%    
  recipe(sale_price ~ year_built + building_sqft + land_sqft + num_bedrooms + num_rooms + num_full_baths + num_half_baths + num_fireplaces + type_of_residence + construction_quality + attic_finish + garage_size + ext_wall_material + basement_type + central_heating + roof_material + porch + central_air + neighborhood_code + school_elementary_district_name + Pct00_17 + Pct65p + PctEdHS + PctEduAS + PctEduBA + PctEduGA, data = .) %>%    
  step_string2factor(all_string()) %>%    
  step_impute_knn(all_predictors()) %>%   
  step_novel(all_nominal_predictors()) %>%   
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

normalized_recipe <- default_recipe %>%
  step_normalize(all_predictors())
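
Before tuning, it can help to confirm the recipe runs end to end. A minimal sketch that estimates the recipe on the training set and peeks at the processed predictors (not required for the workflow_map below; the KNN imputation may take a moment):

Code
# Prep the recipe on the training data and inspect the baked result
default_recipe %>%
  prep(training = sales_train) %>%
  bake(new_data = NULL) %>%
  glimpse()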

Creating Our Models

Code
linear_reg_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet")

rf_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

xgb_spec <- 
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), min_n = tune(), sample_size = tune(), trees = tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

cubist_spec <- 
   cubist_rules(committees = tune(), neighbors = tune()) %>% 
   set_engine("Cubist") 

Creating Our Workflows

Code
normalized <- workflow_set(
  preproc = list(normalized = normalized_recipe),
  models = list(linear_reg = linear_reg_spec))

no_pre_proc <- workflow_set(
  preproc = list(simple = default_recipe), 
  models = list(RF = rf_spec, boosting = xgb_spec, Cubist = cubist_spec))
Code
all_workflows <- bind_rows(no_pre_proc, normalized)
Code
library(finetune)

grid_results <- all_workflows %>%
  workflow_map(
    "tune_race_anova",
    seed = 345,
    resamples = sales_resamples,
    grid = 20,
    control = control_race(
      save_pred = FALSE,
      save_workflow = FALSE,
      parallel_over = "everything"),
    verbose = TRUE)

Plotting Candidate Sets

Code
grid_results %>% 
  rank_results() %>%
  filter(.metric == "rmse") %>% 
  select(model, .config, rmse = mean, rank) %>%
  knitr::kable()
model .config rmse rank
boost_tree Preprocessor1_Model06 99370.41 1
boost_tree Preprocessor1_Model03 99735.99 2
boost_tree Preprocessor1_Model12 99814.43 3
cubist_rules Preprocessor1_Model11 100449.81 4
linear_reg Preprocessor1_Model02 100467.61 5
linear_reg Preprocessor1_Model09 100468.87 6
linear_reg Preprocessor1_Model06 100469.72 7
linear_reg Preprocessor1_Model04 100471.87 8
linear_reg Preprocessor1_Model05 100473.77 9
linear_reg Preprocessor1_Model14 100495.85 10
linear_reg Preprocessor1_Model17 100509.72 11
linear_reg Preprocessor1_Model07 100596.77 12
linear_reg Preprocessor1_Model03 100607.02 13
linear_reg Preprocessor1_Model08 100609.04 14
linear_reg Preprocessor1_Model10 100618.31 15
linear_reg Preprocessor1_Model01 100619.02 16
linear_reg Preprocessor1_Model11 100623.54 17
linear_reg Preprocessor1_Model12 100623.87 18
linear_reg Preprocessor1_Model13 100625.73 19
linear_reg Preprocessor1_Model20 100626.77 20
linear_reg Preprocessor1_Model15 100636.52 21
linear_reg Preprocessor1_Model18 100639.10 22
linear_reg Preprocessor1_Model16 100639.61 23
linear_reg Preprocessor1_Model19 100639.78 24
cubist_rules Preprocessor1_Model18 100685.93 25
boost_tree Preprocessor1_Model17 100872.65 26
cubist_rules Preprocessor1_Model10 101008.70 27
rand_forest Preprocessor1_Model10 101067.44 28
cubist_rules Preprocessor1_Model05 101168.22 29
rand_forest Preprocessor1_Model20 101281.61 30
cubist_rules Preprocessor1_Model13 101345.31 31
rand_forest Preprocessor1_Model04 101447.32 32
rand_forest Preprocessor1_Model06 101563.37 33
rand_forest Preprocessor1_Model08 101605.92 34
rand_forest Preprocessor1_Model09 101621.71 35
rand_forest Preprocessor1_Model15 101802.36 36
rand_forest Preprocessor1_Model13 101962.18 37
cubist_rules Preprocessor1_Model15 102245.67 38
rand_forest Preprocessor1_Model07 102303.67 39
cubist_rules Preprocessor1_Model08 102579.22 40
rand_forest Preprocessor1_Model11 102676.19 41
cubist_rules Preprocessor1_Model03 102720.11 42
cubist_rules Preprocessor1_Model02 102905.93 43
rand_forest Preprocessor1_Model12 102954.36 44
rand_forest Preprocessor1_Model05 103011.92 45
rand_forest Preprocessor1_Model19 103225.46 46
rand_forest Preprocessor1_Model17 103241.71 47
rand_forest Preprocessor1_Model02 103569.38 48
boost_tree Preprocessor1_Model14 103886.69 49
rand_forest Preprocessor1_Model01 103956.00 50
boost_tree Preprocessor1_Model07 104158.37 51
rand_forest Preprocessor1_Model16 104162.49 52
boost_tree Preprocessor1_Model16 104173.77 53
boost_tree Preprocessor1_Model20 104218.78 54
boost_tree Preprocessor1_Model15 104251.63 55
boost_tree Preprocessor1_Model10 104264.46 56
boost_tree Preprocessor1_Model01 104790.26 57
boost_tree Preprocessor1_Model09 104876.53 58
rand_forest Preprocessor1_Model18 104881.21 59
rand_forest Preprocessor1_Model03 104989.63 60
Code
autoplot(
  grid_results,
  rank_metric = "rmse",
  metric = "rmse", 
  select_best = TRUE
 ) +
  geom_text(aes(y = mean - 1/2, label = wflow_id), angle = 90, vjust = -0.5) +
  theme(legend.position = "none")

Select Best Model

Code
best_wf <- grid_results %>% 
  rank_results() %>%
  filter(.metric == "rmse", rank == 1) %>%
  pull(wflow_id)

best_results <- grid_results %>%
  extract_workflow_set_result(best_wf) %>% 
  select_best(metric = "rmse") 

best_results_fit <- 
  grid_results %>% 
  extract_workflow(best_wf) %>%
  finalize_workflow(best_results) %>% 
  last_fit(split = sales_split)

Check Predicted vs. Observed

Code
best_results_fit %>% 
  collect_predictions() %>% 
  ggplot(aes(x = sale_price, y = .pred)) + 
  geom_abline(color = "gray50", lty = 2) + 
  geom_point(alpha = 0.5) + 
  coord_obs_pred() + 
  labs(x = "observed", y = "predicted")

Further Tune Selected Model

Code
bayes_results <- grid_results %>% 
  extract_workflow(best_wf) %>%
  tune_bayes(
    resamples = sales_resamples,
    initial = 6,
    iter = 25,
    control = control_bayes(verbose = TRUE)
  )

Finalize Model

Code
bayes_results %>%
  show_best(metric = "rmse") %>% 
  knitr::kable()
trees  min_n  tree_depth  learn_rate  loss_reduction  sample_size  .metric  .estimator       mean  n    std_err  .config               .iter
 1531     13           2   0.0145856       0.0000095    0.2282499  rmse     standard     98862.69  5  10018.769  Preprocessor1_Model2      0
  402     13           5   0.0121933      24.3073101    0.1917696  rmse     standard    100612.56  5  10541.353  Iter10                   10
  833     35          10   0.0055546       6.5822950    0.4633506  rmse     standard    103436.17  5  10745.935  Preprocessor1_Model6      0
 1516     29           1   0.0153796      30.9923970    0.7996068  rmse     standard    103598.27  5   8617.141  Iter2                     2
 1224     33           8   0.0090024       0.0005782    0.2168532  rmse     standard    104194.76  5  10192.453  Iter1                     1

Final Fit

Code
final_fit <- grid_results %>% 
  extract_workflow(best_wf) %>%
  finalize_workflow(select_best(bayes_results, metric = "rmse")) %>% 
  last_fit(split = sales_split)

MAPE and RMSE

Code
sales_preds <- final_fit %>% 
  pull(.predictions) %>% 
  pluck(1)

sales_metrics <- list(   
  sales_preds %>%     
    rmse(sale_price, .pred),   
  sales_preds %>%     
    mape(sale_price, .pred)) 

sales_metrics %>% bind_rows() %>% knitr::kable()
.metric  .estimator  .estimate
rmse     standard    93988.148
mape     standard       16.027

Note that mape() is reported in percent, so the final model's predictions deviate from the observed sale price by about 16% on average.

Conclusion

Over the course of the Cook County assessment project, I have tested several different models. All of these models exhibited RMSEs in excess of 90,000, meaning that the typical prediction deviated from the true sale price by something on the order of $90,000. Although the predictions seem quite awful, I think they point to the fact that house value prediction is a difficult problem. There is a multitude of factors that the assessor is barred from using to predict house values, and a large number of factors are not easily translated into machine learning features (e.g., housing aesthetics, human psychology, etc.).

Compared to the assessor's model, my model seems to have a similar RMSE; however, the assessor's median assessment tends to be less than 85% of the sale price, so the assessor's model consistently underassesses the true value of the homes. Both my model and the assessor's are variants of gradient boosted machines. LightGBM and XGBoost often yield similar predictions; however, LightGBM is often considerably faster to train.

There are quite a few ethical concerns that come with assessing home values through the use of machine learning. As we discussed in Part 1, housing assessments, even for 2023, exhibit regressivity (i.e., lower value homes are assessed at a higher proportion of their true sale price). This can be hugely detrimental to the residents of these homes because it has the effect of imposing the highest effective tax rate on those with the least. We can see from the chart below that my model also exhibits a high degree of regressivity.

Additionally, machine learning cannot capture the many human aspects of a home's price; distilling the factors that affect a home's price into features usable by a machine learning algorithm remains a real technical challenge.

Code
library(cmfproperty)

ratios <- grid_results %>% 
  extract_workflow(best_wf) %>%
  finalize_workflow(select_best(bayes_results, metric = "rmse")) %>%
  fit(sales_train) %>%
  augment(sales_test) %>%
  select(pin, year, sale_price, .pred) %>%
  reformat_data(
    sale_col = "sale_price",
    assessment_col = ".pred",
    sale_year_col = "year")

binned <- binned_scatter(ratios,
    min_reporting_yr = 2021,
    max_reporting_yr = 2023,
    jurisdiction_name = "Cook County, IL")

binned[[2]]

[1] "Filtered out non-arm's length transactions"
[1] "Inflation adjusted to 2021"

Overall, I do not think that my model is quite as effective as the assessor's, and it exhibits a high degree of regressivity. The model could be improved with more years of data and by refining additional measures it does not currently capture. I also wonder whether a more comprehensive model including data from all of the Cook County townships would produce better predictions due to the larger amount of training data. At this point, I would not recommend that the assessor use my model.

References

[1]
“How Properties are Valued,” Cook County Assessor’s Office. Accessed: Feb. 17, 2024. [Online]. Available: https://www.cookcountyassessor.com/how-properties-are-valued