Final Project

Author

Ryan Zomorrodi

Published

April 27, 2024

Introduction

The following project is a replication attempt of the machine learning model described in Huynh et al. [1], which predicts lead exposure from drinking water within the city of Chicago. Lead-contaminated drinking water at the block level was defined as a binary variable indicating whether the majority of tests within a block have a lead concentration of at least 1 ppb. According to their model, approximately 75% of blocks are estimated to have lead-contaminated drinking water. The strongest predictors of lead-contaminated drinking water were geographic area, block population, and number of buildings.
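
To make that outcome definition concrete, here is a minimal sketch with made-up test values (not the actual DWM data); the real computation appears later in the processing code.

Code
# Toy illustration of the block-level outcome: a block is flagged as
# lead-contaminated when at least half of its tests are >= 1 ppb.
# The values below are invented for illustration only.
library(dplyr)

tibble(
    block    = c("A", "A", "A", "B", "B"),
    lead_ppb = c(2.3, 0.4, 1.1, 0.2, 0.9)
) %>%
    group_by(block) %>%
    summarise(lead_contaminated = mean(lead_ppb >= 1) >= 0.5)
# block A: TRUE (2 of 3 tests at or above 1 ppb); block B: FALSE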

Huynh et al. (and by extension, I) used the following data sources:

Data Sources

Each measure below is listed with the geographic extent at which it is available.

City of Chicago Department of Water Management Lead Test Data

  • Consecutive lead tests (ppb): anonymized to the block

Census

  • Block FIPS: block
  • Population (#): block
  • Race/ethnicity (#) for AIAN, Asian, Black, Hispanic, and White residents: block

American Community Survey

  • Block FIPS: block
  • Block group FIPS: block group
  • Population (#): block group, tract
  • Race/ethnicity (#) for AIAN, Asian, Black, Hispanic, and White residents: block group, tract
  • Housing units (#): block group
  • Median house value ($): block group
  • Upper house value ($): block group
  • Lower house value ($): block group
  • Median homeowner costs ($): block group
  • Education (#) for high school, GED, <1 year of college, >1 year of college, associate’s degree, bachelor’s degree, master’s degree, professional school, and doctoral degree: block group
  • Poverty (#): block group
  • English-only speakers (#): block group
  • Computer access (#): block group
  • Internet access (#): block group
  • Complete plumbing facilities (#): block group
  • Vacant housing (#): block group
  • Owner-occupied (#): block group
  • Children under 5 (#): block group
  • Children under 10 (#): block group
  • Children under 18 (#): block group

Chicago Building Footprints

  • Median building age (years): block
  • Buildings (#): block
  • Max building age (years): block
  • Mean building age (years): block
  • Built after 1986 (#): block

Chicago Health Atlas

  • Community area: community area
  • Lead poisoning rate (%): community area
  • Historical lead poisoning rate (%): community area
  • Economic diversity index: tract
  • Hardship index: tract
  • Social vulnerability index: tract
  • Major crime (#): tract
  • Eviction rate (%): tract
  • Fine particulate matter concentration (\(\mu\text{g/m}^3\)): tract
  • Access to food (%): tract
  • Uninsured rate (%): tract
  • Per capita income ($): tract
  • Cognitive difficulty (%): tract
  • Disability (%): tract
  • Crowded housing (%): tract
  • Rent burdened (%): tract
  • Vacant housing (%): tract

Data

Libraries

Code
library(bonsai)
library(corrr)
library(doMC)
library(finetune)
library(httr2)
library(jsonlite)
library(leaflet)
library(magrittr)
library(mice)
library(readxl)
library(sf)
library(themis)
library(tidycensus)
library(tidymodels)
library(tidyverse)
library(tigris)

Chicago Boundaries

Download

Download boundaries at various levels: the city, community areas, and census blocks.

Code
chicagoBoundaries <- st_read("https://data.cityofchicago.org/api/geospatial/qqq8-j68g?fourfour=qqq8-j68g&cacheBust=1712775952&date=20240411&accessType=DOWNLOAD&method=export&format=GeoJSON") %>%
    st_transform("EPSG:4269")

chicagoCommunityAreas <- st_read("https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON") %>%
    st_transform("EPSG:4269")

cookBlocks <- blocks(state = "IL", county = "Cook") 

Process

Use a spatial intersection test to subset Cook County census blocks to those within the Chicago boundary, then derive complete block, block group, and tract GEOIDs from each block GEOID.

Code
chicagoBlocks <- cookBlocks %>%
    filter(st_intersects(., chicagoBoundaries, sparse = FALSE) %>% unlist()) %>%
    filter(!str_detect(TRACTCE20, "^9900")) %>% # drop water-only (9900-series) tracts
    mutate(GEOID_blk = GEOID20) %>% # 15-digit block GEOID: state (2) + county (3) + tract (6) + block (4)
    mutate(GEOID_blkGrp = str_sub(GEOID_blk, 1, -4)) %>% # first 12 digits identify the block group
    mutate(GEOID_tract = str_sub(GEOID_blk, 1, -5)) %>% # first 11 digits identify the tract
    select(starts_with("GEOID_"))

American Community Survey

Download

Get metadata for all of the relevant decennial census and American Community Survey variables.

Code
census_metadata <- 
    bind_rows(
        load_variables(2022, "acs5"), 
        load_variables(2020, "pl"))

blk_vars <- c(
    race_blk_total = "P2_001N",
    race_blk_aianNH = "P2_007N",
    race_blk_asianNH = "P2_008N",
    race_blk_blackNH = "P2_006N",
    race_blk_hispanic = "P2_002N",
    race_blk_whiteNH = "P2_005N"
)

blkGrp_vars <- c(
    race_blkGrp_total = "B03002_001",
    race_blkGrp_aianNH = "B03002_005",
    race_blkGrp_asianNH = "B03002_006",
    race_blkGrp_blackNH = "B03002_004",
    race_blkGrp_hispanic = "B03002_012",
    race_blkGrp_whiteNH = "B03002_003",
    housingUnits = "B25001_001",
    lowerValue = "B25076_001",
    medValue = "B25077_001",
    upperValue = "B25078_001",
    homeownerCost = "B25088_001",
    education_total = "B15003_001",
    education_highSchool = "B15003_017",
    education_ged = "B15003_018",
    education_lt1yCollege = "B15003_019",
    education_mt1yCollege = "B15003_020",
    education_associate = "B15003_021",
    education_bachelor = "B15003_022",
    education_master = "B15003_023",
    education_professional = "B15003_024",
    education_doctorate = "B15003_025",
    poverty_total = "B17010_001",
    poverty_belowPovertyLvl = "B17010_002", 
    language_total = "B99162_001",
    language_onlyEnglish = "B99162_002",
    computer_total = "B28003_001",
    computer_hasAComp = "B28003_002",
    internet_total = "B28011_001",
    internet_noInternet = "B28011_008",
    plumbing_total = "B25047_001",
    plumbing_complete = "B25047_002",
    vacancy_total = "B25002_001",
    vacancy_vacant = "B25002_003",
    occupied_total = "B25003_001",
    occupied_owner = "B25003_002",
    age_total = "B01001_001",
    age_maleU5 = "B01001_003",
    age_male5to9 = "B01001_004",
    age_male10to14 = "B01001_005",
    age_male15to17 = "B01001_006",
    age_femaleU5 = "B01001_027",
    age_female5to9 = "B01001_028",
    age_female10to14 = "B01001_029",
    age_female15to17 = "B01001_030")

tract_vars <- c(
    race_tract_total = "B03002_001",
    race_tract_aianNH = "B03002_005",
    race_tract_asianNH = "B03002_006",
    race_tract_blackNH = "B03002_004",
    race_tract_hispanic = "B03002_012",
    race_tract_whiteNH = "B03002_003",
    native_total = "B05002_001",
    native_nativeBrn = "B05002_002")

select_vars_metadata <- census_metadata %>%
    filter(name %in% c(blk_vars, blkGrp_vars, tract_vars))

Download decennial census data at the block level and American Community Survey data at the block group and tract levels for all of the relevant variables.

Code
blk_data <- get_decennial(geography = "block", variables = blk_vars, state = "IL", county = "Cook", output = "wide")
blkGrp_data <- get_acs(geography = "block group", variables = blkGrp_vars, state = "IL", county = "Cook", output = "wide") 
tract_data <- get_acs(geography = "tract", variables = tract_vars, state = "IL", county = "Cook", output = "wide") 

Process

Join the datasets and impute missing block group values using multiple imputation by chained equations with random forest as the regression model. Calculate total youth demographics by adding the male and female estimates.

Code
chicagoACS <- blk_data %>%
    select(-NAME) %>%
    rename(GEOID_blk = GEOID) %>%
    mutate(GEOID_blkGrp = str_sub(GEOID_blk, 1, -4)) %>%
    mutate(GEOID_tract = str_sub(GEOID_blk, 1, -5)) %>%
    relocate(GEOID_blkGrp, GEOID_tract, .after = GEOID_blk) %>%
    left_join(
        blkGrp_data %>%
            select(-ends_with("M"), -NAME) %>%
            mice(meth = "rf", seed = 123) %>%
            complete() %>%
            as_tibble(),
        join_by(GEOID_blkGrp == GEOID)) %>%
    left_join(
        tract_data %>%
            select(-NAME), 
        join_by(GEOID_tract == GEOID)) %>%
    mutate(
        age_U5E = age_maleU5E + age_femaleU5E,
        age_5to9E = age_male5to9E + age_female5to9E,
        age_10to17E = age_male10to14E + age_female10to14E + age_male15to17E + age_female15to17E) %>%
    select(-contains("male"), -ends_with("M")) %>%
    right_join(
        chicagoBlocks %>%
            st_drop_geometry()
    )

Chicago Building Footprints

Download

Download Chicago Building Footprints dataset to identify the age and number of buildings.

Code
chicagoBuildings <- st_read("https://data.cityofchicago.org/api/geospatial/syp8-uezg?fourfour=syp8-uezg&cacheBust=1712775954&date=20240414&accessType=DOWNLOAD&method=export&format=GeoJSON") %>%
    st_transform("EPSG:4269")

Process

Remove buildings with empty geometries or empty street names. Spatially join buildings to blocks, impute missing construction years with MICE (random forest), and summarize building characteristics at the block level.

Code
sf_use_s2(FALSE)

chicagoBuildingsImp <- chicagoBuildings %>%
    filter(!st_is_empty(.)) %>%
    filter(st_name1 != "") %>%
    mutate(year_built =
        case_when(
            year_built == 0 ~ NA,
            .default = as.integer(year_built)
        )      
    ) %>%
    select(year_built) %>%
    st_join(chicagoBlocks) %>%
    st_drop_geometry() %>%
    left_join(chicagoACS) %>%
    tibble() %>%
    mice(meth = "rf", seed = 234) %>%
    complete() %>%
    as_tibble()

chicagoBldBlk <- chicagoBuildingsImp %>% # geometry already dropped during the spatial join above
    group_by(GEOID_blk) %>%
    summarise(
        bldAge_median = 2024 - median(year_built),
        bld_number = n(),
        bldAge_max = 2024 - min(year_built),
        bldAge_mean = 2024 - mean(year_built),
        bldAge_nAft86 = sum(year_built > 1986)
    )

Chicago DWM Lead Tests

Download

Download the Department of Water Management lead testing data.

Code
pb_test_path <- tempfile()
download.file("https://www.chicagowaterquality.org/DataFiles/wqContent/Results.xlsx", pb_test_path, mode="wb")

chicagoPbTest <- read_excel(path = pb_test_path, skip = 2, sheet = 1) 

Process

Join testing data to the building with the same address (avoiding geocoding). This yields a lower success rate in tying tests to blocks, but it was a shortcut necessary to complete this project within a reasonable time frame.

Code
chicagoBuildingAddresses <- chicagoBuildings %>%
    filter(!st_is_empty(.)) %>%
    filter(st_name1 != "") %>%
    st_join(chicagoBlocks) %>%
    st_drop_geometry() %>%
    as_tibble() %>%
    arrange(t_add1) %>%
    mutate(ID = row_number()) %>%
    mutate(
        Address = str_pad(t_add1, 2, pad = "0"),
        Address = str_c(str_sub(Address, end = -3), "XX"),
        Address = str_c(Address, pre_dir1, st_name1, st_type1, sep = " "))

chicagoPbBlk <- chicagoPbTest %>%
    filter(!is.na(`Sample Date`)) %>%
    filter(`1st Draw` != "See Follow-Up Sequential Sampling Table for Results") %>%
    mutate(across(c(`1st Draw`, `2/3 Min`, `5 Min`), ~
        case_when(
            str_detect(.x, "<") ~ NA,
            .default = .x
        )
    )) %>%
    mutate(across(c(`1st Draw`, `2/3 Min`, `5 Min`), ~
        case_when(
            is.na(.x) ~ TRUE,
            .default = FALSE
        ),
        .names = "{.col}_lt1"
    )) %>%
    mutate(ID = 
        map_int(Address, ~
            chicagoBuildingAddresses %>%
                filter(Address == .x) %>%
                slice( (n() + (n() %% 2))/2) %>% # pick the middle building on the matching hundred block
                pluck("ID") %>%
                {ifelse(length(.) == 0, NA, .)} # NA when no building shares the masked address
        )
    ) %>% 
    left_join(chicagoBuildingAddresses) %>%
    group_by(GEOID_blk) %>%
    summarize(PB_gt1Pct = 1 - mean(`1st Draw_lt1`), PB_nTests = n()) %>%
    mutate(PB_majoritygt1 = PB_gt1Pct >= 0.5)
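
For intuition on the address matching above: the constructed Address field collapses each building's street number to a hundred-block form ("24XX"-style), which is how the DWM results appear to report addresses. A tiny sketch with invented street numbers:

Code
# Illustration only: collapse a full street number to the "XX" hundred-block
# form used above (street numbers here are made up).
library(stringr)

street_no <- c("2437", "115", "8")
str_c(str_sub(str_pad(street_no, 2, pad = "0"), end = -3), "XX")
# "24XX" "1XX" "XX"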

Chicago Health Atlas

Download

Download relevant data at the tract and community area level.

Code
chicagoHAtract <- request("https://chicagohealthatlas.org/api/action/download/") %>%
    req_method("POST") %>%
    req_body_raw(r"[{"layer":"tract-2020","topics":"EDX-~2018-2022,HDX-~2018-2022,SVI-~2020,CZM-~2018-2022,EVR-~2018,PMC-~2023,LFA-~2019,UNS-~2018-2022,PCI-~2018-2022,DIV-~2018-2022,DIS-~2018-2022,HTJ-~2018-2022,RBU-~2018-2022,VAC-~2018-2022","state":"","within":"","regions":"","place_filters":"","errors":"se","population":false,"lat_long":false,"counties":false,"documentation":false,"format":"csv","insight":"","benchmark":false}]", "application/x-www-form-urlencoded") %>%
    req_perform() %>%
    resp_body_string() %>%
    {read_csv(., col_names = names(read_csv(., n_max = 0)), cols(GEOID = "c"), skip = 2)} %>%
    select(-Layer, -Name) %>%
    rename(GEOID_tract = GEOID) %>%
    drop_na(!ends_with("se"))


chicagoHAPbCA <- request("https://chicagohealthatlas.org/api/action/download/") %>%
    req_method("POST") %>%
    req_body_raw(r"[{"layer":"neighborhood","topics":"LDPP-~2023,LDPPH-~2016","state":"","within":"","regions":"","place_filters":"","errors":"se","population":false,"lat_long":false,"counties":false,"documentation":false,"format":"csv","insight":"","benchmark":false}]", "application/x-www-form-urlencoded") %>%
    req_perform() %>%
    resp_body_string() %>%
    {read_csv(., col_names = names(read_csv(., n_max = 0)), cols(GEOID = "c"), skip = 2)}

Spatially join the community area Chicago Health Atlas data to census blocks, assigning each block to the community area that covers it the most. Then join the tract-level data and drop blocks with missing values (ignoring the standard error columns).

Code
chicagoHAblk <- chicagoBlocks %>%
    st_join(
        left_join(chicagoHAPbCA, chicagoCommunityAreas, join_by(GEOID == area_numbe)) %>%
            st_as_sf(), 
        join = st_covered_by, 
        largest = TRUE) %>%
    rename(GEOID_CA = GEOID) %>%
    left_join(chicagoHAtract) %>%
    st_drop_geometry() %>%
    as_tibble() %>%
    select(-Layer, -Name, -community, -area, -shape_area, -perimeter, -area_num_1, -comarea_id, -comarea, -shape_len) %>%
    drop_na(!ends_with("se"))

Chicago Service Line Inventory

Download

Not currently implemented due to time constraints. This is an additional data source that I am contemplating using. I also felt uneasy about the prospect of sending many thousands of API calls to a server that probably would not be expecting them.

Code
resp <- chicagoBuildingAddresses %>% # buildings already joined to blocks, geometry dropped
    filter(year_built != 0) %>%
    slice(1:100) %>% 
    mutate(st_add = str_c(label_hous, pre_dir1, st_name1, st_type1, sep = " ")) %>%
    pull(st_add) %>% 
    map(~
        request("https://sli.chicagowaterquality.org/servicematerialsearch/api/") %>%
            req_method("POST") %>%
            req_body_raw(str_c(r"[{"type":"text","accNum":"-","addr":"]", .x, r"["}]"))
    ) %>%
    req_perform_parallel(on_error = "continue")

resp %>%
    resps_successes() %>%
    keep(map_lgl(., resp_has_body)) %>%
    resps_data(resp_body_string) %>% # keep raw JSON strings so fromJSON() below can parse them
    map(~ fromJSON(.x) %>% `[[`(1)) %>%
    map_df(bind_cols)

Joining and Conducting Final Processing

In the end, the following criteria were used to include blocks within the study:

  • Population greater than 0
  • Number of buildings greater than 0
  • Within tract included in the Chicago Health Atlas

This leaves us with:

  • 33,004 total blocks
  • 12,031 blocks with at least one prior DWM test
    • 79.9% of these tested blocks have a majority of first draw tests with detected lead levels greater than 1 ppb (re-derived in a short sketch after the processing code below)
Code
# Convert raw counts to percentages: divide every column matching the category prefix
# (except its *_totalE column) by that category's total, then drop the count columns.
pct_calculator <- function(data, category) {
    mutate(data, across(
        .cols = matches(str_c("^", category, "(.*)(?<!(total))E$"), perl = TRUE), 
        .fn = ~ .x / !!sym(str_c(category, "_totalE")),
        .names = "{.col}Pct")) %>%
    select(-matches(str_c("^", category, "(.*)(?<!(total))E$"), perl = TRUE))
}

data <- chicagoACS %>%
    left_join(chicagoPbBlk) %>%
    inner_join(chicagoBldBlk) %>%
    inner_join(chicagoHAblk) %>%
    select(-ends_with("se")) %>%
    relocate(GEOID_CA, PB_gt1Pct, PB_nTests, PB_majoritygt1, .after = GEOID_tract) %>%
    rename_with(.cols = starts_with("race_blk_"), .fn = ~ str_c(.x, "E")) %>%
    filter(race_blk_totalE > 0) %>%
    pct_calculator("race_blk") %>%
    pct_calculator("race_blkGrp") %>%
    pct_calculator("race_tract") %>%
    pct_calculator("education") %>%
    pct_calculator("poverty") %>%
    pct_calculator("computer") %>%
    pct_calculator("internet") %>%
    pct_calculator("plumbing") %>%
    pct_calculator("vacancy") %>%
    pct_calculator("occupied") %>%
    pct_calculator("age") %>%
    pct_calculator("language") %>%
    pct_calculator("native") %>%
    drop_na(-starts_with("PB")) %>%
    select(-matches("^(?!race)(.*)_totalE$", perl = TRUE)) %>%
    rename_with(.cols = starts_with("race"), ~ 
        str_replace(.x, "race", "population") %>%
            str_replace("_totalE", "E")
    )

data %>%
    summarize(across(everything(), ~ sum(is.na(.x)))) %>%
    View()
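
These inclusion counts can be re-derived from the assembled data; a short sketch (column names as created in the processing code above):

Code
# Sketch: recompute the block counts quoted above from `data`.
data %>%
    summarize(
        total_blocks     = n(),
        tested_blocks    = sum(!is.na(PB_nTests)),
        majority_gt1_pct = mean(PB_majoritygt1, na.rm = TRUE)
    )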

Exploratory Data Analysis

Tracts that test more tend to have a higher percent of water tests with greater than 1 PPB lead even when adjusting for population size.

Code
data %>%
    group_by(GEOID_tract) %>%
    summarize(
        n = sum(PB_nTests, na.rm = TRUE),
        gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE)/ n
    ) %>% 
    group_by(n, gt1pct) %>%
    summarize(nTract = n()) %>% 
    ggplot(aes(x = n, y = gt1pct, color = nTract, size = nTract)) +
        geom_point() +
        labs(
            title = "Number of Tests vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
            x = "Tests (#)",
            y = "Tests with Greater than 1 PPB Lead Detected (%)") +
        scale_color_gradient(low = "steelblue1", high = "black") +
        scale_y_continuous(labels = scales::percent) +
        theme(
            plot.title = element_text(hjust = 0.5),
            legend.position="none"
        )

Code
data %>%
    group_by(GEOID_tract) %>%
    summarize(
        nPer100People = 100 *sum(PB_nTests, na.rm = TRUE) / population_tractE[1],
        gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE)/  sum(PB_nTests, na.rm = TRUE)
    ) %>% 
    ggplot(aes(x = nPer100People, y = gt1pct)) +
        geom_point(color = "steelblue1") +
        labs(
            title = "Number of Tests per 100 Residents vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
            x = "Tests Per 100 Residents (#)",
            y = "Tests with Greater than 1 PPB Lead Detected (%)") +
        scale_y_continuous(labels = scales::percent) +
        theme(
            plot.title = element_text(hjust = 0.5),
            legend.position="none"
        )

We can see that residents on the North Side have disproportionately low positive lead rates compared to the rest of Chicago. This comes as no surprise. A prior analysis published in the Guardian found that “nine of the top 10 zip codes with the largest percentages of high test results were neighborhoods with majorities of Black and Hispanic residents.” [2]

Code
data %>%
    group_by(GEOID_blkGrp) %>%
    summarize(
        n = sum(PB_nTests, na.rm = TRUE),
        gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE) / n
    ) %>% 
    left_join(block_groups(state = "IL", county = "Cook", progress_bar = FALSE), join_by(GEOID_blkGrp == GEOID)) %>%
    st_as_sf() %$%
        {leaflet(.) %>%
        addProviderTiles(providers$CartoDB.Positron) %>% 
        addPolygons(
            fillColor = ~ colorNumeric("Blues", gt1pct)(gt1pct),
            fillOpacity = 0.7,
            weight = 0,
            highlightOptions = highlightOptions(
                color = "Black",
                fillOpacity = 1,
                bringToFront = T),
            label = str_glue_data(., "<strong>{GEOID_blkGrp}</strong><br>Percent of Tests: {round(gt1pct,2)}<br/>") %>% 
                    map(htmltools::HTML),
            labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "12px",
                direction = "auto")
        ) %>%
        addLegend(
            pal = colorNumeric("Blues", gt1pct),
            values = gt1pct,
            opacity = 0.7,
            title = "Tests with Greater than 1 PPB Lead <br> Detected (%) By Block Group",
            position = "topright",
            na.label = "Insufficient Data",
            labFormat = labelFormat(
                prefix = " ", suffix = " %",
                transform = function(x) 100 * x)
        )} %>%
        htmlwidgets::prependContent(htmltools::tags$style(type = "text/css", "div.info.legend.leaflet-control br {clear: both;}"))
Code
data %>%
    group_by(GEOID_tract) %>%
    summarize(
        n = sum(PB_nTests, na.rm = TRUE),
        gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE) / n
    ) %>% 
    left_join(tracts(state = "IL", county = "Cook", progress_bar = FALSE), join_by(GEOID_tract == GEOID)) %>%
    st_as_sf() %$%
        {leaflet(.) %>%
        addProviderTiles(providers$CartoDB.Positron) %>% 
        addPolygons(
            fillColor = ~ colorNumeric("Blues", gt1pct)(gt1pct),
            fillOpacity = 0.7,
            weight = 0,
            highlightOptions = highlightOptions(
                color = "Black",
                fillOpacity = 1,
                bringToFront = T),
            label = str_glue_data(., "<strong>{GEOID_tract}</strong><br>Percent of Tests: {round(gt1pct,2)}<br/>") %>% 
                    map(htmltools::HTML),
            labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "12px",
                direction = "auto")
        ) %>%
        addLegend(
            pal = colorNumeric("Blues", gt1pct),
            values = gt1pct,
            opacity = 0.7,
            title = "Tests with Greater than 1 PPB Lead <br> Detected (%) By Tract",
            position = "topright",
            na.label = "Insufficient Data",
            labFormat = labelFormat(
                prefix = " ", suffix = " %",
                transform = function(x) 100 * x)
        )} %>%
        htmlwidgets::prependContent(htmltools::tags$style(type = "text/css", "div.info.legend.leaflet-control br {clear: both;}"))
Code
data %>%
    group_by(GEOID_CA) %>%
    summarize(
        n = sum(PB_nTests, na.rm = TRUE),
        gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE) / n
    ) %>% 
    left_join(chicagoCommunityAreas, join_by(GEOID_CA == area_numbe)) %>%
    st_as_sf() %$%
        {leaflet(.) %>%
        addProviderTiles(providers$CartoDB.Positron) %>% 
        addPolygons(
            fillColor = ~ colorNumeric("Blues", gt1pct)(gt1pct),
            fillOpacity = 0.7,
            weight = 0,
            highlightOptions = highlightOptions(
                color = "Black",
                fillOpacity = 1,
                bringToFront = T),
            label = str_glue_data(., "<strong>{community}</strong><br>Percent of Tests: {round(gt1pct,2)}<br/>") %>% 
                    map(htmltools::HTML),
            labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "12px",
                direction = "auto")
        ) %>%
        addLegend(
            pal = colorNumeric("Blues", gt1pct),
            values = gt1pct,
            opacity = 0.7,
            title = "Tests with Greater than 1 PPB Lead <br> Detected (%) By Community Area",
            position = "topright",
            na.label = "Insufficient Data",
            labFormat = labelFormat(
                prefix = " ", suffix = " %",
                transform = function(x) 100 * x)
        )} %>%
        htmlwidgets::prependContent(htmltools::tags$style(type = "text/css", "div.info.legend.leaflet-control br {clear: both;}"))

The areas with the highest tests-per-population rates seem to fall within the community areas of Beverly, Hyde Park, Forest Glen, and Lincoln Square, but there does not seem to be a clear, discernible reason for the higher testing rates within these communities.

Code
data %>%
    group_by(GEOID_blkGrp) %>%
    summarize(
        nPer100People = 100 *sum(PB_nTests, na.rm = TRUE) / population_blkGrpE[1]
    ) %>% 
    left_join(block_groups(state = "IL", county = "Cook", progress_bar = FALSE), join_by(GEOID_blkGrp == GEOID)) %>%
    st_as_sf() %$%
        {leaflet(.) %>%
        addProviderTiles(providers$CartoDB.Positron) %>% 
        addPolygons(
            fillColor = ~ colorNumeric("Blues", nPer100People)(nPer100People),
            fillOpacity = 0.7,
            weight = 0,
            highlightOptions = highlightOptions(
                color = "Black",
                fillOpacity = 1,
                bringToFront = T),
            label = str_glue_data(., "<strong>{GEOID_blkGrp}</strong><br>Tests per 100 People: {round(nPer100People,2)}<br/>") %>% 
                    map(htmltools::HTML),
            labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "12px",
                direction = "auto")
        ) %>%
        addLegend(
            pal = colorNumeric("Blues", nPer100People),
            values = nPer100People,
            opacity = 0.7,
            title = "Tests per 100 People <br>By Block Group",
            position = "topright",
            na.label = "Insufficient Data"
        )} %>%
        htmlwidgets::prependContent(htmltools::tags$style(type = "text/css", "div.info.legend.leaflet-control br {clear: both;}"))
Code
data %>%
    group_by(GEOID_tract) %>%
    summarize(
        nPer100People = 100 *sum(PB_nTests, na.rm = TRUE) / population_tractE[1]
    ) %>% 
    left_join(tracts(state = "IL", county = "Cook", progress_bar = FALSE), join_by(GEOID_tract == GEOID)) %>%
    st_as_sf() %$%
        {leaflet(.) %>%
        addProviderTiles(providers$CartoDB.Positron) %>% 
        addPolygons(
            fillColor = ~ colorNumeric("Blues", nPer100People)(nPer100People),
            fillOpacity = 0.7,
            weight = 0,
            highlightOptions = highlightOptions(
                color = "Black",
                fillOpacity = 1,
                bringToFront = T),
            label = str_glue_data(., "<strong>{GEOID_tract}</strong><br>Tests per 100 People: {round(nPer100People,2)}<br/>") %>% 
                    map(htmltools::HTML),
            labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "12px",
                direction = "auto")
        ) %>%
        addLegend(
            pal = colorNumeric("Blues", nPer100People),
            values = nPer100People,
            opacity = 0.7,
            title = "Tests per 100 People <br>By Tract",
            position = "topright",
            na.label = "Insufficient Data"
        )} %>%
        htmlwidgets::prependContent(htmltools::tags$style(type = "text/css", "div.info.legend.leaflet-control br {clear: both;}"))

When looking at the percentage of tests with greater than 1 PPB lead at the tract level, there doesn’t seem to be a clear association with mean building age.

It is important to note that within Chicago, city code mandated that “any pipe 2 in or less in diameter connecting a home to the water system had to be lead” [3] until the 1986 amendment to the Safe Drinking Water Act prohibited the use of pipes, solder, or flux that were not lead-free in water systems intended for human consumption [4]. The amendment stopped short of requiring the removal of lead pipes already in the ground, which means that to this day Chicago still has over 400,000 lead service lines [5]. These lines can be identified quite accurately simply by looking at the age of a home. When looking at the percentage of tests with greater than 1 PPB lead at the tract level, it does appear to be almost bounded by the percent of buildings built before 1986.

Code
data %>%
    group_by(GEOID_tract) %>%
    summarize(
        bldAge_mean = sum(bldAge_mean * bld_number, na.rm = TRUE)/sum(bld_number, na.rm = TRUE),
        PB_gt1Pct = sum(PB_gt1Pct * PB_nTests, na.rm = TRUE)/sum(PB_nTests, na.rm = TRUE)
    ) %>% 
    ggplot(aes(x = bldAge_mean, y = PB_gt1Pct)) +
        geom_point(color = "steelblue1") +
        labs(
            title = "Mean Building Age vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
            x = "Mean Building Age (years)",
            y = "Tests with Greater than 1 PPB Lead Detected (%)") +
        scale_y_continuous(labels = scales::percent) +
        theme(
            plot.title = element_text(hjust = 0.5),
            legend.position="none"
        )

Code
data %>%
    group_by(GEOID_tract) %>%
    summarize(
        bldAge_nBfr86 = 1-((sum(bldAge_nAft86, na.rm = TRUE))/sum(bld_number, na.rm = TRUE)),
        PB_gt1Pct = sum(PB_gt1Pct * PB_nTests, na.rm = TRUE)/sum(PB_nTests, na.rm = TRUE)
    ) %>% 
    ggplot(aes(x = bldAge_nBfr86, y = PB_gt1Pct)) +
        geom_point(color = "steelblue1") +
        labs(
            title = "Percent of Buildings built after 1986 vs Percent of Tests with \nGreater than 1 PPB Lead Detected by Tract",
            x = "Buildings built after 1986 (%)",
            y = "Tests with Greater than 1 PPB Lead Detected (%)") +
        scale_x_continuous(labels = scales::percent) +
        scale_y_continuous(labels = scales::percent) +
        theme(
            plot.title = element_text(hjust = 0.5),
            legend.position="none"
        )

There seems to be a considerable amount of multicollinearity, so we should pick a model that accounts for that.

Code
data %>%
    select(where(is.numeric)) %>%
    select(where(function(x) sum(!is.na(x)) != 0)) %>%
    select(where(function(x) var(x, na.rm = TRUE) != 0)) %>% 
    correlate(method = "spearman") %>%
    autoplot()
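
To make "considerable" a bit more concrete, a sketch that lists the most strongly correlated variable pairs (same selection as above; the 0.9 cutoff is an arbitrary choice for illustration):

Code
# Sketch: variable pairs with |Spearman correlation| above an arbitrary 0.9 cutoff.
data %>%
    select(where(is.numeric)) %>%
    select(where(function(x) sum(!is.na(x)) != 0)) %>%
    select(where(function(x) var(x, na.rm = TRUE) != 0)) %>%
    correlate(method = "spearman") %>%
    shave() %>%               # keep each pair only once
    stretch(na.rm = TRUE) %>% # long format: x, y, r
    filter(abs(r) > 0.9) %>%
    arrange(desc(abs(r)))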

Modeling

Assessing Various Models

Code
tidymodels_prefer()

Creating Test/Training Data

I subset the modeling data to the community areas within the West Side to reduce the amount of time necessary to fit the models, and I will not fit a LightGBM model because of time constraints. I also set the number of tests to 5 for all blocks without testing data, so that the model predicts whether 3 or more out of 5 hypothetical tests would be positive for lead at or above 1 ppb.

Code
set.seed(345) 
model_fitting_data <- data %>%
    mutate(PB_nTests = 
        case_when(
            is.na(PB_nTests) ~ 5,
            .default = PB_nTests
        )
    ) %>%
    filter(!is.na(PB_majoritygt1)) %>%
    mutate(PB_majoritygt1 = as.factor(PB_majoritygt1)) %>%
    filter(GEOID_CA %in% as.character(23:31)) %>%
    select(-PB_gt1Pct, -GEOID_blk, -GEOID_blkGrp)

split <- initial_split(model_fitting_data, strata = PB_majoritygt1)
train <- training(split)
test  <- testing(split)

set.seed(456) 
resamples <- vfold_cv(train, v = 5, strata = PB_majoritygt1)

Creating Our Recipes

Code
default_recipe <- model_fitting_data %>%    
    recipe(PB_majoritygt1 ~ .) %>%   
    step_novel(all_nominal_predictors()) %>%   
    step_dummy(all_nominal_predictors()) %>%
    step_zv(all_predictors()) %>%
    step_smote(PB_majoritygt1)

normalized_recipe <- default_recipe %>%
    step_normalize(all_predictors())

Creating Our Models

Code
logistic_class_spec <- 
    logistic_reg(penalty = tune(), mixture = tune()) %>% 
    set_engine("glmnet") %>% 
    set_mode("classification")

# We're not going to use light GBM because it just takes too long to fit.
lightGBM_class_spec <- 
    boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), min_n = tune(), sample_size = tune(), trees = tune()) %>%
    set_engine("lightgbm") %>%
    set_mode("classification")

rf_class_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

Creating Workflow

Code
normalized <- workflow_set(
    preproc = list(normalized = normalized_recipe),
    models = list(logistic = logistic_class_spec))

no_pre_proc <- workflow_set(
    preproc = list(simple = default_recipe), 
    models = list(RF = rf_class_spec))
Code
all_workflows <- bind_rows(no_pre_proc, normalized)

Tune Model

I encountered some weird errors about conflicted function names, so I included some code to fix those issues.

Code
conflicted::conflicts_prefer(purrr::flatten)
conflicted::conflicts_prefer(purrr::set_names)

registerDoMC(cores = 8)

grid_results <- all_workflows %>%
  workflow_map(
    "tune_race_anova",
    seed = 567,
    resamples = resamples,
    grid = 30,
    control = control_race(
      save_pred = FALSE,
      save_workflow = FALSE,
      verbose = TRUE))

We can see that of our two models, the random forest model tends to yield the higher AUC.

Code
autoplot(grid_results)
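
The same comparison can also be read numerically; a sketch using rank_results() from workflowsets:

Code
# Sketch: rank the two workflows by cross-validated ROC AUC.
grid_results %>%
    rank_results(rank_metric = "roc_auc") %>%
    filter(.metric == "roc_auc") %>%
    select(wflow_id, mean, std_err, rank)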

Tune Best Model

Select Best Model

A known bug in the tune_bayes() function means that I have to pass the parameter range for mtry for the random forest manually instead of having it finalized automatically.

Code
conflicted::conflicts_prefer(dplyr::filter)

best_wf <- grid_results %>% 
    rank_results() %>%
    filter(.metric == "roc_auc", rank == 1) %>%
    pull(wflow_id)
Code
if (best_wf == "simple_RF") {
    rf_param <- grid_results %>% 
        extract_workflow(best_wf) %>%
        extract_parameter_set_dials() %>%
        update(mtry = 
            default_recipe %>%
                prep() %>%
                bake(train) %>%
                finalize(mtry(), .))
}

bayes_results <- grid_results %>% 
    extract_workflow(best_wf) %>%
    tune_bayes(
        resamples = resamples,
        initial = 20,
        iter = 25,
        param_info = if (best_wf == "simple_RF") rf_param else NULL,
        control = control_bayes(verbose = TRUE)
    )

Let’s look at some of the iterations. It seems like we’re getting AUC values similar to those reported in the study!

Code
bayes_results %>%
  show_best(metric = "roc_auc") %>% 
  knitr::kable()
mtry min_n .metric .estimator mean n std_err .config .iter
83 40 roc_auc binary 0.8121914 5 0.0178703 Iter11 11
82 40 roc_auc binary 0.8121807 5 0.0173749 Iter21 21
83 38 roc_auc binary 0.8119675 5 0.0181213 Iter14 14
86 40 roc_auc binary 0.8119672 5 0.0178646 Iter13 13
137 40 roc_auc binary 0.8116515 5 0.0175249 Iter1 1
Code
final_fit <- grid_results %>% 
    extract_workflow(best_wf) %>%
    finalize_workflow(select_best(bayes_results, metric = "roc_auc")) %>% 
    last_fit(split = split)
Code
final_fit %>%
    collect_predictions() %>%
    roc_curve(PB_majoritygt1, .pred_FALSE) %>%
    autoplot()

In our final fit, we got a slightly lower AUC, but not so low that I would start to suspect overfitting. I am quite happy with this model considering I am only working with a fraction of the data.

Code
final_fit %>% 
    collect_metrics() %>%
    knitr::kable()
.metric .estimator .estimate .config
accuracy binary 0.7417840 Preprocessor1_Model1
roc_auc binary 0.7957404 Preprocessor1_Model1

Due to the unbalanced nature of our data, the model is better at predicting blocks that have a majority of tests with greater than 1 ppb lead detected. The Synthetic Minority Over-sampling Technique (SMOTE) was used within the preprocessing to try to address this problem, but it seems that the problem remains.

Code
final_fit %>%
    collect_predictions() %>%
    conf_mat(PB_majoritygt1, .pred_class) %>%
    autoplot(type = "heatmap")

Unsurprisingly, the most important variables within our model are the number of homes built after 1986 and the percent of children ages 1-5 with blood lead levels at or above 5 micrograms per deciliter. Interestingly, the percent Asian non-Hispanic at the block, block group, and tract levels are all within the top ten most important variables; this would be something to look into in the future. The number of tests is also an important factor, which was expected, although it muddies some of the interpretability of the predictions.

Code
vip_rf_spec <- rand_forest(mtry = tune(), trees = 1000, min_n = tune()) %>% 
    set_engine("ranger", importance = "impurity") %>% 
    set_mode("classification")

vip_rf <- finalize_model(vip_rf_spec, select_best(bayes_results, metric = "roc_auc"))  
vip_wf <- workflow() %>%   
  add_recipe(default_recipe) %>%   
  add_model(vip_rf) 

vip_rf_fit <- vip_wf %>%    
  fit(train)

vip_rf_fit %>% 
    extract_fit_parsnip() %>% 
    vip::vip(geom = "point")

Conclusion

All in all, I am pretty happy with the end product of this project. Although I was not able to create a model for the entire city or utilize LightGBM, it seems that my model performs similarly to the model published in Huynh et al. [1], even if over a more limited area. I am definitely interested in looking further into this data, especially considering the work I put into combining all of these disparate data sources (hopefully once I gain access to some high-powered computing resources). Thank you for the semester!

References

[1]
B. Q. Huynh, E. T. Chin, and M. V. Kiang, “Estimated Childhood Lead Exposure From Drinking Water in Chicago,” JAMA pediatrics, p. e240133, Mar. 2024, doi: 10.1001/jamapediatrics.2024.0133.
[2]
E. McCormick, A. Uteuova, and T. Moore, with photographs by J. K. Davis, “Revealed: the ‘shocking’ levels of toxic lead in Chicago tap water,” The Guardian, Sep. 2022, Accessed: Apr. 27, 2024. [Online]. Available: https://www.theguardian.com/us-news/2022/sep/21/lead-contamination-chicago-tap-water-revealed
[3]
E. McCormick and A. Uteuova, “Profiting from poison: how the US lead industry knowingly created a water crisis,” The Guardian, Sep. 2022, Accessed: Apr. 28, 2024. [Online]. Available: https://www.theguardian.com/us-news/2022/sep/22/us-lead-industry-history-water-crisis
[4]
US EPA, “Use of Lead Free Pipes, Fittings, Fixtures, Solder, and Flux for Drinking Water.” Aug. 2015. Accessed: Apr. 27, 2024. [Online]. Available: https://www.epa.gov/sdwa/use-lead-free-pipes-fittings-fixtures-solder-and-flux-drinking-water
[5]
“Water Service Line Inventory.” Accessed: Apr. 28, 2024. [Online]. Available: https://sli.chicagowaterquality.org/