This project is an attempt to replicate the machine learning model described in Huynh et al. [1], which predicts lead exposure from drinking water in the city of Chicago. Lead-contaminated drinking water at the block level was defined as a binary variable indicating whether the majority of tests within a block had a lead concentration of at least 1 ppb. According to their model, approximately 75% of blocks are estimated to have lead-contaminated drinking water. The strongest predictors of lead-contaminated drinking water were geographic area, block population, and number of buildings.
Huynh et al. (and by extension, I) used the following data sources:
Data Sources

| Source | Measure | Extent |
|---|---|---|
| City of Chicago Department of Water Management Lead Test Data | Consecutive lead tests (ppb) | Anonymized to the block |
| Census | Block FIPS | Block |
| | Population (#) | Block |
| | Race/ethnicity (#): AIAN, Asian, Black, Hispanic, White | Block |
| American Community Survey | Block FIPS | Block |
| | Block group FIPS | Block group |
| | Population (#) | Block group, tract |
| | Race/ethnicity (#): AIAN, Asian, Black, Hispanic, White | Block group, tract |
| | Housing units (#) | Block group |
| | Median house value ($) | Block group |
| | Upper house value ($) | Block group |
| | Lower house value ($) | Block group |
| | Median homeowner costs ($) | Block group |
| | Education (#): high school, GED, <1 year of college, >1 year of college, associate's degree, bachelor's degree, master's degree, professional school, doctoral degree | Block group |
| | Poverty (#) | Block group |
| | English-only speakers (#) | Block group |
| | Computer access (#) | Block group |
| | Internet access (#) | Block group |
| | Complete plumbing facilities (#) | Block group |
| | Vacant housing (#) | Block group |
| | Owner-occupied (#) | Block group |
| | Children under 5 (#) | Block group |
| | Children under 10 (#) | Block group |
| | Children under 18 (#) | Block group |
| Chicago Building Footprints | Median building age (years) | Block |
| | Building (#) | Block |
| | Max age (years) | Block |
| | Mean age (years) | Block |
| | Built after 1986 (#) | Block |
| Chicago Health Atlas | Community area | Community area |
| | Lead poisoning rate (%) | Community area |
| | Historical lead poisoning rate (%) | Community area |
| | Economic diversity index | Tract |
| | Hardship index | Tract |
| | Social vulnerability index | Tract |
| | Major crime (#) | Tract |
| | Eviction rate (%) | Tract |
| | Fine particulate matter concentration (\(\mu\text{g/m}^3\)) | Tract |
Download boundaries at various levels: the city, community areas, and census blocks.
Code
```r
chicagoBoundaries <- st_read("https://data.cityofchicago.org/api/geospatial/qqq8-j68g?fourfour=qqq8-j68g&cacheBust=1712775952&date=20240411&accessType=DOWNLOAD&method=export&format=GeoJSON") %>%
  st_transform("EPSG:4269")

chicagoCommunityAreas <- st_read("https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON") %>%
  st_transform("EPSG:4269")

cookBlocks <- blocks(state = "IL", county = "Cook")
```
Process
Perform an intersection on census blocks to subset to those blocks that fall within the Chicago boundary. Mutate GEOIDs to create complete block, block group, and tract GEOIDs.
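The GEOID step can be sketched with toy data (column names `GEOID_tract` and `GEOID_blkGrp` are the ones I use later). A 15-digit block GEOID concatenates state (2 digits), county (3), tract (6), and block (4), and the first block digit identifies the block group, so the coarser identifiers are simply prefixes:

```r
# Toy sketch: derive tract and block-group GEOIDs as prefixes of the
# 15-digit block GEOID (state 2 + county 3 + tract 6 + block 4 digits).
blocks <- data.frame(GEOID = c("170318391001004", "170318391002011"))

blocks$GEOID_tract  <- substr(blocks$GEOID, 1, 11)  # state + county + tract
blocks$GEOID_blkGrp <- substr(blocks$GEOID, 1, 12)  # ... + block-group digit
```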
Download decennial census and American Community Survey data for all the relevant variables.
Code
```r
blk_data <- get_decennial(geography = "block", variables = blk_vars,
                          state = "IL", county = "Cook", output = "wide")

blkGrp_data <- get_acs(geography = "block group", variables = blkGrp_vars,
                       state = "IL", county = "Cook", output = "wide")

tract_data <- get_acs(geography = "tract", variables = tract_vars,
                      state = "IL", county = "Cook", output = "wide")
```
Process
Join the datasets and fill missing values with multiple imputation by chained equations (MICE), using random forest as the regression model. Calculate total youth demographics by adding the male and female counts.
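A minimal sketch of that imputation step, assuming the `mice` package and purely illustrative column names and toy values:

```r
library(mice)

# Toy sketch: impute missing covariates by chained equations, with random
# forest as the per-variable regression model (method = "rf").
set.seed(1)
joined <- data.frame(
  medianValue   = c(250000, NA, 310000, 180000, NA, 220000),
  under5_male   = c(12, 8, NA, 5, 9, 7),
  under5_female = c(10, NA, 6, 4, 11, 8)
)

imputed <- complete(mice(joined, method = "rf", m = 5, printFlag = FALSE))

# Total youth demographics: male + female counts.
imputed$under5 <- imputed$under5_male + imputed$under5_female
```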
Remove buildings with empty geometries or empty street names. Spatially join buildings to their blocks. Summarize building characteristics at the block level.
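That step can be sketched with `sf` and `dplyr`; the objects `buildings` and `chicagoBlocks` and the column names (`year_built`, `st_name`, and the summary names) are assumptions for illustration, not the exact ones in my pipeline:

```r
library(sf)
library(dplyr)

# Sketch (hypothetical names): drop unusable footprints, attach each
# building to the block containing it, then summarize per block.
blockBuildings <- buildings %>%
  filter(!st_is_empty(geometry), st_name != "") %>%
  st_join(chicagoBlocks, join = st_intersects) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarize(
    bld_number    = n(),
    bldAge_median = median(2024 - year_built, na.rm = TRUE),
    bldAge_mean   = mean(2024 - year_built, na.rm = TRUE),
    bldAge_max    = max(2024 - year_built, na.rm = TRUE),
    bldAge_nAft86 = sum(year_built > 1986, na.rm = TRUE)
  )
```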
Join testing data to the building with the same address (avoiding geocoding). This step yields a lower success rate of tying tests to blocks, but it was a shortcut necessary to complete this project within a reasonable time frame.
Spatially join census blocks to the community-area Chicago Health Atlas data by assigning each block to the community area that covers it the most. Join the tract data and drop the standard errors.
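The largest-overlap assignment can be sketched with `sf` (object names assumed from the earlier chunks):

```r
library(sf)

# Sketch: attach each block to the single community area that covers the
# largest share of its area; largest = TRUE keeps only that best match.
blocksWithCA <- st_join(chicagoBlocks, chicagoCommunityAreas, largest = TRUE)
```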
Not currently implemented due to time constraints. This is an additional data source that I am contemplating using. I also felt uneasy about the prospect of sending many thousands of API calls to a server that probably would not be expecting them.
```r
data %>%
  group_by(GEOID_tract) %>%
  summarize(
    n = sum(PB_nTests, na.rm = TRUE),
    gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE) / n
  ) %>%
  group_by(n, gt1pct) %>%
  summarize(nTract = n()) %>%
  ggplot(aes(x = n, y = gt1pct, color = nTract, size = nTract)) +
  geom_point() +
  labs(
    title = "Number of Tests vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
    x = "Tests (#)",
    y = "Tests with Greater than 1 PPB Lead Detected (%)"
  ) +
  scale_color_gradient(low = "steelblue1", high = "black") +
  scale_y_continuous(labels = scales::percent) +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
```
Code
```r
data %>%
  group_by(GEOID_tract) %>%
  summarize(
    nPer100People = 100 * sum(PB_nTests, na.rm = TRUE) / population_tractE[1],
    gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE) / sum(PB_nTests, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = nPer100People, y = gt1pct)) +
  geom_point(color = "steelblue1") +
  labs(
    title = "Number of Tests per 100 Residents vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
    x = "Tests Per 100 Residents (#)",
    y = "Tests with Greater than 1 PPB Lead Detected (%)"
  ) +
  scale_y_continuous(labels = scales::percent) +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
```
We can see that residents on the north side have disproportionately low positive lead rates when compared to the rest of Chicago. This comes as no surprise. A prior analysis published in The Guardian found that “nine of the top 10 zip codes with the largest percentages of high test results were neighborhoods with majorities of Black and Hispanic residents.” [2]
The areas with the highest tests-per-population rates seem to fall within the community areas of Beverly, Hyde Park, Forest Glen, and Lincoln Square, but there doesn’t seem to be a clearly discernible reason for the higher rates within these communities.
When looking at the percentage of tests with greater than 1 PPB lead at the tract level, there doesn’t seem to be a clear association with mean building age.
It is important to note that within Chicago, city code mandated that “any pipe 2 in or less in diameter connecting a home to the water system had to be lead” [3] until the 1986 amendment to the Safe Drinking Water Act prohibited the use of pipes, solder, or flux that were not lead-free in water systems intended for human consumption [4]. The amendment stopped short of requiring removal of the lead pipes already in the ground. This means that to this day Chicago still has over 400,000 lead service lines [5], which can be identified quite accurately simply by looking at the age of a home. Indeed, when looking at the percentage of tests with greater than 1 PPB lead at the tract level, it appears to be almost bounded by the percentage of buildings built before 1986.
```r
data %>%
  group_by(GEOID_tract) %>%
  summarize(
    bldAge_mean = sum(bldAge_mean * bld_number, na.rm = TRUE) / sum(bld_number, na.rm = TRUE),
    PB_gt1Pct = sum(PB_gt1Pct * PB_nTests, na.rm = TRUE) / sum(PB_nTests, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = bldAge_mean, y = PB_gt1Pct)) +
  geom_point(color = "steelblue1") +
  labs(
    title = "Mean Building Age vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
    x = "Mean Building Age (years)",
    y = "Tests with Greater than 1 PPB Lead Detected (%)"
  ) +
  scale_y_continuous(labels = scales::percent) +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
```
Code
```r
data %>%
  group_by(GEOID_tract) %>%
  summarize(
    bldAge_nBfr86 = 1 - (sum(bldAge_nAft86, na.rm = TRUE) / sum(bld_number, na.rm = TRUE)),
    PB_gt1Pct = sum(PB_gt1Pct * PB_nTests, na.rm = TRUE) / sum(PB_nTests, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = bldAge_nBfr86, y = PB_gt1Pct)) +
  geom_point(color = "steelblue1") +
  labs(
    title = "Percent of Buildings built after 1986 vs Percent of Tests with \nGreater than 1 PPB Lead Detected by Tract",
    x = "Buildings built after 1986 (%)",
    y = "Tests with Greater than 1 PPB Lead Detected (%)"
  ) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
```
There seems to be a considerable amount of multicollinearity, so we should pick a model that accounts for that.
I subset the testing data to the community areas within the West Side to reduce the time necessary to fit the models. I also will not fit a lightGBM model because of time constraints. Finally, I set the number of tests to 5 for all blocks without testing data, so that the model predicts whether 3 or more of the 5 theoretical tests would be positive for lead above 1 ppb.
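The 5-test fill-in works roughly as follows (toy data; the column names are illustrative stand-ins for the ones in my dataset):

```r
# Toy sketch: blocks without tests get a nominal 5 tests, so the binary
# outcome asks whether a majority (>= 3 of 5) of tests would exceed 1 ppb.
tests <- data.frame(PB_nTests = c(NA, 2, 8), PB_nPos = c(NA, 2, 3))

tests$PB_nTests[is.na(tests$PB_nTests)] <- 5
tests$leadContaminated <- tests$PB_nPos / tests$PB_nTests > 0.5
```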
```r
logistic_class_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# We're not going to use lightGBM because it just takes too long to fit.
lightGBM_class_spec <- boost_tree(tree_depth = tune(), learn_rate = tune(),
                                  loss_reduction = tune(), min_n = tune(),
                                  sample_size = tune(), trees = tune()) %>%
  set_engine("lightgbm") %>%
  set_mode("classification")

rf_class_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")
```
We can see that of our two models, the random forest model tends to yield the highest AUC.
Code
autoplot(grid_results)
Tune Best Model
Select Best Model
A known bug in the `tune_bayes()` function means that I have to pass the parameter range for `mtry` for the random forest manually, instead of the function finalizing that range automatically.
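The workaround looks roughly like this (the workflow and resample object names are assumptions, as is the upper bound on `mtry`):

```r
library(tune)
library(dials)

# Workaround sketch: finalize the data-dependent mtry range by hand and
# pass it to tune_bayes() via param_info.
rf_params <- rf_workflow %>%
  extract_parameter_set_dials() %>%
  update(mtry = mtry(range = c(1L, 30L)))  # upper bound: assumed predictor count

rf_bayes_results <- tune_bayes(
  rf_workflow,
  resamples  = cv_folds,
  param_info = rf_params,
  initial    = grid_results,
  iter       = 20
)
```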
In our final fit, we got a slightly lower AUC, but not so low that I would start to suspect overfitting. I am quite happy with this model considering I am only working with a fraction of the data.
Code
```r
final_fit %>%
  collect_metrics() %>%
  knitr::kable()
```
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | binary | 0.7417840 | Preprocessor1_Model1 |
| roc_auc | binary | 0.7957404 | Preprocessor1_Model1 |
Due to the unbalanced nature of our data, the model is better at predicting blocks where a majority of tests have greater than 1 ppb lead detected. The Synthetic Minority Oversampling Technique (SMOTE) was used in preprocessing to try to address this problem, but it seems that the problem remains.
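The SMOTE step lives in the preprocessing recipe via the `themis` package; a sketch, with the outcome name, predictors, and training data object assumed for illustration:

```r
library(recipes)
library(themis)

# Sketch: oversample the minority outcome class during preprocessing so the
# classifier sees a more balanced training set.
lead_recipe <- recipe(leadContaminated ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_smote(leadContaminated)
```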
Unsurprisingly, the most important variables in our model are the number of homes built after 1986 and the percent of children ages 1-5 with a blood lead level at or above 5 micrograms per deciliter. Interestingly, the percent Asian non-Hispanic at the block, block group, and tract levels are all within the top ten most important variables; this would be something to look into in the future. The number of tests is also an important factor, which was expected, although it muddies some of the interpretability of the predictions.
All in all, I am pretty happy with the end product of this project. Although I was not able to create a model for the entire city or utilize lightGBM, my model seems to do a similar job to the one published in Huynh et al., even if over a more limited area. I am definitely interested in looking further into this data, especially considering the work I put into combining all these disparate data sources (hopefully once I gain access to some high-powered computing resources). Thank you for the semester!
References
[1]
B. Q. Huynh, E. T. Chin, and M. V. Kiang, “Estimated Childhood Lead Exposure From Drinking Water in Chicago,” JAMA Pediatrics, p. e240133, Mar. 2024, doi: 10.1001/jamapediatrics.2024.0133.