This project is an attempt to replicate the machine learning model described in Huynh et al. [1], which predicts lead exposure from drinking water in the city of Chicago. Lead-contaminated drinking water at the block level was defined as a binary variable indicating whether the majority of tests within a block had a lead concentration of at least 1 ppb. According to their model, approximately 75% of blocks are estimated to have lead-contaminated drinking water. The strongest predictors of lead-contaminated drinking water were geographic area, block population, and number of buildings.
Huynh et al. (and by extension, I) used the following data sources:
Data Sources

| Source | Measure | Extent |
|---|---|---|
| City of Chicago Department of Water Management Lead Test Data | Consecutive lead tests (ppb) | Anonymized to the block |
| Census | Block FIPS | Block |
| | Population (#) | Block |
| | Race/ethnicity (#): AIAN, Asian, Black, Hispanic, White | Block |
| American Community Survey | Block FIPS | Block |
| | Block group FIPS | Block group |
| | Population (#) | Block group, tract |
| | Race/ethnicity (#): AIAN, Asian, Black, Hispanic, White | Block group, tract |
| | Housing units (#) | Block group |
| | Median house value ($) | Block group |
| | Upper house value ($) | Block group |
| | Lower house value ($) | Block group |
| | Median homeowner costs ($) | Block group |
| | Education (#): high school, GED, <1 year of college, >1 year of college, associate's degree, bachelor's degree, master's degree, professional school, doctoral degree | Block group |
| | Poverty (#) | Block group |
| | English-only speakers (#) | Block group |
| | Computer access (#) | Block group |
| | Internet access (#) | Block group |
| | Complete plumbing facilities (#) | Block group |
| | Vacant housing (#) | Block group |
| | Owner-occupied (#) | Block group |
| | Children under 5 (#) | Block group |
| | Children under 10 (#) | Block group |
| | Children under 18 (#) | Block group |
| Chicago Building Footprints | Median building age (years) | Block |
| | Building (#) | Block |
| | Max age (years) | Block |
| | Mean age (years) | Block |
| | Built after 1986 (#) | Block |
| Chicago Health Atlas | Community area | Community area |
| | Lead poisoning rate (%) | Community area |
| | Historical lead poisoning rate (%) | Community area |
| | Economic diversity index | Tract |
| | Hardship index | Tract |
| | Social vulnerability index | Tract |
| | Major crime (#) | Tract |
| | Eviction rate (%) | Tract |
| | Fine particulate matter concentration (\(\mu\text{g/m}^3\)) | Tract |
Download boundaries at various levels: the city, community areas, and census blocks.
Code
```r
chicagoBoundaries <- st_read("https://data.cityofchicago.org/api/geospatial/qqq8-j68g?fourfour=qqq8-j68g&cacheBust=1712775952&date=20240411&accessType=DOWNLOAD&method=export&format=GeoJSON") %>%
  st_transform("EPSG:4269")

chicagoCommunityAreas <- st_read("https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON") %>%
  st_transform("EPSG:4269")

cookBlocks <- blocks(state = "IL", county = "Cook")
```
Process
Perform an intersection on census blocks to subset to those blocks that fall within the Chicago boundary. Mutate GEOIDs to create complete block, block group, and tract GEOIDs.
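The GEOID step can be sketched with toy data (column names `GEOID_tract` and `GEOID_blkGrp` are the ones I use later). A 15-digit block GEOID concatenates state (2 digits), county (3), tract (6), and block (4), and the first block digit identifies the block group, so the coarser identifiers are simply prefixes:

```r
# Toy sketch: derive tract and block-group GEOIDs as prefixes of the
# 15-digit block GEOID (state 2 + county 3 + tract 6 + block 4 digits).
blocks <- data.frame(GEOID = c("170318391001004", "170318391002011"))

blocks$GEOID_tract  <- substr(blocks$GEOID, 1, 11)  # state + county + tract
blocks$GEOID_blkGrp <- substr(blocks$GEOID, 1, 12)  # ... + block-group digit
```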
Download decennial census and American Community Survey data for all the relevant variables.
Code
```r
blk_data <- get_decennial(geography = "block", variables = blk_vars,
                          state = "IL", county = "Cook", output = "wide")

blkGrp_data <- get_acs(geography = "block group", variables = blkGrp_vars,
                       state = "IL", county = "Cook", output = "wide")

tract_data <- get_acs(geography = "tract", variables = tract_vars,
                      state = "IL", county = "Cook", output = "wide")
```
Process
Join the datasets and fill missing values with multiple imputation by chained equations (MICE), using random forest as the regression model. Calculate total youth demographics by adding the male and female counts.
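A minimal sketch of that imputation step, assuming the `mice` package and purely illustrative column names and toy values:

```r
library(mice)

# Toy sketch: impute missing covariates by chained equations, with random
# forest as the per-variable regression model (method = "rf").
set.seed(1)
joined <- data.frame(
  medianValue   = c(250000, NA, 310000, 180000, NA, 220000),
  under5_male   = c(12, 8, NA, 5, 9, 7),
  under5_female = c(10, NA, 6, 4, 11, 8)
)

imputed <- complete(mice(joined, method = "rf", m = 5, printFlag = FALSE))

# Total youth demographics: male + female counts.
imputed$under5 <- imputed$under5_male + imputed$under5_female
```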
Remove buildings with empty geometries or empty street names. Spatially join buildings to their blocks. Summarize building characteristics at the block level.
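That step can be sketched with `sf` and `dplyr`; the objects `buildings` and `chicagoBlocks` and the column names (`year_built`, `st_name`, and the summary names) are assumptions for illustration, not the exact ones in my pipeline:

```r
library(sf)
library(dplyr)

# Sketch (hypothetical names): drop unusable footprints, attach each
# building to the block containing it, then summarize per block.
blockBuildings <- buildings %>%
  filter(!st_is_empty(geometry), st_name != "") %>%
  st_join(chicagoBlocks, join = st_intersects) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarize(
    bld_number    = n(),
    bldAge_median = median(2024 - year_built, na.rm = TRUE),
    bldAge_mean   = mean(2024 - year_built, na.rm = TRUE),
    bldAge_max    = max(2024 - year_built, na.rm = TRUE),
    bldAge_nAft86 = sum(year_built > 1986, na.rm = TRUE)
  )
```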
Join testing data to the building with the same address (avoiding geocoding). This step yields a lower success rate of tying tests to blocks, but it was a shortcut necessary to complete this project within a reasonable time frame.
Spatially join census blocks to the community-area Chicago Health Atlas data by assigning each block to the community area that covers it the most. Join the tract data and drop the standard errors.
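The largest-overlap assignment can be sketched with `sf` (object names assumed from the earlier chunks):

```r
library(sf)

# Sketch: attach each block to the single community area that covers the
# largest share of its area; largest = TRUE keeps only that best match.
blocksWithCA <- st_join(chicagoBlocks, chicagoCommunityAreas, largest = TRUE)
```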
Not currently implemented due to time constraints. This is an additional data source that I am contemplating using. I also felt uneasy about the prospect of sending many thousands of API calls to a server that probably would not be expecting them.
```r
data %>%
  group_by(GEOID_tract) %>%
  summarize(
    n = sum(PB_nTests, na.rm = TRUE),
    gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE) / n
  ) %>%
  group_by(n, gt1pct) %>%
  summarize(nTract = n()) %>%
  ggplot(aes(x = n, y = gt1pct, color = nTract, size = nTract)) +
  geom_point() +
  labs(
    title = "Number of Tests vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
    x = "Tests (#)",
    y = "Tests with Greater than 1 PPB Lead Detected (%)"
  ) +
  scale_color_gradient(low = "steelblue1", high = "black") +
  scale_y_continuous(labels = scales::percent) +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
```
Code
```r
data %>%
  group_by(GEOID_tract) %>%
  summarize(
    nPer100People = 100 * sum(PB_nTests, na.rm = TRUE) / population_tractE[1],
    gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE) / sum(PB_nTests, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = nPer100People, y = gt1pct)) +
  geom_point(color = "steelblue1") +
  labs(
    title = "Number of Tests per 100 Residents vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
    x = "Tests Per 100 Residents (#)",
    y = "Tests with Greater than 1 PPB Lead Detected (%)"
  ) +
  scale_y_continuous(labels = scales::percent) +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
```
We can see that residents on the north side have disproportionately low positive lead rates when compared to the rest of Chicago. This comes as no surprise. A prior analysis published in The Guardian found that “nine of the top 10 zip codes with the largest percentages of high test results were neighborhoods with majorities of Black and Hispanic residents.” [2]
The areas with the highest tests-per-population rates seem to fall within the community areas of Beverly, Hyde Park, Forest Glen, and Lincoln Square, but there doesn’t seem to be a clearly discernible reason for the higher rates within these communities.
When looking at the percentage of tests with greater than 1 PPB lead at the tract level, there doesn’t seem to be a clear association with mean building age.
It is important to note that within Chicago, city code mandated that “any pipe 2 in or less in diameter connecting a home to the water system had to be lead” [3] until the 1986 amendment to the Safe Drinking Water Act prohibited the use of pipes, solder, or flux that were not lead-free in water systems intended for human consumption [4]. The amendment stopped short of requiring removal of the lead pipes already in the ground. This means that to this day Chicago still has over 400,000 lead service lines [5], which can be identified quite accurately simply by looking at the age of a home. Indeed, when looking at the percentage of tests with greater than 1 PPB lead at the tract level, it appears to be almost bounded by the percentage of buildings built before 1986.
```r
data %>%
  group_by(GEOID_tract) %>%
  summarize(
    bldAge_mean = sum(bldAge_mean * bld_number, na.rm = TRUE) / sum(bld_number, na.rm = TRUE),
    PB_gt1Pct = sum(PB_gt1Pct * PB_nTests, na.rm = TRUE) / sum(PB_nTests, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = bldAge_mean, y = PB_gt1Pct)) +
  geom_point(color = "steelblue1") +
  labs(
    title = "Mean Building Age vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
    x = "Mean Building Age (years)",
    y = "Tests with Greater than 1 PPB Lead Detected (%)"
  ) +
  scale_y_continuous(labels = scales::percent) +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
```
Code
```r
data %>%
  group_by(GEOID_tract) %>%
  summarize(
    bldAge_nBfr86 = 1 - (sum(bldAge_nAft86, na.rm = TRUE) / sum(bld_number, na.rm = TRUE)),
    PB_gt1Pct = sum(PB_gt1Pct * PB_nTests, na.rm = TRUE) / sum(PB_nTests, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = bldAge_nBfr86, y = PB_gt1Pct)) +
  geom_point(color = "steelblue1") +
  labs(
    title = "Percent of Buildings built after 1986 vs Percent of Tests with \nGreater than 1 PPB Lead Detected by Tract",
    x = "Buildings built after 1986 (%)",
    y = "Tests with Greater than 1 PPB Lead Detected (%)"
  ) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
```
There seems to be a considerable amount of multicollinearity, so we should pick a model that accounts for that.
I subset the testing data to the community areas within the West Side to reduce the time necessary to fit the models. I also will not fit a lightGBM model because of time constraints. Finally, I set the number of tests to 5 for all blocks without testing data, so that the model predicts whether 3 or more of the 5 theoretical tests would be positive for lead above 1 ppb.
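The 5-test fill-in works roughly as follows (toy data; the column names are illustrative stand-ins for the ones in my dataset):

```r
# Toy sketch: blocks without tests get a nominal 5 tests, so the binary
# outcome asks whether a majority (>= 3 of 5) of tests would exceed 1 ppb.
tests <- data.frame(PB_nTests = c(NA, 2, 8), PB_nPos = c(NA, 2, 3))

tests$PB_nTests[is.na(tests$PB_nTests)] <- 5
tests$leadContaminated <- tests$PB_nPos / tests$PB_nTests > 0.5
```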
```r
logistic_class_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# We're not going to use lightGBM because it just takes too long to fit.
lightGBM_class_spec <- boost_tree(tree_depth = tune(), learn_rate = tune(),
                                  loss_reduction = tune(), min_n = tune(),
                                  sample_size = tune(), trees = tune()) %>%
  set_engine("lightgbm") %>%
  set_mode("classification")

rf_class_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")
```
We can see that of our two models, the random forest model tends to yield the highest AUC.
Code
autoplot(grid_results)
Tune Best Model
Select Best Model
A known bug in the `tune_bayes()` function means that I have to pass the parameter range for `mtry` for the random forest manually, instead of the function finalizing that range automatically.
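The workaround looks roughly like this (the workflow and resample object names are assumptions, as is the upper bound on `mtry`):

```r
library(tune)
library(dials)

# Workaround sketch: finalize the data-dependent mtry range by hand and
# pass it to tune_bayes() via param_info.
rf_params <- rf_workflow %>%
  extract_parameter_set_dials() %>%
  update(mtry = mtry(range = c(1L, 30L)))  # upper bound: assumed predictor count

rf_bayes_results <- tune_bayes(
  rf_workflow,
  resamples  = cv_folds,
  param_info = rf_params,
  initial    = grid_results,
  iter       = 20
)
```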
In our final fit, we got a slightly lower AUC, but not so low that I would start to suspect overfitting. I am quite happy with this model considering I am only working with a fraction of the data.
Code
```r
final_fit %>%
  collect_metrics() %>%
  knitr::kable()
```
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | binary | 0.7417840 | Preprocessor1_Model1 |
| roc_auc | binary | 0.7957404 | Preprocessor1_Model1 |
Due to the unbalanced nature of our data, the model is better at predicting blocks where a majority of tests have greater than 1 ppb lead detected. The Synthetic Minority Oversampling Technique (SMOTE) was used in preprocessing to try to address this problem, but it seems that the problem remains.
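The SMOTE step lives in the preprocessing recipe via the `themis` package; a sketch, with the outcome name, predictors, and training data object assumed for illustration:

```r
library(recipes)
library(themis)

# Sketch: oversample the minority outcome class during preprocessing so the
# classifier sees a more balanced training set.
lead_recipe <- recipe(leadContaminated ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_smote(leadContaminated)
```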
Unsurprisingly, the most important variables in our model are the number of homes built after 1986 and the percent of children ages 1-5 with a blood lead level at or above 5 micrograms per deciliter. Interestingly, the percent Asian non-Hispanic at the block, block group, and tract levels are all within the top ten most important variables; this would be something to look into in the future. The number of tests is also an important factor, which was expected, although it muddies some of the interpretability of the predictions.
All in all, I am pretty happy with the end product of this project. Although I was not able to create a model for the entire city or utilize lightGBM, my model seems to do a similar job to the one published in Huynh et al., even if over a more limited area. I am definitely interested in looking further into this data, especially considering the work I put into combining all these disparate data sources (hopefully once I gain access to some high-powered computing resources). Thank you for the semester!
References
[1]
B. Q. Huynh, E. T. Chin, and M. V. Kiang, “Estimated Childhood Lead Exposure From Drinking Water in Chicago,” JAMA Pediatrics, p. e240133, Mar. 2024, doi: 10.1001/jamapediatrics.2024.0133.