Coding Warmup 3

Coding Warmup
Author

Ryan Zomorrodi

Published

February 1, 2024

Part 1

Exercise 7.2.3 from Data Science for Public Policy. Data can be found here.

  • Graph and regress sale price against gross square feet; interpret the results
Code
library(tidyverse)
library(tidymodels)
library(scales)

sales <- read_csv("https://raw.githubusercontent.com/DataScienceForPublicPolicy/diys/main/data/home_sales_nyc.csv")

sales %>%
    ggplot(aes(x = gross.square.feet, y = sale.price)) +
        # low alpha reduces overplotting across thousands of sales
        geom_point(alpha = 0.1, size = 1, color = "springgreen4") +
        # OLS fit line with its confidence band
        geom_smooth(method = "lm", linewidth = 1, color = "black") +
        scale_x_continuous(labels = label_comma()) +
        scale_y_continuous(labels = label_currency()) +
        labs(title = "Sale Price and Gross Square Feet of House Sales in New York City",
            x = "Gross Square Feet",
            y = "Sale Price")

Code
# Fit the same regression through parsnip, then pull out the underlying
# lm object for the standard summary() output
linear_reg() %>% 
    set_engine("lm") %>%
    set_mode("regression") %>%
    fit(sale.price ~ gross.square.feet, data = sales) %>%
    extract_fit_engine() %>%
    summary()

Call:
stats::lm(formula = sale.price ~ gross.square.feet, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1700116  -212264   -44958   138638  8661923 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -42584.389  11534.260  -3.692 0.000223 ***
gross.square.feet    466.176      7.097  65.684  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 463900 on 12666 degrees of freedom
Multiple R-squared:  0.2541,    Adjusted R-squared:  0.254 
F-statistic:  4314 on 1 and 12666 DF,  p-value: < 2.2e-16
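
Interpreting the results: each additional gross square foot is associated with roughly $466 more in sale price, and the slope is highly significant (p < 2e-16). The R-squared of about 0.25, however, means square footage alone explains only about a quarter of the variation in sale prices. As a quick worked example of using the fitted line (the 2,000-square-foot home below is hypothetical):

Code
# Predicted price for a hypothetical 2,000 sq ft home:
# -42584.389 + 466.176 * 2000, roughly $890,000
fit <- lm(sale.price ~ gross.square.feet, data = sales)
predict(fit, newdata = tibble(gross.square.feet = 2000))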

Part 2

Reproduce this figure from tidymodels 3.3

[Figure: "Corr Plot 1", a dot-and-whisker plot of each mtcars predictor's correlation with mpg (wt, cyl, disp, hp, carb, qsec, gear, am, vs, drat); x-axis "Correlation with mpg", ticks from -1.0 to 0.5]

with the data from Part 1, replacing mpg with sale price for the numeric variables.

Code
library(broom)

sales %>%
    # keep numeric columns, dropping the outcome and numeric-coded IDs
    select(where(is.numeric), -c(sale.price, borough, zip.code)) %>%
    # run a correlation test of each remaining predictor against sale price
    map(function(col) cor.test(col, sales$sale.price)) %>%
    # collect the estimates and confidence intervals into one tibble
    map_dfr(tidy, .id = "predictor") %>% 
    ggplot(aes(y = fct_reorder(predictor, estimate))) + 
        geom_point(aes(x = estimate)) + 
        geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = .1) +
        labs(title = "Correlation of Predictors with Sale Price", 
            x = "Correlation", 
            y = "Predictor")

Part 3

Exercise 7.4.5

Estimate a set of regressions, evaluate the pros and cons of each, and select the “best” specification.

Create and analyze the following four models from the textbook and one of your own:

  • Model 1 (mod1) regresses sale price on building area
  • Model 2 (mod2) adds borough as a categorical variable
  • Model 3 (mod3) incorporates an interaction to estimate borough-specific slopes for building area
  • Model 4 (mod4) adds land area
  • Model 5 (mod5, my own) adds building age to Model 2

This is obviously a very rudimentary analysis, but it looks like Model 4 has the lowest AIC and BIC of our five models. At a later point, we should conduct a more robust analysis of these models.

As with all models, the inclusion of additional predictors requires weighing the bias-variance tradeoff. If we were to simply compare the in-sample RMSE of our models, we would always find that the model with the most predictors has the lowest RMSE. AIC, on the other hand, penalizes models with more parameters, so it gives us a decent idea of whether the variance we may be adding is worth the bias we may be reducing.

Code
# one shared spec: ordinary least squares, fit through glm so that
# glance() reports AIC and BIC
glm_spec <- linear_reg() %>% 
    set_engine("glm") %>%
    set_mode("regression")

# Model 1: building area only
mod1_rec <- recipe(sale.price ~ gross.square.feet, data = sales)

# Model 2: borough is stored as an integer code, so recast it as a factor
mod2_rec <- recipe(sale.price ~ gross.square.feet + borough, data = sales) %>%
    step_mutate(borough = factor(borough))

# Model 3: borough-specific slopes for building area
mod3_rec <- recipe(sale.price ~ gross.square.feet + borough, data = sales) %>%
    step_mutate(borough = factor(borough)) %>%
    step_interact(terms = ~ borough:gross.square.feet)

# Model 4: Model 3 plus land area
mod4_rec <- recipe(sale.price ~ gross.square.feet + borough + land.square.feet, data = sales) %>%
    step_mutate(borough = factor(borough)) %>%
    step_interact(terms = ~ borough:gross.square.feet)

# Model 5 (my own): Model 2 plus building age
mod5_rec <- recipe(sale.price ~ gross.square.feet + borough + age, data = sales) %>%
    step_mutate(borough = factor(borough))

# fit each recipe with the shared spec and compare fit statistics
list(mod1_rec, mod2_rec, mod3_rec, mod4_rec, mod5_rec) %>% 
    map(
        function(rec) {
            workflow() %>%
                add_recipe(rec) %>%
                add_model(glm_spec) %>%
                fit(data = sales) %>% 
                glance()
        }) %>%
    bind_rows(.id = "id") %>%
    mutate(id = str_c("mod", id)) %>%
    arrange(AIC)
# A tibble: 5 × 9
  id    null.deviance df.null   logLik     AIC    BIC deviance df.residual  nobs
  <chr>         <dbl>   <int>    <dbl>   <dbl>  <dbl>    <dbl>       <int> <int>
1 mod4        3.65e15   12667 -180884. 361791. 3.62e5  1.87e15       12657 12668
2 mod3        3.65e15   12667 -181071. 362164. 3.62e5  1.93e15       12658 12668
3 mod5        3.65e15   12667 -181709. 363433. 3.63e5  2.13e15       12661 12668
4 mod2        3.65e15   12667 -181710. 363434. 3.63e5  2.13e15       12662 12668
5 mod1        3.65e15   12667 -183258. 366522. 3.67e5  2.73e15       12666 12668
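
As a quick sanity check on what glance() reports, AIC can be computed by hand from the log-likelihood: AIC = 2k - 2*logLik, where k is the number of estimated parameters. A minimal sketch, refitting Model 1 directly with lm():

Code
# AIC by hand: k counts the intercept, the slope, and the residual
# variance (k = 3 for Model 1)
fit1 <- lm(sale.price ~ gross.square.feet, data = sales)
k <- attr(logLik(fit1), "df")
2 * k - 2 * as.numeric(logLik(fit1))   # about 366522, matching mod1 above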
Code
# coefficient estimates for the lowest-AIC specification (mod4)
workflow() %>%
    add_recipe(mod4_rec) %>%
    add_model(glm_spec) %>%
    fit(data = sales) %>%
    tidy()
# A tibble: 11 × 5
   term                           estimate std.error statistic  p.value
   <chr>                             <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                    720684.  191646.        3.76 1.70e- 4
 2 gross.square.feet                1034.      56.5      18.3  9.89e-74
 3 borough2                      -627822.  195168.       -3.22 1.30e- 3
 4 borough3                     -1072863.  192860.       -5.56 2.71e- 8
 5 borough4                      -589096.  192262.       -3.06 2.19e- 3
 6 borough5                      -557855.  192464.       -2.90 3.76e- 3
 7 land.square.feet                   35.7      1.83     19.5  1.63e-83
 8 borough2_x_gross.square.feet     -864.      60.5     -14.3  7.56e-46
 9 borough3_x_gross.square.feet     -292.      57.8      -5.04 4.70e- 7
10 borough4_x_gross.square.feet     -756.      57.5     -13.2  2.91e-39
11 borough5_x_gross.square.feet     -882.      57.7     -15.3  2.71e-52

Part 4

In the class divvy example (see the lectures page for code/files), we had a lot of missing values in our data. We also didn't have a very rigorous treatment of time/seasonality. Explore how impactful these issues are by creating a few different models and comparing their predictions using the workflow we saw in class: rsample (splitting data), parsnip (linear_reg, set_engine, set_mode, fit), yardstick (mape, rmse), and broom (augment).

Due to time constraints, I chose not to do this problem rather than give a subpar response.
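
For future reference, though, here is a minimal skeleton of the workflow the prompt describes. Everything named in it is a stand-in: a hypothetical divvy data frame with a numeric rides outcome, not the actual class data.

Code
library(tidymodels)

# hypothetical data: `divvy` with a numeric `rides` outcome; swap in the
# real column names from the class files
set.seed(123)

# rsample: hold out a test set
divvy_split <- initial_split(divvy, prop = 0.8)
divvy_train <- training(divvy_split)
divvy_test <- testing(divvy_split)

# parsnip: specify and fit a linear model on the training data
lm_fit <- linear_reg() %>%
    set_engine("lm") %>%
    set_mode("regression") %>%
    fit(rides ~ ., data = divvy_train)

# broom: augment() appends a .pred column to the test set;
# yardstick: score the predictions with RMSE and MAPE
eval_metrics <- metric_set(rmse, mape)
lm_fit %>%
    augment(new_data = divvy_test) %>%
    eval_metrics(truth = rides, estimate = .pred)

Refitting this pipeline once per treatment of missing values and seasonality would make the candidate models directly comparable on the same held-out split.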