Part 1
Exercise 7.2.3 from Data Science for Public Policy. The data can be found at https://raw.githubusercontent.com/DataScienceForPublicPolicy/diys/main/data/home_sales_nyc.csv.
Graph and regress sale price against gross square feet; interpret the results.
Code
library(tidyverse)
library(tidymodels)
library(scales)

sales <- read_csv("https://raw.githubusercontent.com/DataScienceForPublicPolicy/diys/main/data/home_sales_nyc.csv")

sales %>%
  ggplot(aes(x = gross.square.feet, y = sale.price)) +
  geom_point(alpha = 0.1, size = 1, color = "springgreen4") +
  geom_smooth(method = "lm", linewidth = 1, color = "black") +
  scale_x_continuous(labels = label_comma()) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Sale Price and Gross Square Feet of House Sales in New York City",
    x = "Gross Square Feet",
    y = "Sale Price"
  )
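Since the exercise also asks for the regression itself, here is an explicit fit of the same model that geom_smooth draws above; a minimal sketch, with broom's tidy() used only to format the coefficients:
Code
library(broom)

fit <- lm(sale.price ~ gross.square.feet, data = sales)
tidy(fit)  # slope: expected change in sale price per additional gross square foot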
Part 2
Repeat the correlation analysis with the data from Part 1, replacing mpg with sale price for the numeric variables.
Code
library(broom)

sales %>%
  select(where(is.numeric), -c(sale.price, borough, zip.code)) %>%
  map(function(col) cor.test(col, sales$sale.price)) %>%
  map_dfr(tidy, .id = "predictor") %>%
  ggplot(aes(y = fct_reorder(predictor, estimate))) +
  geom_point(aes(x = estimate)) +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.1) +
  labs(
    title = "Correlation of Predictors with Sale Price",
    x = "Correlation",
    y = "Predictor"
  )
Part 3
Exercise 7.4.5
Estimate a set of regressions, evaluate the pros and cons of each, and select the “best” specification.
Create and analyze the following four models from the textbook and one of your own (the four textbook models are sketched after the list):
Model 1 (mod1) regresses sale price on building area
Model 2 (mod2) adds borough as a categorical variable
Model 3 (mod3) incorporates an interaction to estimate borough-specific slopes for building area
Model 4 (mod4) adds land area
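A minimal sketch of the four textbook specifications, assuming the building-area and land-area columns are named gross.square.feet and land.square.feet as in the dataset above, and that mod4 layers land area on top of the interaction model; the fifth, self-chosen model is left open:
Code
# Four candidate specifications for sale price
mod1 <- lm(sale.price ~ gross.square.feet, data = sales)
mod2 <- lm(sale.price ~ gross.square.feet + factor(borough), data = sales)
mod3 <- lm(sale.price ~ gross.square.feet * factor(borough), data = sales)
mod4 <- lm(sale.price ~ gross.square.feet * factor(borough) + land.square.feet, data = sales)

# Compare fit statistics side by side
list(mod1 = mod1, mod2 = mod2, mod3 = mod3, mod4 = mod4) %>%
  map_dfr(glance, .id = "model") %>%
  select(model, r.squared, AIC, BIC)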
This is obviously a very rudimentary analysis, but it looks like model 4 has the lowest AIC and BIC of our five models. At a later point, we should conduct a more robust analysis of these models.
As with all models, deciding which predictors to include requires some contemplation of the bias-variance tradeoff. If we simply compared the in-sample RMSE of our models, we would always find that the model with all predictors has the lowest RMSE. AIC, on the other hand, penalizes models with more parameters, so it should give us a decent idea of whether the variance we may be adding is worth the bias we may be reducing.
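For reference, AIC = 2k - 2 log(L), where k is the number of estimated parameters and L is the maximized likelihood; the 2k term is the penalty for added complexity. A quick sketch verifying R's AIC() against that formula, using mod4 from the sketch above:
Code
ll <- logLik(mod4)
k  <- attr(ll, "df")                  # parameter count, including the error variance
manual_aic <- 2 * k - 2 * as.numeric(ll)
all.equal(manual_aic, AIC(mod4))      # TRUE: same penalized fit statistic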
In the class Divvy example (see the lectures page for code/files), we had a lot of missing values in our data, and we didn't have a very rigorous treatment of time/seasonality. Explore how impactful these issues are by creating a few different models and comparing their predictions using the workflows we saw in class: rsample (splitting data), parsnip (linear_reg, set_engine, set_mode, fit), yardstick (mape, rmse), and broom (augment).
Due to time constraints, I chose not to do this problem rather than give a subpar response.
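For reference only, the general shape of that comparison would look something like the sketch below. Everything here is assumed: a hypothetical divvy_trips data frame with a numeric duration outcome and temperature and month predictors stands in for the real class data.
Code
library(tidymodels)

# Hypothetical inputs: divvy_trips with numeric outcome `duration`
# and predictors `temperature` and `month` (all names assumed).
set.seed(7)
trips_split <- initial_split(divvy_trips, prop = 0.8)

spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# One model ignoring seasonality, one with month dummies
fit1 <- fit(spec, duration ~ temperature, data = training(trips_split))
fit2 <- fit(spec, duration ~ temperature + factor(month), data = training(trips_split))

# Out-of-sample comparison with yardstick
eval_metrics <- metric_set(rmse, mape)

bind_rows(
  augment(fit1, new_data = testing(trips_split)) %>% mutate(model = "no seasonality"),
  augment(fit2, new_data = testing(trips_split)) %>% mutate(model = "month dummies")
) %>%
  group_by(model) %>%
  eval_metrics(truth = duration, estimate = .pred)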