class: title-slide

<br>
<br>

.pull-right[

# Model Evaluation and Validation

## Dr. Mine Dogucu

]

---

```r
glimpse(babies)
```

```
## Rows: 1,236
## Columns: 8
## $ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144,…
## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, …
## $ parity    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30,…
## $ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63,…
## $ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, …
## $ smoke     <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1,…
```

---

```r
model_g <- lm(bwt ~ gestation, data = babies)
tidy(model_g)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -10.1      8.32       -1.21 2.27e- 1
## 2 gestation      0.464    0.0297     15.6  3.22e-50
```

---

```r
glance(model_g)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.166         0.166  16.7      244. 3.22e-50     1 -5175. 10356. 10371.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

---

```r
glance(model_g)$r.squared
```

```
## [1] 0.1663449
```

About 16.6% of the variation in birth weight is explained by gestation. Higher values of R-squared are preferred.
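
---

The R-squared that `glance()` reports can be reproduced from the definition `\(R^2 = 1 - \frac{SSE}{SST}\)`. A minimal sketch (the helper `r2_manual()` is ours, not part of broom or modelr):

```r
# R squared = 1 - SSE/SST, computed from a fitted lm object.
# SSE comes from the model's residuals; SST from the observed response.
r2_manual <- function(model) {
  y <- model.frame(model)[[1]]  # observed response values used in the fit
  1 - sum(residuals(model)^2) / sum((y - mean(y))^2)
}
```

Calling `r2_manual(model_g)` should agree with `glance(model_g)$r.squared`.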
<img src="10a-model-eval_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

<img src="10a-model-eval_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

```r
model_gsa <- lm(bwt ~ gestation + smoke + age, data = babies)
```

---

## Adjusted R-Squared

.pull-left[

```r
glance(model_g)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.166         0.166  16.7      244. 3.22e-50     1 -5175. 10356. 10371.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

.pull-right[

```r
glance(model_gsa)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.212         0.210  16.2      108. 3.45e-62     3 -5089. 10187. 10213.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

---

```r
babies %>% 
  add_predictions(model_g) %>% 
  add_residuals(model_g) %>% 
  select(bwt, pred, resid)
```

```
## # A tibble: 1,236 x 3
##      bwt  pred  resid
##    <int> <dbl>  <dbl>
##  1   120  122.  -1.79
##  2   113  121.  -7.86
##  3   128  119.   8.53
##  4   123   NA   NA   
##  5   108  121. -12.9 
##  6   136  123.  13.3 
##  7   138  103.  34.8 
##  8   132  104.  28.3 
##  9   120  124.  -4.11
## 10   143  129.  14.2 
## # … with 1,226 more rows
```

---

```r
babies %>% 
  add_predictions(model_gsa) %>% 
  add_residuals(model_gsa) %>% 
  select(bwt, pred, resid)
```

```
## # A tibble: 1,236 x 3
##      bwt  pred  resid
##    <int> <dbl>  <dbl>
##  1   120  125.  -4.80
##  2   113  125. -11.5 
##  3   128  115.  13.3 
##  4   123   NA   NA   
##  5   108  115.  -7.47
##  6   136  125.  10.5 
##  7   138  108.  30.4 
##  8   132  107.  25.0 
##  9   120  127.  -6.81
## 10   143  124.  19.2 
## # … with 1,226 more rows
```

---

## Root Mean Square Error

`\(RMSE = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n}}\)`

---

```r
modelr::rmse(model_gsa, babies)
```

```
## [1] 16.1687
```

```r
modelr::rmse(model_g, babies)
```

```
## [1] 16.6512
```

---

```r
model_full <- lm(bwt ~ gestation + parity + age + 
                   height + weight + smoke, 
                 data = babies)
```

```r
modelr::rmse(model_full, babies)
```

```
## [1] 15.78198
```

```r
glance(model_full)$r.squared
```

```
## [1] 0.2579535
```

Can we keep adding variables until we get an EXCELLENT model fit?

---

## Overfitting

- We fit the model to sample data.
- Our goal, however, is to understand the population.
- If we tailor the model too closely to our sample, it may not perform as well on other samples from the same population.
- In that case we are overfitting the data.
- We can guard against overfitting with **model validation** techniques.

---

## Splitting the Data (Train vs. Test)

```r
set.seed(12345)
babies_split <- rsample::initial_split(babies) ## 75% to 25% split
```

--

```r
babies_train <- rsample::training(babies_split)
dim(babies_train)
```

```
## [1] 927   8
```

--

```r
babies_test <- rsample::testing(babies_split)
dim(babies_test)
```

```
## [1] 309   8
```

---

```r
model_gsa_train <- lm(bwt ~ gestation + smoke + age, data = babies_train)
model_gsa_test <- lm(bwt ~ gestation + smoke + age, data = babies_test)
```

---

.pull-left[

```r
glance(model_gsa_train)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.209         0.206  16.1      79.5 1.19e-45     3 -3811. 7632. 7656.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

.pull-right[

```r
glance(model_gsa_test)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.229         0.221  16.4      29.6 8.62e-17     3 -1276. 2561. 2580.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

---

.pull-left[

```r
modelr::rmse(model_gsa_train, babies_train)
```

```
## [1] 16.09232
```
]

.pull-right[

```r
modelr::rmse(model_gsa_test, babies_test)
```

```
## [1] 16.29977
```
]
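
---

The RMSE values above can also be computed directly from the definition. A small sketch (the helper `rmse_manual()` is ours, not part of modelr; `na.rm = TRUE` skips rows where a predictor is missing, such as case 4's `gestation`):

```r
# RMSE = sqrt(mean of squared prediction errors) for a model on any data set
rmse_manual <- function(model, data) {
  resp <- all.vars(formula(model))[1]             # response variable name
  err  <- data[[resp]] - predict(model, newdata = data)
  sqrt(mean(err^2, na.rm = TRUE))                 # drop rows with NA predictions
}
```

For validation, the more telling comparison is the training model evaluated on the held-out test set, e.g. `rmse_manual(model_gsa_train, babies_test)`: a test RMSE much larger than the training RMSE would signal overfitting.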