class: title-slide

<br>
<br>

.pull-right[

# Model Evaluation and Validation

## Dr. Mine Dogucu

]

---

```r
glimpse(babies)
```

```
## Rows: 1,236
## Columns: 8
## $ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144,…
## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, …
## $ parity    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30,…
## $ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63,…
## $ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, …
## $ smoke     <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1,…
```

---

```r
model_g <- lm(bwt ~ gestation, data = babies)
tidy(model_g)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -10.1      8.32       -1.21 2.27e- 1
## 2 gestation      0.464    0.0297     15.6  3.22e-50
```

---

```r
glance(model_g)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.166         0.166  16.7      244. 3.22e-50     1 -5175. 10356. 10371.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

---

```r
glance(model_g)$r.squared
```

```
## [1] 0.1663449
```

About 16.6% of the variation in birth weight is explained by gestation. Higher values of R-squared are preferred.
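
---

The R-squared that `glance()` reports can be reproduced from the definition `\(R^2 = 1 - \frac{SSE}{SST}\)`. A minimal sketch (the helper `r2_manual()` is ours, not part of broom or modelr):

```r
# R squared = 1 - SSE/SST, computed from a fitted lm object.
# SSE comes from the model's residuals; SST from the observed response.
r2_manual <- function(model) {
  y <- model.frame(model)[[1]]  # observed response values used in the fit
  1 - sum(residuals(model)^2) / sum((y - mean(y))^2)
}
```

Calling `r2_manual(model_g)` should agree with `glance(model_g)$r.squared`.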
<img src="10a-model-eval_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

<img src="10a-model-eval_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

```r
model_gsa <- lm(bwt ~ gestation + smoke + age, data = babies)
```

---

## Adjusted R-Squared

.pull-left[

```r
glance(model_g)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.166         0.166  16.7      244. 3.22e-50     1 -5175. 10356. 10371.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

.pull-right[

```r
glance(model_gsa)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.212         0.210  16.2      108. 3.45e-62     3 -5089. 10187. 10213.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

---

```r
babies %>% 
  add_predictions(model_g) %>% 
  add_residuals(model_g) %>% 
  select(bwt, pred, resid)
```

```
## # A tibble: 1,236 x 3
##      bwt  pred  resid
##    <int> <dbl>  <dbl>
##  1   120  122.  -1.79
##  2   113  121.  -7.86
##  3   128  119.   8.53
##  4   123   NA   NA   
##  5   108  121. -12.9 
##  6   136  123.  13.3 
##  7   138  103.  34.8 
##  8   132  104.  28.3 
##  9   120  124.  -4.11
## 10   143  129.  14.2 
## # … with 1,226 more rows
```

---

```r
babies %>% 
  add_predictions(model_gsa) %>% 
  add_residuals(model_gsa) %>% 
  select(bwt, pred, resid)
```

```
## # A tibble: 1,236 x 3
##      bwt  pred  resid
##    <int> <dbl>  <dbl>
##  1   120  125.  -4.80
##  2   113  125. -11.5 
##  3   128  115.  13.3 
##  4   123   NA   NA   
##  5   108  115.  -7.47
##  6   136  125.  10.5 
##  7   138  108.  30.4 
##  8   132  107.  25.0 
##  9   120  127.  -6.81
## 10   143  124.  19.2 
## # … with 1,226 more rows
```

---

## Root Mean Square Error

`\(RMSE = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n}}\)`

---

```r
modelr::rmse(model_gsa, babies)
```

```
## [1] 16.1687
```

```r
modelr::rmse(model_g, babies)
```

```
## [1] 16.6512
```

---

```r
model_full <- lm(bwt ~ gestation + parity + age + 
                   height + weight + smoke, 
                 data = babies)
```

```r
modelr::rmse(model_full, babies)
```

```
## [1] 15.78198
```

```r
glance(model_full)$r.squared
```

```
## [1] 0.2579535
```

Can we keep adding variables until we get an EXCELLENT model fit?

---

## Overfitting

- We fit the model to sample data.
- Our goal, however, is to understand the population.
- If we tailor the model too closely to our sample, it may not perform as well on other samples from the same population.
- In that case we are overfitting the data.
- We can guard against overfitting with **model validation** techniques.

---

## Splitting the Data (Train vs. Test)

```r
set.seed(12345)
babies_split <- rsample::initial_split(babies) ## 75% to 25% split
```

--

```r
babies_train <- rsample::training(babies_split)
dim(babies_train)
```

```
## [1] 927   8
```

--

```r
babies_test <- rsample::testing(babies_split)
dim(babies_test)
```

```
## [1] 309   8
```

---

```r
model_gsa_train <- lm(bwt ~ gestation + smoke + age, data = babies_train)
model_gsa_test <- lm(bwt ~ gestation + smoke + age, data = babies_test)
```

---

.pull-left[

```r
glance(model_gsa_train)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.209         0.206  16.1      79.5 1.19e-45     3 -3811. 7632. 7656.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

.pull-right[

```r
glance(model_gsa_test)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.229         0.221  16.4      29.6 8.62e-17     3 -1276. 2561. 2580.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

---

.pull-left[

```r
modelr::rmse(model_gsa_train, babies_train)
```

```
## [1] 16.09232
```
]

.pull-right[

```r
modelr::rmse(model_gsa_test, babies_test)
```

```
## [1] 16.29977
```
]
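
---

The RMSE values above can also be computed directly from the definition. A small sketch (the helper `rmse_manual()` is ours, not part of modelr; `na.rm = TRUE` skips rows where a predictor is missing, such as case 4's `gestation`):

```r
# RMSE = sqrt(mean of squared prediction errors) for a model on any data set
rmse_manual <- function(model, data) {
  resp <- all.vars(formula(model))[1]             # response variable name
  err  <- data[[resp]] - predict(model, newdata = data)
  sqrt(mean(err^2, na.rm = TRUE))                 # drop rows with NA predictions
}
```

For validation, the more telling comparison is the training model evaluated on the held-out test set, e.g. `rmse_manual(model_gsa_train, babies_test)`: a test RMSE much larger than the training RMSE would signal overfitting.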