class: title-slide <br> <br> .pull-right[ # Variance ## Dr. Mine Dogucu ] --- class: inverse middle .font75[If there was no variance there would be no statistics.] --- ## What if? - We want to understand average number of sleep Irvine residents get. What if everyone in Irvine slept 8 hours every night? (`sleep` = {8, 8,..., 8}) -- - We want to predict who will graduate college. What if everyone graduated college? (`graduate` = {TRUE, TRUE,..., TRUE}) -- - We want to understand if Android users spend more time on their phones when compared to iOS users. What if everyone spent 3 hours per day on their phones? (`time` = {3, 3,..., 3}, `os` = {Android, Android, .... iOS}) -- - We want to understand, if birth height and weight are positively associated in babies. What if every baby was 7.5 lbs? (`weight` = {7.5, 7.5,..., 7.5}, `height` = {20, 22,..., 18}) --- class: middle In all this fake scenarios there would be no **variance** in `sleep`, `graduate`, `time`, `weight`. These variables would all be constants thus would not even be a **vari**able. --- class: middle .font50[Things vary. We use statistics to understand **how** things vary and often we want to know **why** they vary.] --- Consider Dr. Dogucu teaching three classes. All of these classes have 5 students. Below are midterm results from these classes. Which of the classes have a higher variance? Class 1: 80 80 80 80 80 Class 2: 76 78 80 82 84 Class 3: 60 70 80 90 100 -- All of these classes have a mean of 80 points. But the data differ! In order to explain how these are different we examine how far off each observed value is from the mean on "average". In class 1 all students are at the mean value so there is no variance. Class 2 students deviate from the mean slightly on "average". Class 3 has the highest deviation from the mean on "average". --- class: inverse middle .font75[Calculating the Variance] --- ## The Mean Consider the following data which represents the number of hours slept for 10 people who were surveyed. `\(y_i = \{7, 7.5, 8, 5.5, 10, 7.2, 7, 8, 9, 8\}\)` ```r y_i <- c(7, 7.5, 8, 5.5, 10, 7.2, 7, 8, 9, 8) ``` -- ```r mean(y_i) ``` ``` ## [1] 7.72 ``` `\(\bar y = 7.72 \text{ hr}\)` --- ## Sample Size ```r length(y_i) ``` ``` ## [1] 10 ``` `\(n = 10\)` -- `\(i = \{1, 2, ... n\}\)` -- `\(i = \{1, 2, ... 10\}\)` --- .pull-left[## Standard Deviation ```r sd(y_i) ``` ``` ## [1] 1.218195 ``` `\(s = 1.218195 \text{ hr}\)` ] .pull-right[## Variance ```r var(y_i) ``` ``` ## [1] 1.484 ``` `\(s^2 = 1.484 \text{ hr}^2\)` ] --- <br> <br> `\(n = 10\)`, `\(\bar y = 7.72 \text{ hr}\)`, `\(s = 1.218195 \text{ hr}\)` Among 10 people the average number of sleep was 7.72 hours. However, everybody did not sleep 7.72 hours. There was deviation from the mean. The standard deviation from the mean was 1.218195 hours. The variance is the squared standard deviation which was `\(1.484 \text{ hr}^2\)`. --- class: inverse middle .font50[How did R calculate the variance?] --- #### Standard deviation and Variance <table align = "center"> <thead> <th>y<sub>i</sub> </th> <th> y<sub>i</sub> - ȳ </th> <th> (y<sub>i</sub> - ȳ) <sup>2</sup></th> </thead> <tr> <td>5.5 </td> <td> 5.5-7.72 = -2.22 hr</td> <td> (-2.2 hr)<sup>2</sup> = 4.9284 hr <sup>2</sup> </td> </tr> <tr> <td>7 </td> <td> 7-7.72 = -0.72 hr</td> <td> (-0.72 hr)<sup>2</sup> = 0.5184 hr <sup>2</sup></td> </tr> <tr> <td>7 </td> <td> 7-7.72 = -0.72 hr</td> <td> (-0.72 hr)<sup>2</sup> = 0.5184 hr <sup>2</sup></td> </tr> <tr> <td>7.2 </td> <td> 7.2-7.72 = -0.52 hr</td> <td> (-0.52 hr)<sup>2</sup> = 0.2704 hr <sup>2</sup></td> </tr> <tr> <td>7.5 </td> <td> 7.5-7.72 = -0.22 hr</td> <td> (-0.22 hr)<sup>2</sup> = 0.0484 hr <sup>2</sup></td> </tr> <tr> <td>8 </td> <td> 8-7.72 = 0.28 hr</td> <td> (0.28 hr)<sup>2</sup> = 0.0784 hr <sup>2</sup></td> </tr> <tr> <td>8 </td> <td> 8-7.72 = 0.28 hr</td> <td> (0.28 hr)<sup>2</sup> = 0.0784 hr <sup>2</sup></td> </tr> <tr> <td>8 </td> <td> 8-7.72 = 0.28 hr</td> <td> (0.28 hr)<sup>2</sup> = 0.0784 hr <sup>2</sup></td> </tr> <tr> <td>9 </td> <td> 9-7.72 = 1.28 hr</td> <td> (1.28 hr)<sup>2</sup> = 1.6384 hr <sup>2</sup></td> </tr> <tr> <td>10 </td> <td> 10-7.72 = 2.28 hr</td> <td> (2.28 hr)<sup>2</sup> = 5.1984 hr <sup>2</sup></td> </tr> </table> --- ## Total distance from the mean `\(\Sigma_{i = 1}^{n} (y_i - \bar y )^2 =\)` `\(4.9284 + 0.5184 + 0.5184 + 0.2704 + 0.0484 +\)` `\(0.0784 + 0.0784 + 0.0784+ 1.6384 + 5.1984 = 13.356 \text{ hr}^2\)` --- ## Sample variance <br> .formula[ `$$s^2 = \frac{\Sigma_{i = 1}^{n} (y_i - \bar y )^2}{n-1}$$` ] <br> `$$s^2= \frac{13.356}{10-1} = 1.484\text{ hr}^2$$` --- ## Notation <table align ="center"> <thead> <th> </th> <th>Sample Statistic</th> <th>Population Parameter</th> </thead> <tr> <td>Mean</td> <td> ȳ </td> <td> μ</td> </tr> <tr> <td> Standard Deviation </td> <td> s </td> <td> σ</td> </tr> <tr> <td> Variance </td> <td> s<sup>2</sup> </td> <td> σ<sup>2</sup></td> </tr> </table> Sample size is denoted by `\(n\)`. --- <img src="07c-variance_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />