---
title: "extra_practice"
author: "sbsambado"
date: "2022-08-19"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
At this point in the quarter, you know several ways of analyzing the relationship between two variables - that is, whether one variable is *predicted by* another. We learned last week that a linear model with one predictor variable can be written as $$Y_i = \beta_0+\beta_1X_i+\epsilon_i$$ where $Y_i$ is the value of your response variable, $X_i$ is the value of your predictor variable, and $\epsilon_i$ is the value of the residual for that point. 

In real life, however, one predictor can rarely explain all of the patterns in your response variable. More often, a response can be predicted by a combination of predictors. For example, your height may be predicted by some combination of your parents' height, your diet and nutrition, your age, and other factors.
We write these multivariate linear models as 
$$Y = \beta_0+\beta_1X_1+ \beta_2X_2 + ... + \beta_nX_n+\epsilon$$

Our hope is that by adding predictors, we will be able to explain more and more of the variability in the response variable in the model, increasing our $R^2$ value and decreasing the size of the residuals. However, each additional predictor adds another coefficient that must be estimated to fit our model. Estimating these extra parameters decreases the degrees of freedom and decreases the power of the analysis. For this reason, you can never have more predictors than you have data points (i.e. your dataset should be longer than it is wide). You will get the best model performance when you have *many* more data points than predictors.

When assessing the fit of a multiple regression model, we always use the *adjusted $R^2$* value instead of the regular $R^2$. The adjustment takes into account how many parameters have been estimated to build the model, as well as the sample size. 


# HOMEWORK

Using the airquality data in R, find a best model to predict ozone levels and assess how well your model performs. Treat this lab as a warm-up for your final project: make it look as neat and professional as you can.

1. Provide some background: What is ozone? Why might some of these variables affect its levels? Where does this data come from? 3-4 sentences is fine. Please cite any external sources you consult.

2. Read in the air quality data and visualize it. What are your potential predictors? What are some potential random factors? Do you think any of these predictors might interact with each other? Remove any missing data to 'clean up' the dataset.
```{r}
airqual <- airquality
write.csv(airquality, file = "airquality.csv")
```

3. Check your data for collinearity, and check whether predictors have a linear relationship with the response variable. Transform any data that violates assumptions (remember, not all predictors need to be normal as long as the model residuals are normal). Check for any major outliers. Describe what you find.

4. Fit a model with one predictor. Check its residuals for normality.

5. Fit at least two other models with different combinations of predictors. State whether you are including random effects or interaction terms.

6. Create a table of results to compare your models using AIC, BIC, and adjusted R-squared. Include the degrees of freedom of each model in the table so that you can consider parsimony in your final decision. *Note: mixed-effects models (models made with `lme()`) may not work in your table if you are comparing models made with `lm()`. Recommend making a table for models just made with `lm()`.

7. Choose the best model and explain why this was your choice. Check its residuals for normality. How well does this model fit the data? How well does it predict ozone levels?



### a. Linear models

We start by looking at the linear models for a few combinations of parameters, and calculating their AIC values.

```{r}
# all predictors
peng_lm1 <- lm(body_mass_g ~ flipper_length_mm + bill_depth_mm + bill_length_mm + sex + species + island + year, data = penguins)
AIC(peng_lm1)

# 5 predictors
peng_lm2 <- lm(body_mass_g ~ flipper_length_mm + bill_depth_mm + bill_length_mm + sex + species, data = penguins)
AIC(peng_lm2)

# 4 predictors
peng_lm3 <- lm(body_mass_g ~ flipper_length_mm + bill_depth_mm + sex + species, data = penguins)
AIC(peng_lm3)
```

### b. Choosing between models

Between these models, the best is `peng_lm2`, which has the lowest AIC value of `r round(AIC(peng_lm2), 2)`. We can calculate the difference between AIC values using
$$
\Delta_i = AIC_i - AIC_{min}
$$
where $AIC_{min}$ is the smallest AIC value being compared (in this case, `r round(AIC(peng_lm2), 2)`), and $AIC_i$ is the AIC value for the model you want to compare (in this case, the AIC value for `peng_lm3` is `r round(AIC(peng_lm3), 2)`). Therefore, the $\Delta_i$ is `r round(AIC(peng_lm3)-AIC(peng_lm2), 2)`.

However, the AIC value for `peng_lm1` is pretty close to the AIC value for `peng_lm2`. If you have two models with similar AIC values, people generally use this rule to evaluate how much those models support your data _together_:
```{r echo = FALSE}
support <- tribble(
  ~Delta, ~'Level of empirical support for Model i',
  '0-2', "Substantial",
  '4-7', "Considerably less",
  '>10', "Essentially none"
) %>% 
  gt() %>% 
  tab_style(
    style = cell_text(size = "small"),
    locations = cells_body())

support
```

### c. Relative support for each model
If we calculate $\Delta_i$ and $w_i$, we can compare the models.

```{r}
# calculate deltai2
delta1 <- exp(-0.5*(AIC(peng_lm1) - AIC(peng_lm2)))
delta2 <- exp(-0.5*(AIC(peng_lm2) - AIC(peng_lm2)))
delta3 <- exp(-0.5*(AIC(peng_lm3) - AIC(peng_lm2)))

# sum likelihoods
sum_likes <- sum(c(delta1, delta2, delta3))

# calculate weights
weight1 <- delta1/sum_likes
weight2 <- delta2/sum_likes
weight3 <- delta3/sum_likes
```

```{r echo = FALSE}
weights <- tribble(
  ~Model,   ~Delta,           ~Weights,
  "peng_lm1", round(delta1, 2), round(weight1, 2),
  "peng_lm2", round(delta2, 2), round(weight2, 2),
  "peng_lm3", round(delta3, 2), round(weight3, 2),
) %>% 
  gt() %>% 
  tab_style(
    style = cell_text(size = "small"),
    locations = cells_body())

weights
```


#### a. Preliminary predictors
```{r}
pairs(~ cover + DOP + DOY+  Wtemp, data=Salamanders)
```

#### b. Choosing the best model
```{r}
big.model <- glm(count ~ site + sample + cover + mined + DOP + DOY + Wtemp, data=Salamanders, family="poisson", na.action = na.pass)
#na.action=na.pass is necessary for the building-all-models function
```

For this, we use the package `MuMIn` which is useful for excluding/including each predictor when you're dealing with a model that has many predictors. 

```{r}
library(MuMIn) #install.packages("MuMIn")
```


```{r echo=FALSE}
sal.dredge <- dredge(big.model)
#this makes all the possible models that are subsets of big.model
```

Edwards suggests three options. 

A) Go off the best (top) model and state that DOY, sample, and site are all important, report effect sizes from this model. 
*but, if we look at the table we see that there are many models with similar weights.*

B) Use the best (top) model and do LRTs on the predictors to test for `significance.` 
*but, 1) combining information criteria and null hypothesis tests is not good since they are fundamentally different and results in scary monsters and angry reviewers.* 

*2) it could be seen as data-dredging by doing null-hypothesis tests on predictors that we already know have some pattern in the data*

C) Note that there are many models with similar weights, and that DOY, sample, and site are all in most of the best models so they have support from the data. Use the best model to report coefficient/effect size plots/etc. 

```{r}
best.model <- glm(count ~ site + sample +DOY, data=Salamanders, family="poisson")
```

Effect size plots for site, sample, and DOY:  
```{r}
effect_plot(model= best.model, pred=site)
effect_plot(model= best.model, pred=sample)
effect_plot(model= best.model, pred=DOY)
```

### Model selection (and how to make pretty data tables)

In lecture, we discussed how to calculate Aikake's Information Criterion (AIC) from the log-likelihood of a model. Remember that AIC is only a valid comparison between models that were fitted using exactly the same data. The absolute value of the AIC doesn't matter; we are only concerned about the *relative* value of AIC. A model with a smaller AIC value performs relatively better than a model with a larger AIC value. Don't get tripped up by negative AICs - **we are looking for overall smallest, not closest to 0**.

We can also compare models using the BIC. AIC and BIC should generally lead us to similar results, but BIC places a higher penalty on extra parameters. This helps to enforce the principle of parsimony - that a simpler model is best.

Finally, we can compare the adjusted R-squared value for each model to see how much variation in the y-variable is explained by each combination of predictors. We are looking for a model with high adjusted R-squared.

It is often convenient to print all of these values for each model as a neat table, so that your reader can compare them all at a glance. The function kable() in the package knitr allows us to print a dataframe as a table in html, pdf, or latex format (run in console, not in your markdown, to see the different formats). Look up the help documentation for kable using ?kable. You can specify left, right, or center alignment, give the number of digits to round to for the whole table or for each column individually, edit the names of your columns, and more to get the perfect beautiful table for your report.

```{r}
#calculate AIC of each model
result <- AIC(fit_1var,fit_2var,fit_3var,fit_4var,fit_full) #this will create a dataframe whose rownames are the models, with columns for the df and AIC of each model

#add other metrics to your table
models <- list(fit_1var,fit_2var,fit_3var,fit_4var,fit_full) #make sure you keep your models in the same order here as they were when you created your results table
result$BIC <- sapply(models, BIC) #add a column for BIC to the results

model_summary <- lapply(models, summary) #look up ?lapply if you have not used this function before

#now we will use a for loop to easily extract the R^2 and adj R^2 value for each model from its summary, and store them in new columns in the results table

for(i in 1:length(models)){ #this creates a variable i that starts with the value i=1
  result$rsq[i] <- model_summary[[i]]$r.squared #we assign the rsq value from model i to the i'th row of the column 'rsq' in the table 'results'
  result$adj_rsq[i] <- model_summary[[i]]$adj.r.squared #same for adjusted rsq
} #now we go back to the beginning of the for-loop, add 1 to the value of i, and do everything again


kable(result, digits = 2, align = "c")

```

### Assessing model fit

When you have chosen your best model based on AIC, BIC, adjusted $R^2$, and the principle of parsimony, you will want to check how well it fits the data. Are the residuals normally distributed? How large are the residuals? If you leave some datapoints out when you fit the model, can you use your model to predict those points accurately?

```{r}
#separate your data into a training set (most of the data) and a test set (a few observations, or <10% of rows)
splitter <- sample(1:nrow(plant_sub), 15, replace = F) #pick 15 random rows from plant_sub to reserve as test data
psub_train <- plant_sub[-splitter,] #leave those rows out of the training data
psub_test <- plant_sub[splitter,] #use them to create a set of test data

fit_4var_split <- lm(percent_cover ~ num_leaves_in_rosette + dist_from_edge_m + dist_from_tree_m + num_flowers, data = psub_train) #fit the final model, using JUST your training set as the data argument

prediction <- predict(fit_4var_split,psub_test) #use the fitted model to predict values for your test data

plot(psub_test$percent_cover, pch=1) #plot the actual test data values
points(prediction, pch=20, col = "red") #plot the model predictions for those points
```


# HOMEWORK

Using the airquality data in R, find a best model to predict ozone levels and assess how well your model performs. Treat this lab as a warm-up for your final project: make it look as neat and professional as you can.

1. Provide some background: What is ozone? Why might some of these variables affect its levels? Where does this data come from? 3-4 sentences is fine. You are required to cite at least one source when writing this response.

2. Read in the air quality data and visualize it. What are your potential predictors? What are some potential random factors? Do you think any of these predictors might interact with each other? Remove any missing data to 'clean up' the dataset.
```{r}
airqual <- airquality

```

3. Check your data for collinearity, and check whether predictors have a linear relationship with the response variable. Transform any data that violates assumptions (remember, not all predictors need to be normal as long as the model residuals are normal). Check for any major outliers. Describe what you find.

4. Fit a model with one predictor. Check its residuals for normality.

5. Fit at least two other models with different combinations of predictors. State whether you are including random effects or interaction terms.

6. Create a table of results to compare your models using AIC, BIC, and adjusted R-squared. Include the degrees of freedom of each model in the table so that you can consider parsimony in your final decision. *Note: mixed-effects models (models made with `lme()`) may not work in your table if you are comparing models made with `lm()`. Recommend making a table for models just made with `lm()`.

7. Choose the best model and explain why this was your choice. Check its residuals for normality. How well does this model fit the data? How well does it predict ozone levels?


### Appendix: using stepwise model selection (optional!)

Model selection is a very complex problem for statisticians, and we have just scratched the surface here. There are many other ways to choose appropriate combinations of predictor variables. One example is stepwise model selection. This process starts with a 'full model' (including all possible predictors) and eliminates variables one by one, calculating AIC each time. When the AIC can't be reduced any more by taking away variables or putting them back, you have the final optimized model. You can try stepwise AIC in R:

```{r}
library(MASS)
?stepAIC

fullmodel <- lm(Ozone~Solar.R+Wind+Temp+Month+Day, data = airqual)
nullmodel <- lm(Ozone~1, data = airqual)

stepAIC(fullmodel, scope = c(upper = fullmodel, lower = nullmodel), direction = "both")
```

How does including interactions affect this process? Do you get similar results from the automated selection as you did in the homework by using your biological intuition and comparing models by hand?


## Epidemic Cards Tutorial: computer exercise
## E2M2: Ecological and Epidemiological Modeling in Madagascar
## January 12-21, 2019

## Cara Brook, 2019

## By now, you've been exposed to some of the variety of model
## types used to understand infectious disease data. We'll now work
## with a few of those forms to explain our own data and compare the
## fits to the data we generated yesterday.


## By the end of this tutorial, you should be able to:
##  * Differentiate between deterministic and stochastic AND discrete and continuous time models.
##  * Load, run, and understand all models of these sorts.
##  * Run a function that fits various models to data by minimizing the sum of squares.
##  * Plot these various fits and compare them

##  * Fit a simple linear regression model and one incorporating random effects
##  * Compare these models performance by AIC

library(plyr)
library(lme4)
library(deSolve)

rm(list=ls())
######################################################################
## Part 1: Visualizing your data

## Earlier, you played an epidemic card game and generated several time
## series of infecteds and susceptibles. 

## (1) Import these data and select the subset for which R0 = 2.
##     Save that subset as an object called: dat.R0
dat <-  read.csv("Epidemic_Cards.csv", header=T)



## (2) Plot each trial as a thin, dashed line (red = infecteds; green = susceptibles)
## like you did yesterday.




######################################################################
## Part 2: Modeling your data in multiple discrete ways

## (3) Brainstorm the different types of models that you might use to
## reconstruct these data. Which type of model did you already build?
## Why did the data sometimes vary from the model predictions? What are
## other types of models you might try to fit?


## (4) Make a vector of values called 'timesteps' that spans from 1 to 10 
## and increases by increments of 1.



## (5) Rewrite your discrete time function from this morning. As before, 
## call it "SIR.discrete.deterministic" and have it take in the input arguments
## of 'R0' and 'time' and return a dataframe of 'time', 'S', and 'I'.
## Save that function to your global environment.



## (6) Check that it is working by running it with the input arguments
## R0=2 and time = timesteps.



## (7) Write a stochastic version of the same model called "SIR.discrete.stochastic"
## with the same input arguments and output dataframe. (Hint, use the built-in R function 
## rnorm() with sd=.5 to add some noise to R0 within each timestep.)


## (8)  Check that it is working by running it with the input arguments
## R0=2 and time = timesteps.



## (9) Run the stochastic model from #8 again above. Does the output look the
## same? Why or why not?

## (10) Save the output from #8 and #9 as the objects 'discrete.determ' and 
## 'discrete.stoch' and plot their output as thick lines atop the plot from #2.
## Color the S class friom discrete.stoch as 'forestgreen' and the I class as 
## orange so you can see that they are different. Which fits the data better?




## (11) Here is the sum_sq function you used for fitting your deterministic 
## model this morning:
sum_sq <- function(par, real.dat, time){
  ## 1. You will first need to run your model at your chosen parameters.
  out.model <- SIR.discrete.deterministic(R0=par, time=time)
  
  ## 2. You will then need to add two columns labeled "I_predictions" and "S_predictions" to your dat.R0
  ##    dataset and fill with 'NAs'
  real.dat$I_predictions <- real.dat$S_predictions <- NA
  
  ## 3. You will then next to fill those columns with model predictions at the appropriate timesteps
  ##    like those from #21. This will require a double for-loop over the unique trials and timesteps.
  
  ## Infecteds first:
  for(i in 1:length(unique(real.dat$trial))){
    for(j in 1:length(unique(real.dat$timestep))){
      real.dat$I_predictions[real.dat$trial==i & real.dat$timestep==j] <- out.model$I[out.model$time==j]
    }
  }
  
  ## And Susceptibles:
  for(i in 1:length(unique(real.dat$trial))){
    for(j in 1:length(unique(real.dat$timestep))){
      real.dat$S_predictions[real.dat$trial==i & real.dat$timestep==j] <- out.model$S[out.model$time==j]
    }
  }
  
  ## 4. You will need to calculate the sum of squared differences between all model predictions (both S
  ##    and I) from the data and return it from the function.)
  
  ## First, get both S and I differences
  real.dat$S_differences <- real.dat$S_predictions - real.dat$susceptibles
  real.dat$I_differences <- real.dat$I_predictions - real.dat$infecteds
  
  ## Then, square them and add them:
  sum.of.sqs = sum(c(real.dat$S_differences, real.dat$I_differences)^2)
  
  return(sum.of.sqs)
  
}
## Adapt it by adding an input argument called 'model' that allows you to choose 
## whether to run either SIR.discrete.deterministic() or SIR.discrete.stochastic().




## (12) Now, use the 'BFGS' method in the function optim() to minimize the difference 
## between your model and the data, and save the returned values from each model
## as objects called 'out.optim.deter' and  'out.optim.stoch'.
## Guess the value of R0 as 2, meaning that you should use the following input variables: 
## optim(par = 2, fn=sum_sq, real.dat=dat.R0, model = "discrete.deterministic", time=seq(1,10,1), method="BFGS")
## optim(par = 2, fn=sum_sq, real.dat=dat.R0, model = "discrete.stochasitic", time=seq(1,10,1), method="BFGS")



## (13) Save out.optim.deter$par as the object R0.discrete.deter.fit and out.optim.stoch$par as R0.discrete.stoch.fit.
## How close are these values to R0=2? Why might they be different?



## (14) Compare out.optim.deter$value and out.optim.stoch$value. Which model
## has the lower sum of squares and better fits the data?


## (15) Re-fit the stochastic model. Do you get the same estimate for R0? Why or 
## why not?


## (16) Run your both your models with their new fitted parameters and write over the 
## output from step #10 with your new predictions. Add these to the plot.




######################################################################
## Part 3: Modeling your data continuously

## (17) Take a look at the following continuous time, deterministic function
## that models the same system as the discrete time model shown above. Adapt the
## function to make a new function called "SIR.continuous.stochastic()" that 
## incorporates stochasticity in the form sd=.5 on the R0 value outside of the sub-ODE.
## The deterministic function:
SIR.continuous.deterministic <- function(R0, time){
  params = list(R0 = R0, gamma=1)
  x.start = c(S=25, I=1)
  ## load the sub-ODE function that the line runs below:
  SIR.ODE <-function(t,x,parms){
    
    #now for the differential equations
    with(c(as.list(c(x,parms))), {
      N = S + I
      
      dS = -R0*((S/N)*I)
      dI = R0*((S/N)*I) - gamma*I
      
      
      
      return(list(c(dS,dI)))
    })
  }
  out <- as.data.frame(lsoda(x.start, time, SIR.ODE, params))
  return(out)
  
  
}



## (18) Test both models at R0=2 to see that they run:



## (19) Do you expect to get the same output every time from both? Try them again to see.



## (20) Edit the sum_sq function from #11 above to allow you to choose to run these
## two additional models too.



## (21) Fit both continuous time models with optim, using the edited sm_sq 
## function from 20. Save output, respectively as out.cont.deter and out.cont.stoch.



## (22) How do these fits compare with the discrete time fits above? Run the fitted
## models and add these new lines to the plot. 




######################################################################
## Part 4: BONUS: TSIR. Modeling your data statistically, but with mechanism

## We've stressed in this course that statistical models tell us about 
## 'correlation' while mechanstic models tell us about 'causation'...But there 
## are a few exceptions. We CAN understand something mechanistic from a 
## statistical model if our x and y regression terms have a mechanistic relationship.

## Recall that our discrete time model for this system is written as:

## S[t+1] = S[t] - (R0 * S[t]/N)*I[t]
## I[t+1]= (R0 * S[t]/N)*I[t]

## Since there is only 1 term in the I equation, we can re-imagine it as:
## y = m * x where:
## y = I[t+1], m = R0, and  x = (S[t]/N)*I[t])
## (b=0 in this case)

## (23) Here, we provide a function that takes a dataset from this model
## and converts it to xy coordinates. Use this function to save a new dataset called
## xy.dat, derived from the R0=2 subset of our original data. (Hint: you will need
## to 'apply' the function over all timesteps in each distinct trial).
## Function:
get.xy <- function(data){
  
  y <- data$infecteds[2:length(data$infecteds)]
  
  x1 <- data$infecteds[1:(length(data$infecteds)-1)]
  x2 <- data$susceptibles[1:(length(data$susceptibles)-1)]
  
  x = x1*x2/26
  
  
  new.dat <- cbind.data.frame(x,y, unique(data$trial))
  names(new.dat) <- c("x", "y", "trial")
  
  return(new.dat)
}

## A. First, make a list of the data separated by trial, using 'dlply'


## B. then apply a function to get the x and y predictors by each trial


## C. Then, and rejoin the list into a dataset


## D. make sure "trial" is a factor to use it as a random effect


## E. and sort by ascending values of Ix


## (24) Plot the values of x vs. y to visualize:


## (25) Fit a simple linear regression, lm(), to these data, and 
## save the output as the object, 'simple.lm'
## Plot a fit line from simple.lm in red.



## (26) Look at the second coefficient of the regression model. 
## What parameter does this signify?


## (27) Examine the AIC. Can we learn anything from this?


## (28) Now fit a model incorporating random effects of 'trial' to these
## data. Save the output as the object 're.lmer'. Plot a fit line
## from this model in 'green'



## Do you notice anything unusual about this line? Why or why not?

## (29) Compare the AIC of re.lmer with simple.lm (above). Which fits the data better?
## Which model would you choose?

