---
title: "Lab 8: Multiple Regression and Model Selection"
author: "Caroline Owens, edited by Sam Sambado"
date: "2/21/2021"
output: html_document
---


```{r setup, include=FALSE, messsage = FALSE}
knitr::opts_chunk$set(echo = TRUE)

# packages needed for lab 8
# If you haven't used a package before you will need to install it with install.packages("package") 
library(readr) # for reading in csv files
library(tidyverse) # for data wrangling 
library(ggplot2) # for data visualizing 
library(car) #for the qqPlot function
library(psych) #for pairs.panels()

# new package alert! 
# install.packages("knitr")
library(knitr) # we will use this package to make beautiful tables
library(nlme) #for fitting mixed effects models (linear models with random effects) ~ new pacakge alert! ~

# datasets needed for lab 8
plant <- read_csv("plant_data.csv") # obviously 
 
airquality <- airquality 


?cbind()
```


## Multivariate Linear Models

At this point in the quarter, you know several ways of analyzing the relationship between two variables - that is, whether one variable is *predicted by* another. We learned last week that a linear model with one predictor variable can be written as $$Y_i = \beta_0+\beta_1X_i+\epsilon_i$$ where $Y_i$ is the value of your response variable, $X_i$ is the value of your predictor variable, and $\epsilon_i$ is the value of the residual for that point. 

In real life, however, one predictor can rarely explain all of the patterns in your response variable. More often, a response can be predicted by a combination of predictors. For example, your height may be predicted by some combination of your parents' height, your diet and nutrition, your age, and other factors.
We write these multivariate linear models as 
$$Y = \beta_0+\beta_1X_1+ \beta_2X_2 + ... + \beta_nX_n+\epsilon$$
Notice that you can have as many predictors (X-variables) as you want, and that each one can be more or less important to the final value of Y depending on the relative size of its coefficient. Our height example might look like this:

$$Adult~height = (0.4)*(Mom's~height) + (0.4)*(Dad's~height)+2*(nutrition~score)+\epsilon$$

Our hope is that by adding predictors, we will be able to explain more and more of the variability in the response variable in the model, increasing our $R^2$ value and decreasing the size of the residuals. However, each additional predictor adds another coefficient that must be estimated to fit our model. Estimating these extra parameters decreases the degrees of freedom and decreases the power of the analysis. For this reason, you can never have more predictors than you have data points (i.e. your dataset should be longer than it is wide). You will get the best model performance when you have *many* more data points than predictors.

When assessing the fit of a multiple regression model, we always use the *adjusted $R^2$* value instead of the regular $R^2$. The adjustment takes into account how many parameters have been estimated to build the model, as well as the sample size. 

### Assumptions of multiple regression

You can think of a single variable linear model (LM) as a multiple regression model with all coefficients $B_n,~n>1$ set to 0, so all the assumptions from our one-variable linear model still apply to multiple regressions. However, we also add some new assumptions to handle possible interactions between the x-variables. One important new assumption is that no two predictors completely explain each other. 
A good way to check this is by using the pairs.panels() function from the psych package again. We use the scatter plots to look for variables that have a high correlation with each other. We are hoping that our predictors will have a linear relationship with our response variable (and we can transform our predictors to improve that fit if we need to). However, if we see any strong linear patterns between variables, we need to drop one of those predictors from the model. 

** Clarification: not all of your predictors need to be normally distributed, as long as the residuals of your model are normal. **

To summarize our assumptions for a general linear model, before conducting multiple regression we have to check that:

  + observations represent a random sample
  
  + there is no collinearity between predictors
  
  + there is some linear relationship between predictors and the response

  + residuals of the model are normal 

  + residuals of the model have equal variance


### Types of multiple regression

*ANCOVA* includes a categorical (group) variable as well as a continuous (regression) variable in the set of predictors. One key assumption of ANCOVA is that within each group, the slope (relationship between continuous predictor and response) is the same.

If you think that the slope varies depending on the value of other predictors, you need to fit a model with *interaction terms*. lm(y~x1*x2) predicts y based on x1, x2, and the interaction of x1 and x2.

A variable that doesn't have a linear relationship with your response, but may be affecting its value, can be included as a *random* or *block* effect. This is a way of accounting for confounding variables, or groups that are outside your experimental design but still could affect the value of your response. To include random effects in your model, you will need to use the function lme() instead of lm(). The syntax of a model with random effects is lme(y~x1 + x2, random = ~1|g1, data = dataset). These models will often fail to run if your dataset includes NA values, so be sure to clean the dataset using na.omit() first.

### Practice

Let's use some familiar data to practice fitting a multivariate model. Load and inspect the plant_data.csv data. Use str() to look at the structure of your data. Be sure that your continuous variables are in numeric column format, and your categorical variables are in factor format.
```{r message=FALSE}
plant <- read_csv("plant_data.csv")
str(plant) # 259 observations, 10 variables
plant$tran_number <- as.factor(plant$tran_number)
plant$day_of_week <- trimws(plant$day_of_week, which = "both") #this gets rid of the spaces around the words
plant$day_of_week <- as.factor(plant$day_of_week)
plant <- na.omit(plant)
```

Use the function pairs_panels() to check whether any of the variables exhibit significant collinearity.
```{r}
pairs.panels(plant, lm = TRUE, cor = T) #note - num_flowers and num_dand_flowers are highly correlated

plant_sub <- subset(plant, select = -num_dand_flowers)

pairs.panels(plant_sub, lm = TRUE) #now there are no very strong correlations


```

Fit a model with one predictor. Then fit a second model using that predictor plus some others that you think will help explain more of the variation. Compare the $R^2$ and adjusted $R^2$ values for these models. Is adding predictors helpful?
```{r}

# single variable linear model
fit_1var <- lm(percent_cover ~ num_leaves_in_rosette, 
               data = plant_sub)

# multiple variables linear model
fit_2var <- lm(percent_cover ~ num_leaves_in_rosette + dist_from_edge_m, 
               data = plant_sub) 

fit_3var <- lm(percent_cover ~ num_leaves_in_rosette + dist_from_edge_m + dist_from_tree_m, 
               data = plant_sub)

fit_4var <- lm(percent_cover ~ num_leaves_in_rosette + dist_from_edge_m + dist_from_tree_m + num_flowers, 
               data = plant_sub) 

fit_full <- lm(percent_cover ~ num_leaves_in_rosette + dist_from_edge_m + dist_from_tree_m + num_flowers + dand_rosette_diam_cm + day_of_week + sp_richness, 
               data = plant_sub)

# mixed effect model
mixed_effects <- lme(percent_cover ~ num_leaves_in_rosette, random = ~1|tran_number, data = na.omit(plant_sub))

#Look at diagnostic plots for each model
par(mfrow = c(2,2))
plot(fit_full) 
plot(fit_1var)
par(mfrow = c(1,1))

```

### Model selection (and how to make pretty data tables)

In lecture, we discussed how to calculate Aikake's Information Criterion (AIC) from the log-likelihood of a model. Remember that AIC is only a valid comparison between models that were fitted using exactly the same data. The absolute value of the AIC doesn't matter; we are only concerned about the *relative* value of AIC. A model with a smaller AIC value performs relatively better than a model with a larger AIC value. Don't get tripped up by negative AICs - **we are looking for overall smallest, not closest to 0**.

We can also compare models using the BIC. AIC and BIC should generally lead us to similar results, but BIC places a higher penalty on extra parameters. This helps to enforce the principle of parsimony - that a simpler model is best.

Finally, we can compare the adjusted R-squared value for each model to see how much variation in the y-variable is explained by each combination of predictors. We are looking for a model with high adjusted R-squared.

It is often convenient to print all of these values for each model as a neat table, so that your reader can compare them all at a glance. The function kable() in the package knitr allows us to print a dataframe as a table in html, pdf, or latex format (run in console, not in your markdown, to see the different formats). Look up the help documentation for kable using ?kable. You can specify left, right, or center alignment, give the number of digits to round to for the whole table or for each column individually, edit the names of your columns, and more to get the perfect beautiful table for your report.

```{r}
#calculate AIC of each model
result <- AIC(fit_1var,fit_2var,fit_3var,fit_4var,fit_full) #this will create a dataframe whose rownames are the models, with columns for the df and AIC of each model

#add other metrics to your table
models <- list(fit_1var,fit_2var,fit_3var,fit_4var,fit_full) #make sure you keep your models in the same order here as they were when you created your results table
result$BIC <- sapply(models, BIC) #add a column for BIC to the results

model_summary <- lapply(models, summary) #look up ?lapply if you have not used this function before

#now we will use a for loop to easily extract the R^2 and adj R^2 value for each model from its summary, and store them in new columns in the results table

for(i in 1:length(models)){ #this creates a variable i that starts with the value i=1
  result$rsq[i] <- model_summary[[i]]$r.squared #we assign the rsq value from model i to the i'th row of the column 'rsq' in the table 'results'
  result$adj_rsq[i] <- model_summary[[i]]$adj.r.squared #same for adjusted rsq
} #now we go back to the beginning of the for-loop, add 1 to the value of i, and do everything again


kable(result, digits = 2, align = "c")

```

### Assessing model fit

When you have chosen your best model based on AIC, BIC, adjusted $R^2$, and the principle of parsimony, you will want to check how well it fits the data. Are the residuals normally distributed? How large are the residuals? If you leave some datapoints out when you fit the model, can you use your model to predict those points accurately?

```{r}
#separate your data into a training set (most of the data) and a test set (a few observations, or <10% of rows)
splitter <- sample(1:nrow(plant_sub), 15, replace = F) #pick 15 random rows from plant_sub to reserve as test data
psub_train <- plant_sub[-splitter,] #leave those rows out of the training data
psub_test <- plant_sub[splitter,] #use them to create a set of test data

fit_4var_split <- lm(percent_cover ~ num_leaves_in_rosette + dist_from_edge_m + dist_from_tree_m + num_flowers, data = psub_train) #fit the final model, using JUST your training set as the data argument

prediction <- predict(fit_4var_split,psub_test) #use the fitted model to predict values for your test data

plot(psub_test$percent_cover, pch=1) #plot the actual test data values
points(prediction, pch=20, col = "red") #plot the model predictions for those points
```


# HOMEWORK

Using the airquality data in R, find a best model to predict ozone levels and assess how well your model performs. Treat this lab as a warm-up for your final project: make it look as neat and professional as you can.

1. Provide some background: What is ozone? Why might some of these variables affect its levels? Where does this data come from? 3-4 sentences is fine. You are required to cite at least one source when writing this response.

2. Read in the air quality data and visualize it. What are your potential predictors? What are some potential random factors? Do you think any of these predictors might interact with each other? Remove any missing data to 'clean up' the dataset.
```{r}
airqual <- airquality

```

3. Check your data for collinearity, and check whether predictors have a linear relationship with the response variable. Transform any data that violates assumptions (remember, not all predictors need to be normal as long as the model residuals are normal). Check for any major outliers. Describe what you find.

4. Fit a model with one predictor. Check its residuals for normality.

5. Fit at least two other models with different combinations of predictors. State whether you are including random effects or interaction terms.

6. Create a table of results to compare your models using AIC, BIC, and adjusted R-squared. Include the degrees of freedom of each model in the table so that you can consider parsimony in your final decision. *Note: mixed-effects models (models made with `lme()`) may not work in your table if you are comparing models made with `lm()`. Recommend making a table for models just made with `lm()`.

7. Choose the best model and explain why this was your choice. Check its residuals for normality. How well does this model fit the data? How well does it predict ozone levels?


### Appendix: using stepwise model selection (optional!)

Model selection is a very complex problem for statisticians, and we have just scratched the surface here. There are many other ways to choose appropriate combinations of predictor variables. One example is stepwise model selection. This process starts with a 'full model' (including all possible predictors) and eliminates variables one by one, calculating AIC each time. When the AIC can't be reduced any more by taking away variables or putting them back, you have the final optimized model. You can try stepwise AIC in R:

```{r}
library(MASS)
?stepAIC

fullmodel <- lm(Ozone~Solar.R+Wind+Temp+Month+Day, data = airqual)
nullmodel <- lm(Ozone~1, data = airqual)

stepAIC(fullmodel, scope = c(upper = fullmodel, lower = nullmodel), direction = "both")
```

How does including interactions affect this process? Do you get similar results from the automated selection as you did in the homework by using your biological intuition and comparing models by hand?

### Appendix: ozone in the news
https://www.npr.org/sections/health-shots/2020/05/19/854760999/traffic-is-way-down-due-to-lockdowns-but-air-pollution-not-so-much