Correlation Coefficients and Predictive Models
A vignette on how to properly use correlation coefficients when evaluating predictive models
Introduction
The correlation coefficient (R2), which indicates the proportion of variation explained by an independent variable or model, can be misused/incorectly used when evaluating predictive models. Depending on the situation, different calculations for R2 can be used to convey different meanings. The goal of this vignette is to help clear up this confusion, with a few different examples.
Prepare Data
# devtools::install_github("derekmichaelwright/agData")
library(agData)
library(readxl) # read_xlsx()
library(ggpmisc) # stat_poly_eq()
Dataset 1
Entry
: 1 Lentil GenotypeExpt
: 18 Site-yearsMacroEnv
: 3 MacroenvironmentsRep
: 1-3 Reps per site-yearT_mean
: Mean temperatureP_mean
: Mean photoperiodDTF
: Days from sowing to flowerDTM
: Days from sowing to swollen podDTM
: Days from sowing to maturityRDTF
: Rate of progress towards flowering =1 / DTF
# Prep data
<- read_xlsx("data_correlation_coefficients.xlsx", "data_1") %>%
d1 mutate(RDTF = 1 / DTF) # Rate of progress towards flowering
# Preview data
as_tibble(d1)
## # A tibble: 54 × 10
## Entry Expt MacroEnv Rep T_mean P_mean DTF DTS DTM RDTF
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 33 Ro16 Macroenvironment 1 1 17.2 16.2 45 62 96 0.0222
## 2 33 Ro16 Macroenvironment 1 2 17.2 16.2 44 64 96 0.0227
## 3 33 Ro16 Macroenvironment 1 3 17.2 16.2 44 61 92 0.0227
## 4 33 Ro17 Macroenvironment 1 1 17.5 16.4 38 60 82 0.0263
## 5 33 Ro17 Macroenvironment 1 2 17.5 16.4 39 57 80 0.0256
## 6 33 Ro17 Macroenvironment 1 3 17.5 16.4 39 57 77 0.0256
## 7 33 Su16 Macroenvironment 1 1 16.7 15.9 44 65 95 0.0227
## 8 33 Su16 Macroenvironment 1 2 16.7 15.9 45 62 94 0.0222
## 9 33 Su16 Macroenvironment 1 3 16.7 15.9 45 63 95 0.0222
## 10 33 Su17 Macroenvironment 1 1 15.7 16.1 44 64 83 0.0227
## # ℹ 44 more rows
Dataset 2
Entry
: 324 GenotypesExpt
: 3 Site-yearsMacroEnv
: 3 MacroenvironmentsDTF
: Days from sowing to flowerPredicted_DTF
: Predicted values of DTF
# Prep data
<- read_xlsx("data_correlation_coefficients.xlsx", "data_2") %>%
d2 mutate(Expt = factor(Expt, levels = c("Saskatchewan", "Nepal", "Italy")))
# Preview data
as_tibble(d2)
## # A tibble: 972 × 5
## Entry Expt MacroEnv DTF Predicted_DTF
## <dbl> <fct> <chr> <dbl> <dbl>
## 1 1 Saskatchewan Macroenvironment 1 53.7 51.6
## 2 1 Nepal Macroenvironment 2 123 98.0
## 3 1 Italy Macroenvironment 3 137. 145.
## 4 2 Saskatchewan Macroenvironment 1 57.3 55.3
## 5 2 Nepal Macroenvironment 2 124. 93.5
## 6 2 Italy Macroenvironment 3 128 137.
## 7 3 Saskatchewan Macroenvironment 1 59 57.8
## 8 3 Nepal Macroenvironment 2 122. 96.0
## 9 3 Italy Macroenvironment 3 134 142.
## 10 4 Saskatchewan Macroenvironment 1 56 53.7
## # ℹ 962 more rows
Correlation Between Two Traits
The most obvious correlation is that of two traits. For example,
since DTF
, DTS
, and DTM
are all
phenology traits, we can assume they will be highly correlated. Lets
plot this out and see how it looks.
Interpretations:
- 88.3% of variation in
DTS
is explained byDTF
. - 92.7% of variation in
DTM
is explained byDTS
. - 88.7% of variation in
DTM
is explained byDTF
.
# Prep data
<- "www.dblogr.com/ or derekmichaelwright.github.io/dblogr/ | Data: AGILE"
myCaption <- c("darkgreen","darkorange","darkblue")
myColors # Remove any rows with missing data
<- d1 %>% filter(!is.na(DTF), !is.na(DTS), !is.na(DTM))
d1 # DTF - DTS
<- ggplot(d1, aes(x = DTF, y = DTS)) +
mp1 geom_smooth(method = "lm", se = F, color = "red") +
geom_point(aes(fill = MacroEnv), size = 3, pch = 21, alpha = 0.7) +
scale_fill_manual(name = NULL, values = myColors) +
stat_poly_eq(aes(label = ..rr.label..), rr.digits = 3, size = 4,
formula = y ~ x, parse = T, geom = "label_npc") +
theme_agData() +
labs(title = "A) DTF x DTS", x = "Days to Flower", y = "Days to Swollen Pods")
# DTS - DTM
<- ggplot(d1, aes(x = DTS, y = DTM)) +
mp2 geom_smooth(method = "lm", se = F, color = "red") +
geom_point(aes(fill = MacroEnv), size = 3, pch = 21, alpha = 0.7) +
scale_fill_manual(name = NULL, values = myColors) +
stat_poly_eq(aes(label = ..rr.label..), rr.digits = 3, size = 4,
formula = y ~ x, parse = T, geom = "label_npc") +
theme_agData() +
labs(title = "B) DTS x DTM", x = "Days to Swollen Pods", y = "Days to Maturity")
# DTF - DTM
<- ggplot(d1, aes(x = DTF, y = DTM)) +
mp3 geom_smooth(method = "lm", se = F, color = "red") +
geom_point(aes(fill = MacroEnv), size = 3, pch = 21, alpha = 0.7) +
scale_fill_manual(name = NULL, values = myColors) +
stat_poly_eq(aes(label = ..rr.label..), rr.digits = 3, size = 4,
formula = y ~ x, parse = T, geom = "label_npc") +
theme_agData() +
labs(title = "C) DTF x DTM", x = "Days to Flower", y = "Days to Maturity",
caption = myCaption)
# Append and Save
<- ggarrange(mp1, mp2, mp3, nrow = 1, align = "h",
mp common.legend = T, legend = "bottom")
ggsave("correlation_coefficients_01.png", mp, width = 10, height = 4, bg = "white")
Calculating R2
Pearson’s R2
R2 can be calculated using the cor
function which has method = "pearson"
as the default.
cor(x = d1$DTF, y = d1$DTS, method = "pearson")^2
## [1] 0.8833893
cor(x = d1$DTS, y = d1$DTM, method = "pearson")^2
## [1] 0.9272951
cor(x = d1$DTF, y = d1$DTM, method = "pearson")^2
## [1] 0.8868398
R2 can also be manually calculated using the following formula to calculate Pearson’s r:
\(r=\frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{(n\sum x^2 - (\sum x)^2)(n\sum y^2 - (\sum y)^2)}}\)
where:
- \(x\) = Independant variable
- \(y\) = Dependant Variable
- \(n\) = Number of observations
and
\(R^2=r^2\)
Now lets manually create a function to calculate R2 using Pearson’s r.
<- function(x, y) {
pearsonsR2 <- length(x)
n <- n * sum(x*y) - sum(x) * sum(y)
r_numerator <- sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))
r_denominator <- r_numerator / r_denominator
r ^2
r
}pearsonsR2(x = d1$DTF, y = d1$DTS)
## [1] 0.8833893
pearsonsR2(x = d1$DTS, y = d1$DTM)
## [1] 0.9272951
pearsonsR2(x = d1$DTF, y = d1$DTM)
## [1] 0.8868398
Switching the x
and y
variables gives the
same result.
pearsonsR2(x = d1$DTM, y = d1$DTF)
## [1] 0.8868398
cor(x = d1$DTM, y = d1$DTF, method = "pearson")^2
## [1] 0.8868398
Note that the stat_poly_eq
function included the option
formula = y ~ x
, which creates a linear regression of the
x
and y
variables. Our R2
is a measure how how well the DTM
data matches the
predictions of DTM based on our DTF
data and
linear regression model.
# Perform linear regression
<- lm(DTM ~ DTF, data = d1)
myModel summary(myModel)
##
## Call:
## lm(formula = DTM ~ DTF, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.443 -6.744 -1.465 5.132 23.879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.91678 3.82211 12.54 2.78e-16 ***
## DTF 0.95699 0.05096 18.78 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.601 on 45 degrees of freedom
## Multiple R-squared: 0.8868, Adjusted R-squared: 0.8843
## F-statistic: 352.7 on 1 and 45 DF, p-value: < 2.2e-16
summary(myModel)$r.squared
## [1] 0.8868398
# Get predicted values and residual values from model
<- d1 %>% mutate(Predicted_DTM = predict(myModel),
d1 Residuals_DTM = residuals(myModel) )
We can also calculate R2 by replacing
DTF
with with the Predicted_DTM
values from
our linear regression model.
pearsonsR2(x = d1$Predicted_DTM, y = d1$DTM)
## [1] 0.8868398
Again, switching the x and y variables gives the same result.
pearsonsR2(x = d1$DTM, y = d1$Predicted_DTM)
## [1] 0.8868398
Sum of Squares R2
If you have Observed and Predicted values, R2 can also be calculated using the Sum of Squares formula:
\(R^2=1-\frac{SS_{residuals}}{SS_{total}}=1-\frac{\sum (o-p)^2}{\sum (o-\bar{o})^2}\)
where:
- \(o\) = Observed value
- \(p\) = Predicted value
- \(\bar{o}=\frac{\sum o}{n}\) = Mean of observed values
Now lets manually create a function to calculate R2 using the Sum of Squares.
<- function(o, p) { 1 - ( sum((o - p)^2) / sum((o - mean(o))^2) ) } SumOfSquaresR2
Create a plot to visualize the Total Sum of Squares and Residual Sum of Squares
# Prep data
<- d1 %>% filter(Expt %in% c("Ro17","Ne17","Sp17","It17"), Rep == 2)
xx # Total Sum of Squares Plot
<- ggplot(xx, aes(x = DTF, y = DTM)) +
mp1 geom_text(x = 40, y = 120, size = 5,
label = expression(italic(bar("y"))), parse = T) +
geom_rect(alpha = 0.3, fill = "coral", color = alpha("black",0.5),
aes(xmin = DTF, xmax = DTF + (mean(d1$DTM, na.rm = T) - DTM),
ymin = DTM, ymax = mean(d1$DTM, na.rm = T))) +
geom_hline(yintercept = mean(d1$DTM, na.rm = T), color = "coral") +
geom_point(size = 2, ) +
geom_point(data = d1,alpha = 0.3) +
theme_agData() +
labs(title = "A) Total Sum of Squares",
y = "Days to Maturity", x = "Days to Flower")
# Residual Sum of Squares Plot
<- ggplot(xx, aes(x = DTF, y = DTM)) +
mp2 geom_text(x = 70, y = 120, size = 5,
label = expression(italic("f")), parse = T) +
geom_rect(alpha = 0.3, fill = "darkorchid", color = alpha("black",0.5),
aes(xmin = DTF, xmax = DTF - Residuals_DTM,
ymin = DTM, ymax = Predicted_DTM)) +
geom_smooth(data = d1, method = "lm", se = F, color = "darkorchid") +
geom_point(size = 2) +
geom_point(data = d1, alpha = 0.3) +
theme_agData() +
labs(title = "B) Residual Sum of Squares",
y = "Days to Maturity", x = "Days to Flower",
caption = myCaption)
# Appened
<- ggarrange(mp1, mp2, ncol = 2, align = "h")
mp ggsave("correlation_coefficients_02.png", mp, width = 8, height = 4)
In this case:
- \(o\) = Observed value =
DTM
- \(p\) = Predicted value =
Predicted_DTM
- \(\bar{o}\) = Mean of observed
values =
mean(DTM, na.rm = T)
Now lets calculate R2 using the Sum of Squares formula.
SumOfSquaresR2(o = d1$DTM, p = d1$Predicted_DTM)
## [1] 0.8868398
Now, Swaping the x
and y
variables results
in different values. Why?
SumOfSquaresR2(p = d1$DTM, o = d1$Predicted_DTM)
## [1] 0.8724006
It is important to set DTM
as the observed
variable and Predicted_DTM
as the predicted
variable. This is becuase with the sum of squares model, our trendline
has a slope = 1
and intercept = 0
(geom_abline
), which is matched by
lm(DTM~Predicted_DTM)
but not
lm(Predicted_DTM~DTM)
.
# Intercept = 0, Slope = 1
round(lm(DTM ~ Predicted_DTM, data = d1)$coefficients, 3)
## (Intercept) Predicted_DTM
## 0 1
# Intercept != 0, Slope != 1
round(lm(Predicted_DTM ~ DTM, data = d1)$coefficients, 3)
## (Intercept) DTM
## 12.980 0.887
This can be visualized by showing how the trendline
(geom_smooth(method = "lm")
) devaites from the 1:1 line
(geom_abline()
).
<- min(c(d1$DTM, d1$Predicted_DTM))
mymin <- max(c(d1$DTM, d1$Predicted_DTM))
mymax <- ggplot(d1, aes(y = DTM, x = Predicted_DTM)) +
mp1 geom_smooth(method = "lm", se = F, size = 2, color = "red") +
geom_abline(color = "blue") +
geom_point(aes(fill = MacroEnv), size = 3, pch = 21, alpha = 0.7) +
scale_fill_manual(values = c("darkgreen","darkorange","darkblue")) +
xlim(c(mymin, mymax)) +
ylim(c(mymin, mymax)) +
theme_agData() +
labs(title = "A)")
<- ggplot(d1, aes(x = DTM, y = Predicted_DTM)) +
mp2 geom_smooth(method = "lm", formula = y ~ x, se = F, size = 2, color = "red") +
geom_abline(color = "blue") +
geom_point(aes(fill = MacroEnv), size = 3, pch = 21, alpha = 0.7) +
scale_fill_manual(values = c("darkgreen","darkorange","darkblue")) +
xlim(c(mymin, mymax)) +
ylim(c(mymin, mymax)) +
theme_agData() +
labs(title = "B)", caption = myCaption)
<- ggarrange(mp1, mp2, ncol = 2, legend = "none", align = "h")
mp ggsave("correlation_coefficients_03.png", mp, width = 8, height = 4)
Why is this important? Improper usage can lead to
inncorrect interpretations. The example above involved inncorrectly
calculating R2 by swapping our observed and
predicted variables within the Sum of Squares (SS)
formula. Another example would be using the Sum of Squares
formula without predictions vs. observations. E.g., if we
calculate R2 using our SumOfSquaresR2
function with DTF
vs DTM
.
SumOfSquaresR2(o = d1$DTF, p = d1$DTM)
## [1] -1.789874
<- min(d1$DTF)
mymin <- max(d1$DTM)
mymax <- ggplot(d1, aes(x = DTF, y = DTM)) +
mp1 geom_abline(color = "blue") +
geom_smooth(method = "lm", se = F, color = "red") +
geom_segment(aes(xend = DTF, yend = Predicted_DTM)) +
geom_point(aes(fill = MacroEnv), size = 3, pch = 21, alpha = 0.7) +
scale_fill_manual(values = c("darkgreen","darkorange","darkblue")) +
xlim(c(mymin, mymax)) + ylim(c(mymin, mymax)) +
theme_agData() +
labs(title = substitute(paste("A) ", italic("Pearson's R")^2, " = ", r2),
list(r2 = round(pearsonsR2(x = d1$DTF, y = d1$DTM), 3))))
<- ggplot(d1, aes(x = DTF, y = DTM)) +
mp2 geom_abline(color = "blue") +
geom_smooth(method = "lm", se = F, color = "red") +
geom_segment(aes(xend = DTF, yend = DTF)) +
geom_point(aes(fill = MacroEnv), size = 3, pch = 21, alpha = 0.7) +
scale_fill_manual(values = c("darkgreen","darkorange","darkblue")) +
xlim(c(mymin, mymax)) + ylim(c(mymin, mymax)) +
theme_agData() +
labs(title = substitute(paste("B) ", italic("Sum of Squares R")^2, " = ", r2),
list(r2 = round(SumOfSquaresR2(d1$DTF, d1$DTM), 3))),
caption = myCaption)
<- ggarrange(mp1, mp2, nrow= 1, ncol = 2, legend = "none", align = "h")
mp ggsave("correlation_coefficients_04.png", mp, width = 8, height = 4)
However, the more common mistake happens when the Pearson’s formula is used for evaluating the accuracy of a model used to predict values.
<- min(c(d2$DTF, d2$Predicted_DTF))
mymin <- max(c(d2$DTF, d2$Predicted_DTF))
mymax <- ggplot(d2, aes(x = Predicted_DTF, y = DTF)) +
mp geom_point(aes(color = MacroEnv), alpha = 0.3) +
geom_abline(color = "blue") +
geom_smooth(method = "lm", se = F, color = "red") +
facet_wrap(Expt~., scales = "free", ncol = 3) +
stat_poly_eq(formula = y ~ x, parse = T, rr.digits = 7) +
scale_color_manual(values = c("darkgreen","darkorange","darkblue")) +
xlim(c(mymin, mymax)) + ylim(c(mymin, mymax)) +
theme_agData(legend.position = "none") +
labs(title = expression(paste("Incorrect usage of ",
italic("Pearson's R")^2)),
caption = myCaption)
ggsave("correlation_coefficients_05.png", mp, width = 8, height = 4)
<- function(expt, color) {
my_ggplot # Prep data
<- d2 %>% filter(Expt == expt)
xx <- xx %>%
xx mutate(Trendline = predict(lm(DTF ~ Predicted_DTF, data = xx)))
<- min(c(xx$DTF, xx$Predicted_DTF))
mymin <- max(c(xx$DTF, xx$Predicted_DTF))
mymax # Plot
<- ggplot(xx, aes(x = Predicted_DTF, y = DTF)) +
mp1 geom_segment(aes(xend = Predicted_DTF, yend = Trendline), alpha = 0.2) +
geom_smooth(method = "lm", se = F, color = "red") +
geom_point(color = color, alpha = 0.5) +
geom_abline(color = "blue") +
xlim(c(mymin, mymax)) + ylim(c(mymin, mymax)) +
theme_agData() +
labs(title = expt,
subtitle = substitute(paste(italic("Pearson's R")^2, " = ", r2),
list(r2 = round(pearsonsR2(x = xx$DTF, y = xx$Predicted_DTF), 3))))
<- ggplot(xx, aes(x = Predicted_DTF, y = DTF)) +
mp2 geom_segment(aes(xend = Predicted_DTF, yend = Predicted_DTF), alpha = 0.2) +
geom_point(color = color, alpha = 0.5) +
geom_abline(color = "blue") +
stat_poly_eq(formula = y ~ x, parse = T, rr.digits = 7) +
xlim(c(mymin, mymax)) + ylim(c(mymin, mymax)) +
theme_agData() +
labs(subtitle = substitute(paste(italic("Sum of Squares R")^2, " = ", r2),
list(r2 = round(SumOfSquaresR2(xx$DTF, xx$Predicted_DTF), 3))),
caption = myCaption)
ggarrange(mp1, mp2, ncol = 2, align = "h")
}
<- my_ggplot("Saskatchewan", color = "darkgreen")
mp ggsave("correlation_coefficients_06.png", mp, width = 8, height = 4)
<- my_ggplot("Nepal", color = "darkorange")
mp ggsave("correlation_coefficients_07.png", mp, width = 8, height = 4)
<- my_ggplot("Italy", color = "darkblue")
mp ggsave("correlation_coefficients_08.png", mp, width = 8, height = 4)
Negative R2 values
Using Pearson’s formula R2 values will always fall between 0 and 1. However, when using the Sum of Squares formula (which has an intercept of 0), negative values can be aquired.
# Calculate R^2 for Ne17
<- d2 %>% filter(Expt == "Nepal")
xx SumOfSquaresR2(o = xx$DTF, p = xx$Predicted_DTF)
## [1] -2.581473
R2 for Ne17
is < 0, What does
this mean? and how do we interpret it?
R2 = 1 = regression is perfect, no errors
since,
\(R^2=1-\frac{SS_{residuals}}{SS_{total}}=1-\frac{0}{\sum (x-\bar{x})}=1\)
# Calculate R^2 of perfect regression
SumOfSquaresR2(o = xx$DTF, p = xx$DTF)
## [1] 1
R2 = 0 = regression is no better than using the mean
since,
\(y=\bar{x}\)
and,
\(R^2=1-\frac{SS_{residuals}}{SS_{total}}=1-\frac{\sum (x-\bar{x})}{\sum (x-\bar{x})}=1-1=0\)
# Prep data
<- xx %>% mutate(Mean_DTF = mean(DTF))
xx # Calculate R^2 with mean
SumOfSquaresR2(o = xx$DTF, p = xx$Mean_DTF)
## [1] 0
R2 < 0 = regression is worse than using the mean.
\(R^2=1-\frac{SS_{residuals}}{SS_{total}}=1-(>1)=(<0)\)
# Calculate the Sum of Squares for the residuals
sum((xx$DTF - xx$Predicted_DTF)^2)
## [1] 226005
# Calculate the total Sum of Squares
sum((xx$DTF - mean(xx$DTF))^2)
## [1] 63103.93
\(SS_{residuals} > SS_{total}\)
<- paste("Total Sum of Squares =",
t1 round(sum((xx$DTF - mean(xx$DTF))^2)))
<- paste("Residual Sum of Squares =",
t2 round(sum((xx$DTF - xx$Predicted_DTF)^2)))
<- ggplot(xx, aes(x = Predicted_DTF, y = DTF)) +
mp1 geom_hline(yintercept = mean(xx$DTF), color = "coral", size = 2) +
geom_segment(aes(xend = Predicted_DTF, yend = Mean_DTF), alpha = 0.2) +
geom_point(alpha = 0.5) +
theme_agData() +
labs(title = "Nepal", subtitle = t1)
<- ggplot(xx, aes(x = Predicted_DTF, y = DTF)) +
mp2 geom_abline(color = "darkorchid", size = 2) +
geom_segment(aes(xend = Predicted_DTF, yend = Predicted_DTF), alpha = 0.2) +
geom_point(alpha = 0.5) +
theme_agData() +
labs(title = "", subtitle = t2, caption = myCaption)
<- ggarrange(mp1, mp2, ncol = 2, align = "h")
mp ggsave("correlation_coefficients_07.png", mp, width = 8, height = 4)
Effect of Range
Another factor to keep in mind, when considering R2, is the effect of the range of the data.
<- d1 %>% filter(MacroEnv == "Macroenvironment 1") %>% mutate(Range = "Low")
x1 <- x1 %>% mutate(DTF = DTF + 20, DTM = DTM + 20, Range = "High",
x2 MacroEnv = "Macroenvironment 2")
<- bind_rows(x1, x2) %>% mutate(Range = "Both")
x3 <- bind_rows(x1, x2, x3) %>%
xx mutate(Range = factor(Range, levels = c("Low","High","Both")))
<- ggplot(xx, aes(x = DTF, y = DTM)) +
mp geom_point(aes(color = MacroEnv)) +
geom_smooth(method = "lm", se = F) +
stat_poly_eq(formula = y ~ x, parse = T, rr.digits = 3) +
facet_grid(.~Range) +
theme_agData(legend.position = "none") +
labs(caption = myCaption)
ggsave("correlation_coefficients_08.png", mp, width = 6, height = 4)
RMSE
The The Root-Mean-Square Error (RMSE) is a measure of the differences between observed and predicted, or an average deviation of the predicted vs observed values.
\(RMSE=\frac{\sum (o-p^2}{n}\)
- \(o\) = Observed value
- \(p\) = Predicted value
- \(n\) = Number of observations
# Create RMSE function
<- function(o, p) {
modelRMSE sqrt(sum((o-p)^2) / (length(o)))
}# Calculate RMSE for Ro17
<- d2 %>% filter(Expt == "Saskatchewan")
xx modelRMSE(xx$DTF, xx$Predicted_DTF)
## [1] 1.610582
# Calculate RMSE for Ne17
<- d2 %>% filter(Expt == "Nepal")
xx modelRMSE(xx$DTF, xx$Predicted_DTF)
## [1] 26.4111
interpretation: the standard deviation of the
unexplained variance in Ro17
is 0.95
and in
Ne17
is 24.4
.
Final Plots
Note: for easier interpretation, the x
and
y
axis have been swapped, since overpredictions
will be above the geom_abline
and underpredictions
below.
<- function(expts, colors) {
my_ggplot # Prep data
<- d2 %>% filter(Expt %in% expts)
xx <- round(SumOfSquaresR2(o = xx$DTF, p = xx$Predicted_DTF), 3)
r2 <- round(modelRMSE(o = xx$DTF, p = xx$Predicted_DTF), 1)
rmse <- min(c(xx$DTF, xx$Predicted_DTF))
mymin <- max(c(xx$DTF, xx$Predicted_DTF))
mymax # Plot
ggplot(xx, aes(x = DTF, y = Predicted_DTF)) +
geom_point(aes(fill = Expt), pch = 21, size = 2, alpha = 0.7) +
geom_abline(color = "blue") +
ylim(c(mymin, mymax)) +
xlim(c(mymin, mymax)) +
scale_fill_manual(values = colors) +
theme_agData(legend.position = "none") +
labs(y = "Predicted DTF", x = "Observed DTF",
title = substitute(
paste(italic("R")^2, " = ", r2, " | ", italic("RMSE"), " = ", rmse),
list(r2 = r2, rmse = rmse)))
}<- my_ggplot("Saskatchewan", "darkgreen") + facet_grid(. ~ Expt)
mp1 <- my_ggplot("Nepal", "darkorange") + facet_grid(. ~ Expt)
mp2 <- my_ggplot("Italy", "darkblue") + facet_grid(. ~ Expt) +
mp3 labs(caption = myCaption)
<- ggarrange(mp1, mp2, mp3, ncol = 3)
mp ggsave("correlation_coefficients_09.png", mp, width = 10, height = 4)
<- my_ggplot(expts = c("Saskatchewan","Nepal","Italy"),
mp colors = c("darkgreen","darkorange","darkblue")) +
theme(legend.position = "bottom")
ggsave("correlation_coefficients_10.png", mp, width = 6, height = 4)
Model Evaluation
Next we will evaluate a Photothermal Model which
describes the reciprical of DTF (RDTF
) as a linear function
of temperature and photoperiod:
\(\frac{1}{f}=a+b\overline{T}+c\overline{P}\)
where:
- \(f\) = Days from sowing to flower
(
DTF
) - \(\overline{T}\) = Mean temperature
(
T_mean
) - \(\overline{P}\) = Mean photoperiod
(
P_mean
) - \(a,b,c\) = Genotype specific constants
# Perform linear regression
<- lm(RDTF ~ T_mean + P_mean, data = d1)
myModel summary(myModel)
##
## Call:
## lm(formula = RDTF ~ T_mean + P_mean, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0038951 -0.0005957 -0.0001821 0.0004946 0.0086038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.843e-02 1.894e-03 -9.730 1.54e-12 ***
## T_mean 7.888e-04 8.688e-05 9.079 1.20e-11 ***
## P_mean 1.761e-03 1.223e-04 14.395 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002006 on 44 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8843
## F-statistic: 176.8 on 2 and 44 DF, p-value: < 2.2e-16
In this case we now have multiple independant varables and cannot
correlate RDTF
with T_mean
+
P_mean
using Pearson’s formula. Instead we will
correlate RDTF
with the Predicted_RDTF
values
that come from the model.
# Get predicted values and residual values from model
<- d1 %>% mutate(Predicted_RDTF = predict(myModel),
d1 Predicted_DTF = 1 / predict(myModel),
Residuals_RDTF = residuals(myModel),
Residuals_DTF = abs( (1/DTF) - (1/Predicted_DTF) ))
Calculate R2
# Calculate R^2 using cor function
cor(d1$RDTF, d1$Predicted_RDTF)^2
## [1] 0.8893289
# Calculate R^2 using Pearson's formula
pearsonsR2(x = d1$RDTF, y = d1$Predicted_RDTF)
## [1] 0.8893289
# Calculate R^2 using SS formula
SumOfSquaresR2(o = d1$RDTF, p = d1$Predicted_RDTF)
## [1] 0.8893289
Each formula gives the same R2 result.
<- min(c(d1$RDTF, d1$Predicted_RDTF))
mymin <- max(c(d1$RDTF, d1$Predicted_RDTF))
mymax <- ggplot(d1, aes(x = Predicted_RDTF, y = RDTF)) +
mp geom_smooth(method = "lm", se = F, size = 2, color = "red") +
geom_abline(color = "blue") +
geom_segment(aes(yend = Predicted_RDTF, xend = Predicted_RDTF)) +
geom_point(aes(fill = MacroEnv, size = abs(Residuals_RDTF)),
pch = 21, alpha = 0.7) +
scale_fill_manual(values = c("darkgreen","darkorange","darkblue")) +
ylim(c(mymin, mymax)) +
xlim(c(mymin, mymax)) +
theme_agData(legend.position = "none") +
labs(title = substitute(paste(italic("R")^2, " = ", r2),
list(r2 = round(SumOfSquaresR2(d1$DTF, d1$Predicted_DTF), 3))),
caption = myCaption)
ggsave("correlation_coefficients_11.png", mp, width = 6, height = 4)
However, we are interested in DTF
and not
RDTF
. Let see how R2 changes when we
calculate it for DTF
instead of RDTF
.
# Calculate R^2 using cor function
cor(x = d1$DTF, y = d1$Predicted_DTF)^2
## [1] 0.8823158
# Calculate R^2 using Pearson's formula
pearsonsR2(x = d1$DTF, y = d1$Predicted_DTF)
## [1] 0.8823158
# Calculate R^2 using SS formula
SumOfSquaresR2(o = d1$DTF, p = d1$Predicted_DTF)
## [1] 0.8775759
Notice the how the values of R2 from
cor
or pearssonR2
and
SumOfSquaresR2
do not match. Why is this occuring?
<- min(c(d1$DTF, d1$Predicted_DTF))
mymin <- max(c(d1$DTF, d1$Predicted_DTF))
mymax <- ggplot(d1, aes(y = DTF, x = Predicted_DTF)) +
mp geom_smooth(method = "lm", se = F, size = 2, color = "red") +
geom_abline(color = "blue") +
geom_segment(aes(yend = Predicted_DTF, xend = Predicted_DTF)) +
geom_point(aes(fill = Expt, size = abs(Residuals_DTF)),
pch = 21, alpha = 0.7) +
ylim(c(mymin, mymax)) +
xlim(c(mymin, mymax)) +
theme_agData(legend.position = "none") +
labs(title = substitute(paste(italic("R")^2, " = ", r2),
list(r2 = round(SumOfSquaresR2(d1$DTF, d1$Predicted_DTF), 3))),
caption = myCaption)
ggsave("correlation_coefficients_12.png", mp, width = 6, height = 4)
Now that we’ve transformed the data, the geom_abline
and
geom_smooth
lines no longer perfectly overlap, which causes
the slight difference in R2.
However, this still leaves open the questions of why
geom_abline()
and geom_smooth
no longer
overlap after transforming the data? Lets try and visualize this with
test data.
# Prep data
<- data.frame(x = 1:10, y = 1:10 + 0.5) %>%
xx mutate(Residuals = abs(y-x))
<- sum((xx$x - mean(xx$x))^2)
SSt <- sum((xx$x - xx$y)^2)
SSr <- round(1 - SSr / SSt, 4)
r2 # Plot
<- ggplot(xx, aes(x = x, y = y)) +
mp1 geom_abline() +
geom_segment(aes(xend = x, yend = x)) +
geom_point(aes(size = Residuals), alpha = 0.7) +
theme_agData() +
labs(title = substitute(paste(italic("R")^2,
" = 1 - (", SSr, "/",SSt, ") = ", r2 ),
list(SSr = SSr, SSt = SSt, r2 = r2)))
# Prep data
<- xx %>% mutate(Residuals = abs((1/y)-(1/x)))
xx <- sum((1 / xx$x - mean(1 / xx$x))^2)
SSt <- sum((1 / xx$x - 1 / xx$y)^2)
SSr <- round(1 - SSr / SSt, 4)
r2 # Plot
<- ggplot(xx, aes(x = 1 / x, y = 1 / y)) +
mp2 geom_abline() +
geom_segment(aes(xend = 1 / x, yend = 1 / x)) +
geom_point(aes(size = Residuals), alpha = 0.7) +
theme_agData() +
labs(title = substitute(paste(italic("R")^2,
" = 1 - (", SSr, "/",SSt, ") = ", r2 ),
list(SSr = round(SSr, 3), SSt = round(SSt, 3), r2 = r2)),
caption = myCaption)
# Append Plots
<- ggarrange(mp1, mp2, ncol = 2, legend = "none", align = "h")
mp ggsave("correlation_coefficients_13.png", mp, width = 8, height = 4)
Visualizing the Photothermal model
<- d1$T_mean
x <- d1$P_mean
y <- d1$RDTF
z <- as.numeric(as.factor(d1$MacroEnv))
cv <- lm(z ~ x + y)
fit # Create PhotoThermal plane
<- predict(fit)
fitpoints = 12
grid.lines <- seq(min(x), max(x), length.out = grid.lines)
x.pred <- seq(min(y), max(y), length.out = grid.lines)
y.pred <- expand.grid(x = x.pred, y = y.pred)
xy <- matrix(predict(fit, newdata = xy),
z.pred nrow = grid.lines, ncol = grid.lines)
# Plot with regression plane
png("correlation_coefficients_14.png", width = 1000, height = 1000, res = 200)
par(mar=c(1.5, 2.5, 1.5, 0.5))
::scatter3D(x, y, z, pch = 18, cex = 2, zlim = c(0.005,0.03),
plot3Dcol = alpha(c("darkgreen","darkorange","darkblue"),0.5),
colvar = cv, colkey = F, col.grid = "grey", bty = "u",
theta = 40, phi = 25, ticktype = "detailed", cex.lab = 1, cex.axis = 0.5,
xlab = "Temperature", ylab = "Photoperiod", zlab = "1 / DTF",
surf = list(x = x.pred, y = y.pred, z = z.pred, col = "black",
facets = NA, fit = fitpoints), main = "PhotoThermal Model")
dev.off()