GWAS Tutorial with gwaspr
An R tutorial on running genome-wide association studies (GWAS) with GAPIT and gwaspr
Introduction
This vignette is a tutorial on running Genome-Wide
Association Studies (GWAS) in
with the packages GAPIT
and plotting the results with gwaspr
.
- GAPIT Website: https://www.maizegenetics.net/gapit
- GAPIT Manual: http://www.zzlab.net/GAPIT/gapit_help_document.pdf
- GAPIT Forum: https://groups.google.com/forum/#!forum/gapit-forum
- GAPIT3 github: https://github.com/jiabowang/GAPIT
- gwaspr Website: https://derekmichaelwright.github.io/gwaspr/
Data
myY.csv
: Phenotype filemyG_hmp.csv
: Genotype file in hapmap formatmyCV.csv
: (Optional) Covariate filemyM.csv
: (Optional) Metadata file for the genotypes.
All phenotype and genotype data is derived from and available at: https://knowpulse.usask.ca/
Associated papers:
For this tutorial, we will use a Lentil Diversity Panel (LDP) consisting of 324 Lens culinaris cultivars, representative of the global genetic diversity of cultivated lentils.
Prepare Data
# devtools::install_github("derekmichaelwright/gwaspr")
library(gwaspr)
myCaption <- "www.dblogr.com/ or derekmichaelwright.github.io/dblogr/ | Data: AGILE"
Our population (The LDP) has been phenotyped for three traits:
- Testa_Pattern: a qualitative trait describing the presence or absence of seed coat pigmentation.
- DTF_Nepal_2017: a quantitative trait describing days from sowing to flowering in a 2017 Nepal field trial.
- DTF_Sask_2017: a quantitative trait describing days from sowing to flowering in a 2017 Saskatchewan field trial.
First we must convert our qualitative trait
Testa_Pattern from character
to
numeric
format.
myY <- myY %>%
mutate(Testa_Pattern = plyr::mapvalues(Testa_Pattern,
c("Absent", "Present"), 1:2),
Testa_Pattern = as.numeric(Testa_Pattern))
DT::datatable(myY)
# Prep data
xx <- myY %>%
gather(Trait, Value, 2:ncol(.)) %>%
mutate(Trait = factor(Trait, levels = unique(Trait)))
# Plot
mp <- ggplot(xx, aes(x = Value)) +
geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
facet_wrap(. ~ Trait, ncol = 3, scales = "free") +
theme_gwaspr() +
labs(caption = myCaption)
ggsave("Figures_01_myY_Phenotypes.png", mp, width = 8, height = 3)
GWAS Results
Genotype Data
The LDP was genotyped using an exome caputure array and
filtered with the following criteria to yield a data set of 276,984
SNPs, stored in our object myG
.
- Only Bi-allelic = TRUE
- Minimum Read Depth = 3
- Minor Allele Frequency = >5%
- Maximum Missing Frequency = 20%
- Heterozygous = <20%
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 rs alleles chrom pos strand assembly center protLSID assayLSID panel QCcode
## 2 Lcu.2RBY.Chr1p58402 A/G 1 58402 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 Lcu.2RBY.Chr1p58637 G/A 1 58637 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 Lcu.2RBY.Chr1p58679 A/C 1 58679 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 Lcu.2RBY.Chr1p58694 C/G 1 58694 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 Lcu.2RBY.Chr1p75442 A/G 1 75442 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 7 Lcu.2RBY.Chr1p75504 C/T 1 75504 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 8 Lcu.2RBY.Chr1p75509 G/A 1 75509 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 9 Lcu.2RBY.Chr1p75714 A/G 1 75714 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 10 Lcu.2RBY.Chr1p76580 C/A 1 76580 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## V12 V13 V14 V15 V16 V17
## 1 X3156.11_AGL CDC_Asterix_AGL CDC_Cherie_AGL CDC_Glamis_AGL CDC_Gold_AGL CDC_Greenstar_AGL
## 2 G A G A G A
## 3 G G G A G G
## 4 C C C C C A
## 5 G G G G G C
## 6 G G N G G A
## 7 T C N C T C
## 8 G G N A G G
## 9 A A A A A A
## 10 M C M C M C
Covariates + Kinship
GAPIT will automatically calculate kinship matrices and performs PCA
on the genotype data to be used as a covariate to account for population
structure. However, custom Kinship matrices and covariates can be used
as well. Just make sure they are formatted similar to the output files
GAPIT
uses for covariates and kinship.
GWAS_Results/GAPIT.Genotype.PCA.csv
GWAS_Results/GAPIT.Genotype.Kin_Zhang.csv
## taxa PC1 PC2 PC3 PC4
## 1 X3156.11_AGL -170.7836 101.24064 -3.0070146 -21.380547
## 2 CDC_Asterix_AGL -163.4179 159.07059 5.0423668 -50.317241
## 3 CDC_Cherie_AGL -154.5389 80.35912 -1.2489458 -9.462448
## 4 CDC_Glamis_AGL -179.1593 106.34175 5.0356748 -51.139585
## 5 CDC_Gold_AGL -144.5843 128.74179 20.5079527 -21.636310
## 6 CDC_Greenstar_AGL -184.4032 80.99638 2.3650361 -56.905163
## 7 CDC_Imax_AGL -167.9899 31.88485 4.5540258 -29.114839
## 8 CDC_Impower_AGL -165.3959 93.93274 1.2153189 -56.738656
## 9 CDC_KR.1_AGL -168.3731 54.97180 -1.8331581 -29.284140
## 10 CDC_LeMay_AGL -162.2348 183.66423 -0.1152152 -50.209390
## V1 V2 V3 V4 V5
## 1 X3156.11_AGL 1.4251644 0.7143993 0.9062549 0.7950806
## 2 CDC_Asterix_AGL 0.7143993 1.4281958 0.6417335 0.6974319
## 3 CDC_Cherie_AGL 0.9062549 0.6417335 1.3657722 0.7897873
## 4 CDC_Glamis_AGL 0.7950806 0.6974319 0.7897873 1.4109575
## 5 CDC_Gold_AGL 0.7776084 0.6963957 0.7141069 1.1412520
## 6 CDC_Greenstar_AGL 0.8136061 0.6937040 0.7682247 1.1391101
## 7 CDC_Imax_AGL 0.8403247 0.6282010 0.7624469 0.9632586
## 8 CDC_Impower_AGL 0.7618107 0.7807035 0.7920685 1.0158991
## 9 CDC_KR.1_AGL 0.9069573 0.5858560 1.0316843 0.9689623
## 10 CDC_LeMay_AGL 0.6589304 1.0923155 0.6026867 0.6748768
Common Errors
There are a number of common errors one might encounter when attempting to do GWAS with GAPIT. We will go through a few of these:
Incorect data uploading
myG <- read.csv("Genotype_Data.csv", header = T)
when
it should be header = F
Notice how the first row of our genotype data is actually the column names.
Numeric Phenotype Data
For our qualitative trait Testa_Pattern, we needed to convert
Absent
to1
Present
to2
Genotype Names
Genotype names not matching between the phenotype file and genotype
file is an easy error to miss. R
does not like column names
that begin with a number. E.g., the genotype
3156-11_AGL
should have its named changed to
X3156-11_AGL
. Another common problem can come from dashses
(-
) and periods (.
), E.g..
CDC_QG.1_AGL
or CDC_QG-1_AGL
, or spaces
() and underscores (
_
), E.g.,
CDC Asterix AGL
or CDC_Asterix_AGL
.
## [1] "CDC Asterix AGL" "3156-11_AGL" "CDC_QG-1_AGL"
## [1] "rs" "alleles" "chrom" "pos" "strand" "assembly" "center" "protLSID"
## [9] "assayLSID" "panel" "QCcode" "X3156.11_AGL" "CDC_Asterix_AGL" "CDC_QG.1_AGL"
myY <- myY %>%
mutate(Name = plyr::mapvalues(Name,
c("3156-11_AGL", "CDC Asterix AGL", "CDC_QG-1_AGL"),
c("X3156.11_AGL", "CDC_Asterix_AGL", "CDC_QG.1_AGL")))
## character(0)
## [1] "rs" "alleles" "chrom" "pos" "strand" "assembly" "center" "protLSID" "assayLSID" "panel" "QCcode"
Run GWAS with GAPIT
Install GAPIT
GWAS Models
GLM
General Linear Model: uses population structure (Q) as a cofactor.MLM
Mixed Linear Model: adds Kinship (K) as a random effect. AKA a Q+K model.CMLM
Compressed Mixed Linear Model: reduces the confounding effect between the testing markers and the individuals genetic effects by replacing them with their corresponding groups based on cluster analysis defined by the user. MLM is the equivelent of each individual being its own group, and GLM is the equivlent of having only one group for all individuals.MLMM
Multiple Locus Mixed Linear Model: uses forward-backward stepwise linear mixed-model regression to include associated markers as covariates.SUPER
Settlement of MLM Under Progressively Exclusive Relationship: uses a FaST-LMM algorithm to solve a computational burden for large samples. K is derived from a smaller number of associated markers.FarmCPU
Fixed and random model Circulating Probability Unification: uses iterations to fit associated markers selected using a maximum likelhood methodas cofactors.BLINK
BLINK: similar to FarmCPU but eliminates the assumption of causal genes are evenly distributed accross the genome. Markers that are in LD with the most significant marker are excluded. Also uses a different method to determine markers for K (BIC).
Run GWAS
For our purposes we will run GWAS with the following models:
GLM
, MLM
, MLMM
,
FarmCPU
, BLINK
.
dir.create("GWAS_Results/")
setwd("GWAS_Results/")
# MLM
GWAS <- GAPIT( Y = myY, G = myG, PCA.total = 4, model = "MLM")
# MLMM
GWAS <- GAPIT( Y = myY, G = myG, PCA.total = 4, model = "MLMM")
# FarmCPU
GWAS <- GAPIT( Y = myY, G = myG, PCA.total = 4, model = "FarmCPU")
# BLINK
GWAS <- GAPIT( Y = myY, G = myG, PCA.total = 4, model = "BLINK")
# GLM
GWAS <- GAPIT( Y = myY, G = myG, PCA.total = 4, model = "GLM")
or
GWAS <- GAPIT(
Y = myY,
G = myG,
PCA.total = 4,
model = c("GLM"", MLM", "MLMM", "FarmCPU", "BLINK") )
PCA.total = ?
why set PCA.total = 4
? Principle components of the
marker data are used to account for population structure. By plotting
our eigenvalues by the number of principle components we can use the
“kink method” to choose how many PCs to use in our GWAS. From PC4 to
PC5, there is little change in the eigenvalues, meaning that additional
PCs will not account for more of the variance within our data (i.e.,
they are unnecessary).
However, in our case, we are using an output (the PCA of our marker
data) from the GWAS to make this decision. If this is your first time
running a GWAS, setting PCA.total = 3
is good practice, as
your GWAS can be easily rerun if you later determine you need to adjust
settings.
xx <- read.csv("GWAS_Results/GAPIT.Genotype.PCA_eigenvalues.csv") %>% slice(1:10) %>%
mutate(y = x, x = 1:10)
mp <- ggplot(xx, aes(x = x, y = y)) +
geom_line(size = 1) +
geom_point(size = 3, pch = 21, fill = "darkgreen", alpha = 0.7) +
geom_point(data = xx %>% slice(4), size = 3, pch = 21, fill = "darkred") +
scale_x_continuous(breaks = 1:10) +
theme_gwaspr() +
labs(x = "PC", y = "Eigen Values", caption = myCaption)
ggsave("Figures_02_Eigen_Values.png", width = 6, height = 3)
User-inputted Kinship and Covariates
We will rerun GWAS on the DTF traits using the b coefficient from a photothermal model derived from Wright et al. (2020). This should act to remove or diminish temperature related associations, and thus emphasize photoperiod related associations.
## Name b
## 1 CDC_Asterix_AGL 3.372
## 2 CDC_Rosie_AGL 3.525
## 3 X3156.11_AGL 3.563
## 4 CDC_Greenstar_AGL 4.293
## 5 CDC_Cherie_AGL 3.948
## 6 CDC_Glamis_AGL 3.825
## 7 CDC_Gold_AGL 4.147
## 8 CDC_Imax_AGL 2.798
## 9 CDC_Impower_AGL 3.828
## 10 CDC_KR.1_AGL 5.881
## 11 CDC_LeMay_AGL 3.547
## 12 CDC_Maxim_AGL 3.722
## 13 CDC_QG.1_AGL 4.071
## 14 CDC_Red_Rider_AGL 3.866
## 15 CDC_Redcoat_AGL 2.813
## 16 CDC_Redwing_AGL 2.843
## 17 CDC_Robin_AGL 3.978
## 18 CDC_Rosebud_AGL 5.959
## 19 CDC_Rosetown_AGL 3.348
## 20 CDC_Rouleau_AGL 3.347
dir.create("GWAS_Results_CV_b/")
setwd("GWAS_Results_CV_b/")
# set PCA.total = 0
# MLM
GWAS <- GAPIT(Y = myY[,c("Name","DTF_Sask_2017")], model = "MLM",
G = myG, K = myK, CV = myCV, PCA.total = 0)
# MLMM
GWAS <- GAPIT(Y = myY[,c("Name","DTF_Sask_2017")], model = "MLMM",
G = myG, K = myK, CV = myCV, PCA.total = 0)
# FarmCPU
GWAS <- GAPIT(Y = myY[,c("Name","DTF_Sask_2017")], model = "FarmCPU",
G = myG, K = myK, CV = myCV, PCA.total = 0)
# BLINK
GWAS <- GAPIT(Y = myY[,c("Name","DTF_Sask_2017")], model = "BLINK",
G = myG, K = myK, CV = myCV, PCA.total = 0)
# GLM
GWAS <- GAPIT(Y = myY[,c("Name","DTF_Sask_2017")], model = "GLM",
G = myG, K = myK, CV = myCV, PCA.total = 0)
Post GWAS Preparation
After running a genome-wide association study, there are some quality control and post-GWAS steps we can take to evaluate our results and identify false positives.
Prepare Data
Convert our Testa_Pattern
back to its character
values.
myY <- myY %>%
mutate(Testa_Pattern = plyr::mapvalues(Testa_Pattern, c(1,2), c("Absent","Present")))
Reread in our myG
file but with
header = T
.
Create table for DNA codes.
dna <- data.frame(stringsAsFactors = F,
Symbol = c("A", "C", "G", "T", "U", "R", "Y", "S", "W", "K", "M", "N"),
Value = c("AA","CC","GG","TT","UU","AG","CT","GC","AT","GT","AC","NN") )
knitr::kable(t(dna))
Symbol | A | C | G | T | U | R | Y | S | W | K | M | N |
Value | AA | CC | GG | TT | UU | AG | CT | GC | AT | GT | AC | NN |
Significance threshold
- suggestive association = \(p<5*10^{-6}=0.000005\) -> -log10(0.000005) = 5.3
- bonferroni threshold = \(p<5*10^{-8}=0.00000005\) -> -log10(0.00000005) = 7.3
- custom threshold = \(p<0.05/n(markers)=0.05/276984=0.0000001805158\) -> -log10(0.05/276984) = 6.7
Calculate significance threshold
## [1] 5.30103
## [1] 7.30103
## [1] 6.728912
## [1] 7.427882
Testa Pattern
Since this is a simpler qualitative trait, we can assume there should
be 1 large effect QTL. Because of this, we can also assume that the
MLMM
FarmCPU
and BLINK
methods
should produce a number of false positives.
Manhattan Plots
mp <- gg_Manhattan(folder = "GWAS_Results/",
trait = "Testa_Pattern",
markers = c("Lcu.2RBY.Chr6p12212845",
"Lcu.2RBY.Chr6p12192948",
"Lcu.2RBY.Chr6p14410759"),
labels = c("845","948","759"),
vline.colors = rep("green", 3),
facet = F, vline.legend = T,
models = c("MLM","MLMM","FarmCPU","BLINK") ) %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_Testa_Pattern_01_01.png", mp, width = 12, height = 4, bg = "white")
mp <- gg_Manhattan(folder = "GWAS_Results/",
trait = "Testa_Pattern",
markers = c("Lcu.2RBY.Chr6p12212845",
"Lcu.2RBY.Chr6p12192948",
"Lcu.2RBY.Chr6p14410759"),
labels = c("12212845","12192948","14410759"),
vline.colors = rep("green", 3),
facet = T, vline.legend = T) %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_Testa_Pattern_02_01.png", mp, width = 12, height = 10, bg = "white")
The MLM method appears to be the most appropriate with the lowest amount of potential false positives. The Additive model produced one very strong association peak, while the Dominant Model produced two.
mp <- gg_Manhattan(folder = "GWAS_Results/",
trait = "Testa_Pattern",
title = "Testa_Pattern - A qualatative trait",
markers = c("Lcu.2RBY.Chr6p12213490",
"Lcu.2RBY.Ch6p12212845",
"Lcu.2RBY.Chr2p383533077"),
labels = c("12192948", "12212845", "383533077"),
vline.colors = rep("green", 3),
facet = T, vline.legend = T,
models = "MLM") %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_Testa_Pattern_03_01.png", mp, width = 12, height = 4, bg = "white")
Markers
list_Top_Markers(trait = "Testa_Pattern", model = "FarmCPU",
folder = "GWAS_Results/", chroms = 6, n = 1)
## SNP Chr Pos -log10(p)
## 1 Lcu.2RBY.Chr6p12192948 6 12192948 26.07
list_Top_Markers(trait = "Testa_Pattern", model = "BLINK",
folder = "GWAS_Results/", chroms = 6, n = 1)
## SNP Chr Pos -log10(p)
## 1 Lcu.2RBY.Chr6p14410759 6 14410759 30.02
list_Top_Markers(trait = "Testa_Pattern", model = "MLM",
folder = "GWAS_Results/", chroms = c(2, 6), n = 1)
## SNP Chr Pos -log10(p)
## 1 Lcu.2RBY.Chr2p383533077 2 383533077 7.53
## 2 Lcu.2RBY.Chr6p12212845 6 12212845 11.55
mp <- gg_Marker_Bar(myG = myG, myY = myY, trait = "Testa_Pattern",
markers = "Lcu.2RBY.Chr6p12213490") +
labs(caption = myCaption)
ggsave("Figures_Testa_Pattern_04_01.png", mp, width = 6, height = 4)
The peak on Chr 2 is a false positive.
mp <- gg_Marker_Bar(myG = myG, myY = myY, trait = "Testa_Pattern",
markers = "Lcu.2RBY.Chr2p383533077") +
labs(caption = myCaption)
ggsave("Figures_Testa_Pattern_04_02.png", mp, width = 6, height = 4)
Chr 6 Association
mp <- gg_Manhattan_Zoom(folder = "GWAS_Results/",
trait = "Testa_Pattern",
chr = 6,
markers = "Lcu.2RBY.Chr6p12192948",
labels = "12192948",
vline.color = "green",
start = 12192948 - 3000000,
end = 12192948 + 3000000) +
labs(caption = myCaption)
ggsave("Figures_Testa_Pattern_05_01.png", mp, width = 6, height = 10)
DTF Nepal 2017
Since Days To Flower (DTF) is a quantitative trait, we can expect many smaller effect QTLs.
Manhattan Plots
mp <- gg_Manhattan(folder = "GWAS_Results/",
trait = "DTF_Nepal_2017",
markers = c("Lcu.2RBY.Chr2p42543877",
"Lcu.2RBY.Chr5p1069654"),
labels = c("877", "654"),
vline.colors = rep("red", 2),
facet = F, vline.legend = T,
models = c("MLM","MLMM","FarmCPU","BLINK")) %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_DTF_Nepal_2017_01_01.png", mp, width = 12, height = 4, bg = "white")
mp <- gg_Manhattan(folder = "GWAS_Results/",
trait = "DTF_Nepal_2017",
markers = c("Lcu.2RBY.Chr2p42543877",
"Lcu.2RBY.Chr5p1069654"),
labels = c("42543877", "1069654"),
vline.colors = rep("red", 2),
facet = T, vline.legend = T) %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_DTF_Nepal_2017_02_01.png", mp, width = 12, height = 10, bg = "white")
mp <- gg_Manhattan(folder = "GWAS_Results/",
trait = "DTF_Nepal_2017",
markers = c("Lcu.2RBY.Chr2p42543877",
"Lcu.2RBY.Chr5p1069654"),
labels = c("877", "654"),
vline.colors = rep("red", 2),
facet = T,
models = "MLM") %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_DTF_Nepal_2017_03_01.png", mp, width = 12, height = 4, bg = "white")
Markers
2 strong association peaks can be identified on chromosomes 2 and 5.
list_Top_Markers(trait = "DTF_Nepal_2017",
model = "MLM",
folder = "GWAS_Results/",
threshold = 6.7,
chroms = c(2,5), n = 2)
## SNP Chr Pos -log10(p)
## 1 Lcu.2RBY.Chr2p42556949 2 42556949 6.90
## 2 Lcu.2RBY.Chr2p42543877 2 42543877 7.12
## 3 Lcu.2RBY.Chr5p1061938 5 1061938 10.18
## 4 Lcu.2RBY.Chr5p1069654 5 1069654 10.39
mp <- gg_Marker_Box(myG = myG, myY = myY,
trait = "DTF_Nepal_2017",
markers = "Lcu.2RBY.Chr2p42543877") +
labs(caption = myCaption)
ggsave("Figures_DTF_Nepal_2017_04_01.png", mp, width = 6, height = 4)
mp <- gg_Marker_Box(myG = myG, myY = myY,
trait = "DTF_Nepal_2017",
markers = "Lcu.2RBY.Chr5p1069654") +
labs(caption = myCaption)
ggsave("Figures_DTF_Nepal_2017_04_02.png", mp, width = 6, height = 4)
Now lets plot both markers
mp <- gg_Marker_Box(myG = myG, myY = myY, trait = "DTF_Nepal_2017",
markers = c("Lcu.2RBY.Chr2p42543877", "Lcu.2RBY.Chr5p1069654")) +
labs(caption = myCaption)
ggsave("Figures_DTF_Nepal_2017_04_03.png", mp, width = 6, height = 4)
Chr 2 Association
mp <- gg_Manhattan_Zoom(folder = "GWAS_Results/",
trait = "DTF_Nepal_2017",
chr = 2,
markers = "Lcu.2RBY.Chr2p42543877",
labels = "42543877",
vline.color = "red",
start = 42543877 - 4000000,
end = 42543877 + 4000000) +
labs(caption = myCaption)
ggsave("Figures_DTF_Nepal_2017_05_01.png", mp, width = 6, height = 10)
Chr 5 Association
mp <- gg_Manhattan_Zoom(folder = "GWAS_Results/",
trait = "DTF_Nepal_2017",
chr = 5,
markers = "Lcu.2RBY.Chr5p1069654",
labels = "1063138",
vline.color = "red",
start = 1063138 - 2000000,
end = 1063138 + 2000000) +
labs(caption = myCaption)
ggsave("Figures_DTF_Nepal_2017_05_02.png", mp, width = 6, height = 10)
DTF Sask 2017
Saskatchewan has drastically different temperature and photoperiods than Nepal, so we should expect different associations.
Manhattan Plots
mp <- gg_Manhattan(folder = "GWAS_Results/",
trait = "DTF_Sask_2017",
markers = c("Lcu.2RBY.Chr2p46908331",
"Lcu.2RBY.Chr6p2528817"),
labels = c("331", "817"),
vline.colors = c("red","blue"),
facet = F, vline.legend = T,
models = c("MLM","MLMM","FarmCPU","BLINK")) %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_DTF_Sask_2017_01_01.png", mp, width = 12, height = 4, bg = "white")
mp <- gg_Manhattan(folder = "GWAS_Results_CV_b/",
trait = "DTF_Sask_2017",
title = "DTF_Sask_2017 | CV = b",
markers = c("Lcu.2RBY.Chr2p46908331",
"Lcu.2RBY.Chr6p2528817"),
labels = c("331", "817"),
vline.colors = c("red","blue"),
facet = F, vline.legend = T,
models = c("MLM","MLMM","FarmCPU","BLINK")) %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_DTF_Sask_2017_01_02.png", mp, width = 12, height = 4, bg = "white")
mp <- gg_Manhattan(folder = "GWAS_Results/",
trait = "DTF_Sask_2017",
markers = c("Lcu.2RBY.Chr2p46908331",
"Lcu.2RBY.Chr6p2528817"),
labels = c("46908331", "2528817"),
vline.colors = c("red","blue"),
facet = T, vline.legend = T) %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_DTF_Sask_2017_02_01.png", mp, width = 12, height = 10, bg = "white")
mp <- gg_Manhattan(folder = "GWAS_Results_CV_b/",
trait = "DTF_Sask_2017",
title = "DTF_Sask_2017 | CV = b",
markers = c("Lcu.2RBY.Chr2p46908331",
"Lcu.2RBY.Chr6p2528817"),
labels = c("46908331", "2528817"),
vline.colors = c("red","blue"),
facet = T, vline.legend = T) %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_DTF_Sask_2017_02_02.png", mp, width = 12, height = 10, bg = "white")
mp <- gg_Manhattan(folder = "GWAS_Results/",
trait = "DTF_Sask_2017",
markers = c("Lcu.2RBY.Chr2p46908331",
"Lcu.2RBY.Chr6p2528817"),
labels = c("46908331", "2528817"),
vline.colors = c("red","blue"),
facet = T, vline.legend = T,
models = "MLMM") %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_DTF_Sask_2017_03_01.png", mp, width = 12, height = 4, bg = "white")
mp <- gg_Manhattan(folder = "GWAS_Results_CV_b/",
trait = "DTF_Sask_2017",
title = "DTF_Sask_2017 | CV = b",
markers = c("Lcu.2RBY.Chr2p46908331",
"Lcu.2RBY.Chr6p2528817"),
labels = c("46908331", "2528817"),
vline.colors = c("red","blue"),
facet = T, vline.legend = T,
models = "FarmCPU") %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_DTF_Sask_2017_03_02.png", mp, width = 12, height = 4, bg = "white")
Markers
list_Top_Markers(trait = "DTF_Sask_2017", model = "FarmCPU",
folder = "GWAS_Results/", n = 1, threshold = 6.7)
## SNP Chr Pos -log10(p)
## 1 Lcu.2RBY.Chr2p46908331 2 46908331 7.29
## 2 Lcu.2RBY.Chr3p414296678 3 414296678 8.06
## 3 Lcu.2RBY.Chr4p389008631 4 389008631 9.83
## 4 Lcu.2RBY.Chr5p165493259 5 165493259 7.06
## 5 Lcu.2RBY.Chr7p832815 7 832815 9.19
list_Top_Markers(trait = "DTF_Sask_2017", model = "MLM",
folder = "GWAS_Results/", chroms = 6, n = 1)
## SNP Chr Pos -log10(p)
## 1 Lcu.2RBY.Chr6p2528817 6 2528817 5.48
mp <- gg_Marker_Box(myG = myG, myY = myY, trait = "DTF_Sask_2017",
markers = "Lcu.2RBY.Chr2p46908331") +
labs(caption = myCaption)
ggsave("Figures_DTF_Sask_2017_04_01.png", mp, width = 6, height = 4)
mp <- gg_Marker_Box(myG = myG, myY = myY, trait = "DTF_Sask_2017",
markers = "Lcu.2RBY.Chr6p2528817") +
labs(caption = myCaption)
ggsave("Figures_DTF_Sask_2017_04_02.png", mp, width = 6, height = 4)
Chr 6 Association
mp <- gg_Manhattan_Zoom(folder = "GWAS_Results_CV_b/",
trait = "DTF_Sask_2017",
title = "DTF_Sask_2017 | CV = b",
chr = 6,
markers = "Lcu.2RBY.Chr6p2528817",
labels = "2528817",
vline.color = "blue",
start = 2528817 - 2000000,
end = 2528817 + 2000000) +
labs(caption = myCaption)
ggsave("Figures_DTF_Sask_2017_05_01.png", mp, width = 6, height = 10)
Volcano Plot
mp <- gg_Volcano(folder = "GWAS_Results/", trait = "DTF_Sask_2017",
markers = "Lcu.2RBY.Chr6p2528817",
labels = "Chr6p2528817") +
labs(caption = myCaption)
ggsave("Figures_DTF_Sask_2017_06_01.png", mp, width = 12, height = 4)
mp <- gg_Volcano(folder = "GWAS_Results_CV_b/", trait = "DTF_Sask_2017",
markers = "Lcu.2RBY.Chr6p2528817",
labels = "Chr6p2528817",
title = "DTF_Sask_2017 | CV = b") +
labs(caption = myCaption)
ggsave("Figures_DTF_Sask_2017_06_02.png", mp, width = 12, height = 4)
Heterozygosity
# For each marker
myGD <- myG
myGD$Major <- substr(myG$alleles, 1, 1)
myGD$Minor <- substr(myG$alleles, 3, 3)
#
myG_MAF <- function(x) { sum(x %in% x[1], na.rm = T)-1 }
myG_Het <- function(x) { sum(x %in% c("R","Y","S","W","K","M"), na.rm = T) }
myG_Homo <- function(x) { sum(x %in% c("A", "C", "G", "T"), na.rm = T) }
myG_N <- function(x) { sum(x %in% "N", na.rm = T) }
myGD <- myGD %>%
mutate(numMajor = apply(myGD[,c("Major", as.character(myY$Name))], 1, myG_MAF),
numMinor = apply(myGD[,c("Minor", as.character(myY$Name))], 1, myG_MAF),
numN = apply(myGD[,as.character(myY$Name)], 1, myG_N),
numHet = apply(myGD[,as.character(myY$Name)], 1, myG_Het),
numHomo = apply(myGD[,as.character(myY$Name)], 1, myG_Homo),
#
Het = numHet / (numHet + numHomo),
MAF = (numMinor*2 + numHet) / (numMinor*2 + numMajor*2 + numHet*2),
MCF = numN / (numHet + numHomo))
myGD <- myGD %>% select(rs, alleles, Major, Minor, chrom, pos, numMajor, numMinor,
numHet, numN, numHomo, Het, MAF, MCF)
write.csv(myGD, "myG_Details.csv", row.names = F)
# Plot
mp1 <- ggplot(myGD, aes(x = Het * 100)) +
geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
theme_gwaspr() +
labs(title = "Marker Details",
subtitle = "Heterozygosity", x = NULL, y = NULL)
mp2 <- ggplot(myGD, aes(x = MAF * 100)) +
geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
theme_gwaspr() +
labs(subtitle = "Minor Allele Frequency", x = "Percent", y = NULL)
mp3 <- ggplot(myGD, aes(x = MCF * 100)) +
geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
theme_gwaspr() +
labs(subtitle = "Missing Call Frequency", x = NULL, y = NULL)
mp <- ggpubr::ggarrange(mp1, mp2, mp3, ncol = 3, align = "h") %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_03_myG_Details.png", mp, width = 12, height = 4)
# For each genotype
myYD <- myY %>% mutate(Het = NA, MAF = NA, MCF = NA)
#i<-12
for(i in (12:ncol(myG))) {
xx <- myG[,i]
numHet <- sum(xx %in% c("R","Y","S","W","K","M"), na.rm = T)
numHomo <- sum(xx %in% c("A", "C", "G", "T"), na.rm = T)
numMiss <- sum(xx %in% "N", na.rm = T)
numMinor <- sum(xx == myGD$Minor, na.rm = T)
numMajor <- sum(xx == myGD$Major, na.rm = T)
#
myYD$Het[myYD$Name == colnames(myG)[i]] <- numHet / (numHet + numHomo)
myYD$MAF[myYD$Name == colnames(myG)[i]] <- (numMinor*2 + numHet) / (numMinor*2 + numMajor*2 + numHet*2)
myYD$MCF[myYD$Name == colnames(myG)[i]] <- numMiss / (numHet + numHomo + numMiss)
}
write.csv(myYD, "myY_Details.csv", row.names = F)
# Plot
mp1 <- ggplot(myYD, aes(x = Het * 100)) +
geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
theme_gwaspr() +
labs(title = "Genotype Details",
subtitle = "Heterozygosity", x = NULL, y = NULL)
mp2 <- ggplot(myYD, aes(x = MAF * 100)) +
geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
theme_gwaspr() +
labs(subtitle = "Minor Allele Frequency", x = "Percent", y = NULL)
mp3 <- ggplot(myYD, aes(x = MCF * 100)) +
geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
theme_gwaspr() +
labs(subtitle = "Missing Call Frequency", x = NULL, y = NULL)
mp <- ggpubr::ggarrange(mp1, mp2, mp3, ncol = 3, align = "h") %>%
annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
fig.lab = myCaption)
ggsave("Figures_04_myY_Details.png", mp, width = 12, height = 4)
Minor Allele Frequency
xx <- read.csv("GWAS_Results/GAPIT.Association.GWAS_Results.FarmCPU.Testa_Pattern.csv")
mp <- ggplot(xx, aes(x = MAF, y = -log10(P.value))) +
geom_hline(yintercept = -log10(0.05 / (nrow(myG)-1)),
color = "red", alpha = 0.7) +
geom_point(pch = 21, color = "black", fill = "darkgreen", alpha = 0.5) +
facet_grid("FarmCPU" ~ .) +
theme_gwaspr() +
labs(title = "Testa_Pattern", y = "-log10(p)", caption = myCaption)
ggsave("Figures_05_MAF_Testa_Pattern.png", mp, width = 6, height = 4)
xx <- read.csv("GWAS_Results/GAPIT.Association.GWAS_Results.FarmCPU.DTF_Nepal_2017.csv")
mp <- ggplot(xx, aes(x = MAF, y = -log10(P.value))) +
geom_hline(yintercept = -log10(0.05 / (nrow(myG)-1)),
color = "red", alpha = 0.7) +
geom_point(pch = 21, color = "black", fill = "darkgreen", alpha = 0.5) +
facet_grid("FarmCPU" ~ .) +
theme_gwaspr() +
labs(title = "DTF_Nepal_2017", y = "-log10(p)", caption = myCaption)
ggsave("Figures_05_MAF_DTF_Nepal_2017.png", mp, width = 6, height = 4)
xx <- read.csv("GWAS_Results_CV_b/GAPIT.Association.GWAS_Results.MLM.DTF_Sask_2017.csv")
mp <- ggplot(xx, aes(x = MAF, y = -log10(P.value))) +
geom_hline(yintercept = -log10(0.05 / (nrow(myG)-1)),
color = "red", alpha = 0.7) +
geom_point(pch = 21, color = "black", fill = "darkgreen", alpha = 0.5) +
facet_grid("MLM" ~ .) +
theme_gwaspr() +
labs(title = "DTF_Sask_2017", y = "-log10(p)", caption = myCaption)
ggsave("Figures_05_MAF_DTF_Sask_2017.png", mp, width = 6, height = 4)
xx <- read.csv("GWAS_Results_CV_b/GAPIT.Association.GWAS_Results.MLM.DTF_Sask_2017.csv")
mp <- ggplot(xx, aes(x = MAF, y = -log10(P.value))) +
geom_hline(yintercept = -log10(0.05 / (nrow(myG)-1)),
color = "red", alpha = 0.7) +
geom_point(pch = 21, color = "black", fill = "darkgreen", alpha = 0.5) +
facet_grid("MLMM" ~ .) +
theme_gwaspr() +
labs(title = "DTF_Sask_2017_CV_b", y = "-log10(p)", caption = myCaption)
ggsave("Figures_05_MAF_DTF_Sask_2017_CV_b.png", mp, width = 6, height = 4)
PCA
library(plotly)
library(htmlwidgets)
xx <- read.csv("GWAS_Results/GAPIT.Genotype.PCA.csv") %>%
left_join(myY, by = c("taxa"="Name"))
Testa Pattern
https://derekmichaelwright.github.io/dblogr/academic/gwas_tutorial/Figures_PCA_Testa_Pattern.html
myColors <- c("darkgoldenrod3", "darkblue")
mp <- plot_ly(xx, x = ~PC1, y = ~PC2, z = ~PC3, text = ~taxa,
color = ~Testa_Pattern, colors = myColors) %>%
add_markers() %>%
layout(scene = list(title = "Testa Pattern"))
saveWidget(as_widget(mp), "Figures_PCA_Testa_Pattern.html")
mp
DTF Nepal 2017
https://derekmichaelwright.github.io/dblogr/academic/gwas_tutorial/Figures_PCA_DTF_Nepal_2017.html
myColors <- c("darkred", "darkgoldenrod3", "darkblue")
mp <- plot_ly(xx, x = ~PC1, y = ~PC2, z = ~PC3, text = ~taxa,
color = ~DTF_Nepal_2017, colors = myColors) %>%
add_markers() %>%
layout(scene = list(title = "DTF Nepal 2017"))
saveWidget(as_widget(mp), "Figures_PCA_DTF_Nepal_2017.html")
mp
DTF Sask 2017
https://derekmichaelwright.github.io/dblogr/academic/gwas_tutorial/Figures_PCA_DTF_Sask_2017.html
mp <- plot_ly(xx, x = ~PC1, y = ~PC2, z = ~PC3, text = ~taxa,
color = ~DTF_Sask_2017, colors = myColors) %>%
add_markers() %>%
layout(scene = list(title = "DTF Sask 2017"))
saveWidget(as_widget(mp), "Figures_PCA_DTF_Sask_2017.html")
mp
SNP Density Plot
Package Source: https://github.com/YinLiLin/R-CMplot
library(CMplot)
xx <- myG %>% select(rs, chrom, pos)
CMplot(xx, plot.type="d", bin.size=1e7,
chr.den.col=c("darkgreen", "darkgoldenrod3", "darkred"),
file="jpg", file.output=T, verbose=F, width=9, height=6,
file.name = "LDP")