dblogr.com/

GWAS Tutorial with gwaspr

An R tutorial on running genome-wide association studies (GWAS) with GAPIT and gwaspr


Introduction

This vignette is a tutorial on running Genome-Wide Association Studies (GWAS) in with the packages GAPIT and plotting the results with gwaspr.


Data

  • myY.csv: Phenotype file
  • myG_hmp.csv: Genotype file in hapmap format
  • myCV.csv: (Optional) Covariate file
  • myM.csv: (Optional) Metadata file for the genotypes.

All phenotype and genotype data is derived from and available at: https://knowpulse.usask.ca/

Associated papers:

For this tutorial, we will use a Lentil Diversity Panel (LDP) consisting of 324 Lens culinaris cultivars, representative of the global genetic diversity of cultivated lentils.

Origins of the Lentil Diversity Panel (LDP).
Origins of the Lentil Diversity Panel (LDP).

Prepare Data

# devtools::install_github("derekmichaelwright/gwaspr")
library(gwaspr)
myCaption <- "www.dblogr.com/ or derekmichaelwright.github.io/dblogr/ | Data: AGILE"

Our population (The LDP) has been phenotyped for three traits:

  • Testa_Pattern: a qualitative trait describing the presence or absence of seed coat pigmentation.
  • DTF_Nepal_2017: a quantitative trait describing days from sowing to flowering in a 2017 Nepal field trial.
  • DTF_Sask_2017: a quantitative trait describing days from sowing to flowering in a 2017 Saskatchewan field trial.
myY <- read.csv("myY.csv")
DT::datatable(myY)

First we must convert our qualitative trait Testa_Pattern from character to numeric format.

myY <- myY %>% 
  mutate(Testa_Pattern = plyr::mapvalues(Testa_Pattern, 
                                         c("Absent", "Present"), 1:2),
         Testa_Pattern = as.numeric(Testa_Pattern))
DT::datatable(myY)

# Prep data
xx <- myY %>%
  gather(Trait, Value, 2:ncol(.)) %>%
  mutate(Trait = factor(Trait, levels = unique(Trait)))
# Plot
mp <- ggplot(xx, aes(x = Value)) + 
  geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) + 
  facet_wrap(. ~ Trait, ncol = 3, scales = "free") +
  theme_gwaspr() +
  labs(caption = myCaption)
ggsave("Figures_01_myY_Phenotypes.png", mp, width = 8, height = 3)

GWAS Results


Genotype Data

The LDP was genotyped using an exome caputure array and filtered with the following criteria to yield a data set of 276,984 SNPs, stored in our object myG.

  • Only Bi-allelic = TRUE
  • Minimum Read Depth = 3
  • Minor Allele Frequency = >5%
  • Maximum Missing Frequency = 20%
  • Heterozygous = <20%
myG <- read.csv("myG_hmp.csv", header = F)
myG[1:10,1:11] # Map + marker Info
##                     V1      V2    V3    V4     V5       V6     V7       V8        V9   V10    V11
## 1                   rs alleles chrom   pos strand assembly center protLSID assayLSID panel QCcode
## 2  Lcu.2RBY.Chr1p58402     A/G     1 58402   <NA>     <NA>   <NA>     <NA>      <NA>  <NA>   <NA>
## 3  Lcu.2RBY.Chr1p58637     G/A     1 58637   <NA>     <NA>   <NA>     <NA>      <NA>  <NA>   <NA>
## 4  Lcu.2RBY.Chr1p58679     A/C     1 58679   <NA>     <NA>   <NA>     <NA>      <NA>  <NA>   <NA>
## 5  Lcu.2RBY.Chr1p58694     C/G     1 58694   <NA>     <NA>   <NA>     <NA>      <NA>  <NA>   <NA>
## 6  Lcu.2RBY.Chr1p75442     A/G     1 75442   <NA>     <NA>   <NA>     <NA>      <NA>  <NA>   <NA>
## 7  Lcu.2RBY.Chr1p75504     C/T     1 75504   <NA>     <NA>   <NA>     <NA>      <NA>  <NA>   <NA>
## 8  Lcu.2RBY.Chr1p75509     G/A     1 75509   <NA>     <NA>   <NA>     <NA>      <NA>  <NA>   <NA>
## 9  Lcu.2RBY.Chr1p75714     A/G     1 75714   <NA>     <NA>   <NA>     <NA>      <NA>  <NA>   <NA>
## 10 Lcu.2RBY.Chr1p76580     C/A     1 76580   <NA>     <NA>   <NA>     <NA>      <NA>  <NA>   <NA>
myG[1:10,12:17] # genotrype calls
##             V12             V13            V14            V15          V16               V17
## 1  X3156.11_AGL CDC_Asterix_AGL CDC_Cherie_AGL CDC_Glamis_AGL CDC_Gold_AGL CDC_Greenstar_AGL
## 2             G               A              G              A            G                 A
## 3             G               G              G              A            G                 G
## 4             C               C              C              C            C                 A
## 5             G               G              G              G            G                 C
## 6             G               G              N              G            G                 A
## 7             T               C              N              C            T                 C
## 8             G               G              N              A            G                 G
## 9             A               A              A              A            A                 A
## 10            M               C              M              C            M                 C

Covariates + Kinship

GAPIT will automatically calculate kinship matrices and performs PCA on the genotype data to be used as a covariate to account for population structure. However, custom Kinship matrices and covariates can be used as well. Just make sure they are formatted similar to the output files GAPIT uses for covariates and kinship.

GWAS_Results/GAPIT.Genotype.PCA.csv

GWAS_Results/GAPIT.Genotype.Kin_Zhang.csv

myCV <- read.csv("GWAS_Results/GAPIT.Genotype.PCA.csv")
myCV[1:10,]
##                 taxa       PC1       PC2        PC3        PC4
## 1       X3156.11_AGL -170.7836 101.24064 -3.0070146 -21.380547
## 2    CDC_Asterix_AGL -163.4179 159.07059  5.0423668 -50.317241
## 3     CDC_Cherie_AGL -154.5389  80.35912 -1.2489458  -9.462448
## 4     CDC_Glamis_AGL -179.1593 106.34175  5.0356748 -51.139585
## 5       CDC_Gold_AGL -144.5843 128.74179 20.5079527 -21.636310
## 6  CDC_Greenstar_AGL -184.4032  80.99638  2.3650361 -56.905163
## 7       CDC_Imax_AGL -167.9899  31.88485  4.5540258 -29.114839
## 8    CDC_Impower_AGL -165.3959  93.93274  1.2153189 -56.738656
## 9       CDC_KR.1_AGL -168.3731  54.97180 -1.8331581 -29.284140
## 10     CDC_LeMay_AGL -162.2348 183.66423 -0.1152152 -50.209390
myK <- read.csv("GWAS_Results/GAPIT.Genotype.Kin_Zhang.csv", header = F)
myK[1:10,1:5]
##                   V1        V2        V3        V4        V5
## 1       X3156.11_AGL 1.4251644 0.7143993 0.9062549 0.7950806
## 2    CDC_Asterix_AGL 0.7143993 1.4281958 0.6417335 0.6974319
## 3     CDC_Cherie_AGL 0.9062549 0.6417335 1.3657722 0.7897873
## 4     CDC_Glamis_AGL 0.7950806 0.6974319 0.7897873 1.4109575
## 5       CDC_Gold_AGL 0.7776084 0.6963957 0.7141069 1.1412520
## 6  CDC_Greenstar_AGL 0.8136061 0.6937040 0.7682247 1.1391101
## 7       CDC_Imax_AGL 0.8403247 0.6282010 0.7624469 0.9632586
## 8    CDC_Impower_AGL 0.7618107 0.7807035 0.7920685 1.0158991
## 9       CDC_KR.1_AGL 0.9069573 0.5858560 1.0316843 0.9689623
## 10     CDC_LeMay_AGL 0.6589304 1.0923155 0.6026867 0.6748768

Common Errors

There are a number of common errors one might encounter when attempting to do GWAS with GAPIT. We will go through a few of these:

Incorect data uploading

myG <- read.csv("Genotype_Data.csv", header = T) when it should be header = F

Notice how the first row of our genotype data is actually the column names.

Numeric Phenotype Data

For our qualitative trait Testa_Pattern, we needed to convert

  • Absent to 1
  • Present to 2

Genotype Names

Genotype names not matching between the phenotype file and genotype file is an easy error to miss. R does not like column names that begin with a number. E.g., the genotype 3156-11_AGL should have its named changed to X3156-11_AGL. Another common problem can come from dashses (-) and periods (.), E.g.. CDC_QG.1_AGL or CDC_QG-1_AGL, or spaces () and underscores (_), E.g., CDC Asterix AGL or CDC_Asterix_AGL.

setdiff(myY$Name, t(myG[1,])) # should be character(0)
## [1] "CDC Asterix AGL" "3156-11_AGL"     "CDC_QG-1_AGL"
setdiff(t(myG[1,]), myY$Name) # should show just the first 11 columns of myG
##  [1] "rs"              "alleles"         "chrom"           "pos"             "strand"          "assembly"        "center"         
##  [8] "protLSID"        "assayLSID"       "panel"           "QCcode"          "X3156.11_AGL"    "CDC_Asterix_AGL" "CDC_QG.1_AGL"
myY <- myY %>% 
  mutate(Name = plyr::mapvalues(Name, 
           c("3156-11_AGL",  "CDC Asterix AGL", "CDC_QG-1_AGL"), 
           c("X3156.11_AGL", "CDC_Asterix_AGL", "CDC_QG.1_AGL")))
setdiff(myY$Name, t(myG[1,])) # should be character(0)
## character(0)
setdiff(t(myG[1,]), myY$Name) # should show just the first 11 columns of myG
##  [1] "rs"        "alleles"   "chrom"     "pos"       "strand"    "assembly"  "center"    "protLSID"  "assayLSID" "panel"     "QCcode"

Run GWAS with GAPIT

Install GAPIT

Github: https://github.com/jiabowang/GAPIT3

devtools::install_github("jiabowang/GAPIT3", force=TRUE)

Load GAPIT

library(GAPIT3)

GWAS Models

  • GLM General Linear Model: uses population structure (Q) as a cofactor.
  • MLM Mixed Linear Model: adds Kinship (K) as a random effect. AKA a Q+K model.
  • CMLM Compressed Mixed Linear Model: reduces the confounding effect between the testing markers and the individuals genetic effects by replacing them with their corresponding groups based on cluster analysis defined by the user. MLM is the equivelent of each individual being its own group, and GLM is the equivlent of having only one group for all individuals.
  • MLMM Multiple Locus Mixed Linear Model: uses forward-backward stepwise linear mixed-model regression to include associated markers as covariates.
  • SUPER Settlement of MLM Under Progressively Exclusive Relationship: uses a FaST-LMM algorithm to solve a computational burden for large samples. K is derived from a smaller number of associated markers.
  • FarmCPU Fixed and random model Circulating Probability Unification: uses iterations to fit associated markers selected using a maximum likelhood methodas cofactors.
  • BLINK BLINK: similar to FarmCPU but eliminates the assumption of causal genes are evenly distributed accross the genome. Markers that are in LD with the most significant marker are excluded. Also uses a different method to determine markers for K (BIC).
Various GWAS models
Various GWAS models

Run GWAS

For our purposes we will run GWAS with the following models: GLM, MLM, MLMM, FarmCPU, BLINK.

dir.create("GWAS_Results/")
setwd("GWAS_Results/")
# MLM
GWAS <- GAPIT( Y = myY, G = myG, PCA.total = 4, model = "MLM")
# MLMM
GWAS <- GAPIT( Y = myY, G = myG, PCA.total = 4, model = "MLMM")
# FarmCPU
GWAS <- GAPIT( Y = myY, G = myG, PCA.total = 4, model = "FarmCPU")
# BLINK
GWAS <- GAPIT( Y = myY, G = myG, PCA.total = 4, model = "BLINK")
# GLM
GWAS <- GAPIT( Y = myY, G = myG, PCA.total = 4, model = "GLM")

or

GWAS <- GAPIT(
  Y = myY,
  G = myG,
  PCA.total = 4,
  model = c("GLM"", MLM", "MLMM", "FarmCPU", "BLINK") )

PCA.total = ?

why set PCA.total = 4? Principle components of the marker data are used to account for population structure. By plotting our eigenvalues by the number of principle components we can use the “kink method” to choose how many PCs to use in our GWAS. From PC4 to PC5, there is little change in the eigenvalues, meaning that additional PCs will not account for more of the variance within our data (i.e., they are unnecessary).

However, in our case, we are using an output (the PCA of our marker data) from the GWAS to make this decision. If this is your first time running a GWAS, setting PCA.total = 3 is good practice, as your GWAS can be easily rerun if you later determine you need to adjust settings.

xx <- read.csv("GWAS_Results/GAPIT.Genotype.PCA_eigenvalues.csv") %>% slice(1:10) %>%
  mutate(y = x, x = 1:10)
mp <- ggplot(xx,  aes(x = x, y = y)) + 
  geom_line(size = 1) + 
  geom_point(size = 3, pch = 21, fill = "darkgreen", alpha = 0.7) +
  geom_point(data = xx %>% slice(4), size = 3, pch = 21, fill = "darkred") +
  scale_x_continuous(breaks = 1:10) +
  theme_gwaspr() +
  labs(x = "PC", y = "Eigen Values", caption = myCaption)
ggsave("Figures_02_Eigen_Values.png", width = 6, height = 3)

User-inputted Kinship and Covariates

We will rerun GWAS on the DTF traits using the b coefficient from a photothermal model derived from Wright et al. (2020). This should act to remove or diminish temperature related associations, and thus emphasize photoperiod related associations.

myCV <- read.csv("myCV.csv")[,c("Name","b")] %>%
  mutate(b = round(b * 10000, 3))
myCV[1:20,]
##                 Name     b
## 1    CDC_Asterix_AGL 3.372
## 2      CDC_Rosie_AGL 3.525
## 3       X3156.11_AGL 3.563
## 4  CDC_Greenstar_AGL 4.293
## 5     CDC_Cherie_AGL 3.948
## 6     CDC_Glamis_AGL 3.825
## 7       CDC_Gold_AGL 4.147
## 8       CDC_Imax_AGL 2.798
## 9    CDC_Impower_AGL 3.828
## 10      CDC_KR.1_AGL 5.881
## 11     CDC_LeMay_AGL 3.547
## 12     CDC_Maxim_AGL 3.722
## 13      CDC_QG.1_AGL 4.071
## 14 CDC_Red_Rider_AGL 3.866
## 15   CDC_Redcoat_AGL 2.813
## 16   CDC_Redwing_AGL 2.843
## 17     CDC_Robin_AGL 3.978
## 18   CDC_Rosebud_AGL 5.959
## 19  CDC_Rosetown_AGL 3.348
## 20   CDC_Rouleau_AGL 3.347
dir.create("GWAS_Results_CV_b/")
setwd("GWAS_Results_CV_b/")
# set PCA.total = 0
# MLM
GWAS <- GAPIT(Y = myY[,c("Name","DTF_Sask_2017")], model = "MLM", 
              G = myG, K = myK, CV = myCV, PCA.total = 0)
# MLMM
GWAS <- GAPIT(Y = myY[,c("Name","DTF_Sask_2017")], model = "MLMM", 
              G = myG, K = myK, CV = myCV, PCA.total = 0)
# FarmCPU
GWAS <- GAPIT(Y = myY[,c("Name","DTF_Sask_2017")], model = "FarmCPU", 
              G = myG, K = myK, CV = myCV, PCA.total = 0)
# BLINK
GWAS <- GAPIT(Y = myY[,c("Name","DTF_Sask_2017")], model = "BLINK", 
              G = myG, K = myK, CV = myCV, PCA.total = 0)
# GLM
GWAS <- GAPIT(Y = myY[,c("Name","DTF_Sask_2017")], model = "GLM", 
              G = myG, K = myK, CV = myCV, PCA.total = 0)

Post GWAS Preparation

After running a genome-wide association study, there are some quality control and post-GWAS steps we can take to evaluate our results and identify false positives.


Prepare Data

Convert our Testa_Pattern back to its character values.

myY <- myY %>% 
  mutate(Testa_Pattern = plyr::mapvalues(Testa_Pattern, c(1,2), c("Absent","Present")))

Reread in our myG file but with header = T.

myG <- read.csv("myG_hmp.csv", header = T)

Create table for DNA codes.

dna <- data.frame(stringsAsFactors = F,
  Symbol = c("A", "C", "G", "T", "U", "R", "Y", "S", "W", "K", "M", "N"),
  Value  = c("AA","CC","GG","TT","UU","AG","CT","GC","AT","GT","AC","NN") )
knitr::kable(t(dna))
Symbol A C G T U R Y S W K M N
Value AA CC GG TT UU AG CT GC AT GT AC NN

Significance threshold

  • suggestive association = \(p<5*10^{-6}=0.000005\) -> -log10(0.000005) = 5.3
  • bonferroni threshold = \(p<5*10^{-8}=0.00000005\) -> -log10(0.00000005) = 7.3
  • custom threshold = \(p<0.05/n(markers)=0.05/276984=0.0000001805158\) -> -log10(0.05/276984) = 6.7

Calculate significance threshold

-log10(0.000005)
## [1] 5.30103
-log10(0.00000005)
## [1] 7.30103
-log10( 0.05 / (nrow(myG)-1) )
## [1] 6.728912
-log10( 0.01 / (nrow(myG)-1) )
## [1] 7.427882

Testa Pattern

Since this is a simpler qualitative trait, we can assume there should be 1 large effect QTL. Because of this, we can also assume that the MLMM FarmCPU and BLINK methods should produce a number of false positives.


Manhattan Plots

mp <- gg_Manhattan(folder = "GWAS_Results/", 
                   trait = "Testa_Pattern", 
                   markers = c("Lcu.2RBY.Chr6p12212845",
                               "Lcu.2RBY.Chr6p12192948",
                               "Lcu.2RBY.Chr6p14410759"),
                   labels = c("845","948","759"),
                   vline.colors = rep("green", 3),
                   facet = F, vline.legend = T,
                   models = c("MLM","MLMM","FarmCPU","BLINK") ) %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_Testa_Pattern_01_01.png", mp, width = 12, height = 4, bg = "white")

mp <- gg_Manhattan(folder = "GWAS_Results/", 
                   trait = "Testa_Pattern", 
                   markers = c("Lcu.2RBY.Chr6p12212845",
                               "Lcu.2RBY.Chr6p12192948",
                               "Lcu.2RBY.Chr6p14410759"),
                   labels = c("12212845","12192948","14410759"), 
                   vline.colors = rep("green", 3),
                   facet = T, vline.legend = T) %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_Testa_Pattern_02_01.png", mp, width = 12, height = 10, bg = "white")

The MLM method appears to be the most appropriate with the lowest amount of potential false positives. The Additive model produced one very strong association peak, while the Dominant Model produced two.


mp <- gg_Manhattan(folder = "GWAS_Results/", 
                   trait = "Testa_Pattern", 
                   title = "Testa_Pattern - A qualatative trait", 
                   markers = c("Lcu.2RBY.Chr6p12213490", 
                               "Lcu.2RBY.Ch6p12212845",
                               "Lcu.2RBY.Chr2p383533077"),
                   labels = c("12192948", "12212845", "383533077"),
                   vline.colors = rep("green", 3),
                   facet = T, vline.legend = T,
                   models = "MLM") %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_Testa_Pattern_03_01.png", mp, width = 12, height = 4, bg = "white")

Markers

list_Top_Markers(trait = "Testa_Pattern", model = "FarmCPU",
                 folder = "GWAS_Results/", chroms = 6, n = 1)
##                      SNP Chr      Pos -log10(p)
## 1 Lcu.2RBY.Chr6p12192948   6 12192948     26.07
list_Top_Markers(trait = "Testa_Pattern", model = "BLINK",
                 folder = "GWAS_Results/", chroms = 6, n = 1)
##                      SNP Chr      Pos -log10(p)
## 1 Lcu.2RBY.Chr6p14410759   6 14410759     30.02
list_Top_Markers(trait = "Testa_Pattern", model = "MLM",
                 folder = "GWAS_Results/", chroms = c(2, 6), n = 1)
##                       SNP Chr       Pos -log10(p)
## 1 Lcu.2RBY.Chr2p383533077   2 383533077      7.53
## 2  Lcu.2RBY.Chr6p12212845   6  12212845     11.55

mp <- gg_Marker_Bar(myG = myG, myY = myY, trait = "Testa_Pattern",
                markers = "Lcu.2RBY.Chr6p12213490") +
  labs(caption = myCaption)
ggsave("Figures_Testa_Pattern_04_01.png", mp, width = 6, height = 4)

The peak on Chr 2 is a false positive.

mp <- gg_Marker_Bar(myG = myG, myY = myY, trait = "Testa_Pattern",
                markers = "Lcu.2RBY.Chr2p383533077") +
  labs(caption = myCaption)
ggsave("Figures_Testa_Pattern_04_02.png", mp, width = 6, height = 4)

Chr 6 Association

mp <- gg_Manhattan_Zoom(folder = "GWAS_Results/", 
                        trait = "Testa_Pattern", 
                        chr = 6,
                        markers = "Lcu.2RBY.Chr6p12192948", 
                        labels = "12192948", 
                        vline.color = "green",
                        start = 12192948 - 3000000, 
                        end = 12192948 + 3000000) +
  labs(caption = myCaption)
ggsave("Figures_Testa_Pattern_05_01.png", mp, width = 6, height = 10)

Volcano Plot

mp <- gg_Volcano(folder = "GWAS_Results/", trait = "Testa_Pattern",
                 markers = c("Lcu.2RBY.Chr6p12192948", "Lcu.2RBY.Chr6p12212845"), 
                 labels = c("Chr6p12192948", "Chr6p12212845")) +
  labs(caption = myCaption)
ggsave("Figures_Testa_Pattern_06_01.png", mp, width = 12, height = 4)

DTF Nepal 2017

Since Days To Flower (DTF) is a quantitative trait, we can expect many smaller effect QTLs.

Manhattan Plots

mp <- gg_Manhattan(folder = "GWAS_Results/", 
                   trait = "DTF_Nepal_2017", 
                   markers = c("Lcu.2RBY.Chr2p42543877",
                               "Lcu.2RBY.Chr5p1069654"),
                   labels = c("877", "654"),
                   vline.colors = rep("red", 2),
                   facet = F, vline.legend = T,
                   models = c("MLM","MLMM","FarmCPU","BLINK")) %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_DTF_Nepal_2017_01_01.png", mp, width = 12, height = 4, bg = "white")

mp <- gg_Manhattan(folder = "GWAS_Results/", 
                   trait = "DTF_Nepal_2017", 
                   markers = c("Lcu.2RBY.Chr2p42543877", 
                               "Lcu.2RBY.Chr5p1069654"),
                   labels = c("42543877", "1069654"),
                   vline.colors = rep("red", 2),
                   facet = T, vline.legend = T) %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_DTF_Nepal_2017_02_01.png", mp, width = 12, height = 10, bg = "white")

mp <- gg_Manhattan(folder = "GWAS_Results/", 
                   trait = "DTF_Nepal_2017", 
                   markers = c("Lcu.2RBY.Chr2p42543877", 
                               "Lcu.2RBY.Chr5p1069654"),
                   labels = c("877", "654"),
                   vline.colors = rep("red", 2),
                   facet = T, 
                   models = "MLM") %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_DTF_Nepal_2017_03_01.png", mp, width = 12, height = 4, bg = "white")

Markers

2 strong association peaks can be identified on chromosomes 2 and 5.

list_Top_Markers(trait = "DTF_Nepal_2017", 
                 model = "MLM",
                 folder = "GWAS_Results/", 
                 threshold = 6.7, 
                 chroms = c(2,5), n = 2)
##                      SNP Chr      Pos -log10(p)
## 1 Lcu.2RBY.Chr2p42556949   2 42556949      6.90
## 2 Lcu.2RBY.Chr2p42543877   2 42543877      7.12
## 3  Lcu.2RBY.Chr5p1061938   5  1061938     10.18
## 4  Lcu.2RBY.Chr5p1069654   5  1069654     10.39

mp <- gg_Marker_Box(myG = myG, myY = myY, 
                    trait = "DTF_Nepal_2017",
                    markers = "Lcu.2RBY.Chr2p42543877") +
  labs(caption = myCaption)
ggsave("Figures_DTF_Nepal_2017_04_01.png", mp, width = 6, height = 4)

mp <- gg_Marker_Box(myG = myG, myY = myY, 
                    trait = "DTF_Nepal_2017", 
                    markers = "Lcu.2RBY.Chr5p1069654") +
  labs(caption = myCaption)
ggsave("Figures_DTF_Nepal_2017_04_02.png", mp, width = 6, height = 4)

Now lets plot both markers

mp <- gg_Marker_Box(myG = myG, myY = myY, trait = "DTF_Nepal_2017",
                markers = c("Lcu.2RBY.Chr2p42543877", "Lcu.2RBY.Chr5p1069654")) +
  labs(caption = myCaption)
ggsave("Figures_DTF_Nepal_2017_04_03.png", mp, width = 6, height = 4)

Chr 2 Association

mp <- gg_Manhattan_Zoom(folder = "GWAS_Results/", 
                        trait = "DTF_Nepal_2017", 
                        chr = 2, 
                        markers = "Lcu.2RBY.Chr2p42543877", 
                        labels = "42543877", 
                        vline.color = "red",
                        start = 42543877 - 4000000, 
                        end = 42543877 + 4000000) +
  labs(caption = myCaption)
ggsave("Figures_DTF_Nepal_2017_05_01.png", mp, width = 6, height = 10)

Chr 5 Association

mp <- gg_Manhattan_Zoom(folder = "GWAS_Results/", 
                        trait = "DTF_Nepal_2017", 
                        chr = 5, 
                        markers = "Lcu.2RBY.Chr5p1069654", 
                        labels = "1063138", 
                        vline.color = "red",
                        start = 1063138  - 2000000, 
                        end = 1063138    + 2000000) +
  labs(caption = myCaption)
ggsave("Figures_DTF_Nepal_2017_05_02.png", mp, width = 6, height = 10)

Volcano Plot

mp <- gg_Volcano(folder = "GWAS_Results/", trait = "DTF_Nepal_2017",
                 markers = c("Lcu.2RBY.Chr2p42543877", "Lcu.2RBY.Chr5p1069654"), 
                 labels = c("Chr2p42543877", "Chr5p1069654")) +
  labs(caption = myCaption)
ggsave("Figures_DTF_Nepal_2017_06_01.png", mp, width = 12, height = 4)

DTF Sask 2017

Saskatchewan has drastically different temperature and photoperiods than Nepal, so we should expect different associations.

Manhattan Plots

mp <- gg_Manhattan(folder = "GWAS_Results/", 
                   trait = "DTF_Sask_2017", 
                   markers = c("Lcu.2RBY.Chr2p46908331", 
                               "Lcu.2RBY.Chr6p2528817"),
                   labels = c("331", "817"),
                   vline.colors = c("red","blue"),
                   facet = F, vline.legend = T,
                   models = c("MLM","MLMM","FarmCPU","BLINK")) %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_DTF_Sask_2017_01_01.png", mp, width = 12, height = 4, bg = "white")

mp <- gg_Manhattan(folder = "GWAS_Results_CV_b/", 
                   trait = "DTF_Sask_2017", 
                   title = "DTF_Sask_2017 | CV = b", 
                   markers = c("Lcu.2RBY.Chr2p46908331", 
                               "Lcu.2RBY.Chr6p2528817"),
                   labels = c("331", "817"),
                   vline.colors = c("red","blue"),
                   facet = F, vline.legend = T,
                   models = c("MLM","MLMM","FarmCPU","BLINK")) %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_DTF_Sask_2017_01_02.png", mp, width = 12, height = 4, bg = "white")

mp <- gg_Manhattan(folder = "GWAS_Results/", 
                   trait = "DTF_Sask_2017", 
                   markers = c("Lcu.2RBY.Chr2p46908331", 
                               "Lcu.2RBY.Chr6p2528817"),
                   labels = c("46908331", "2528817"),
                   vline.colors = c("red","blue"),
                   facet = T, vline.legend = T) %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_DTF_Sask_2017_02_01.png", mp, width = 12, height = 10, bg = "white")

mp <- gg_Manhattan(folder = "GWAS_Results_CV_b/", 
                   trait = "DTF_Sask_2017", 
                   title = "DTF_Sask_2017 | CV = b",
                   markers = c("Lcu.2RBY.Chr2p46908331", 
                               "Lcu.2RBY.Chr6p2528817"),
                   labels = c("46908331", "2528817"),
                   vline.colors = c("red","blue"),
                   facet = T, vline.legend = T) %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_DTF_Sask_2017_02_02.png", mp, width = 12, height = 10, bg = "white")

mp <- gg_Manhattan(folder = "GWAS_Results/",
                   trait = "DTF_Sask_2017", 
                   markers = c("Lcu.2RBY.Chr2p46908331", 
                               "Lcu.2RBY.Chr6p2528817"),
                   labels = c("46908331", "2528817"),
                   vline.colors = c("red","blue"),
                   facet = T, vline.legend = T,
                   models = "MLMM") %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_DTF_Sask_2017_03_01.png", mp, width = 12, height = 4, bg = "white")

mp <- gg_Manhattan(folder = "GWAS_Results_CV_b/", 
                   trait = "DTF_Sask_2017", 
                   title = "DTF_Sask_2017 | CV = b",
                   markers = c("Lcu.2RBY.Chr2p46908331", 
                               "Lcu.2RBY.Chr6p2528817"),
                   labels = c("46908331", "2528817"),
                   vline.colors = c("red","blue"),
                   facet = T, vline.legend = T,
                   models = "FarmCPU") %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_DTF_Sask_2017_03_02.png", mp, width = 12, height = 4, bg = "white")

Markers

list_Top_Markers(trait = "DTF_Sask_2017", model = "FarmCPU",
                 folder = "GWAS_Results/", n = 1, threshold = 6.7)
##                       SNP Chr       Pos -log10(p)
## 1  Lcu.2RBY.Chr2p46908331   2  46908331      7.29
## 2 Lcu.2RBY.Chr3p414296678   3 414296678      8.06
## 3 Lcu.2RBY.Chr4p389008631   4 389008631      9.83
## 4 Lcu.2RBY.Chr5p165493259   5 165493259      7.06
## 5    Lcu.2RBY.Chr7p832815   7    832815      9.19
list_Top_Markers(trait = "DTF_Sask_2017", model = "MLM",
                 folder = "GWAS_Results/", chroms = 6, n = 1)
##                     SNP Chr     Pos -log10(p)
## 1 Lcu.2RBY.Chr6p2528817   6 2528817      5.48

mp <- gg_Marker_Box(myG = myG, myY = myY, trait = "DTF_Sask_2017",
                    markers = "Lcu.2RBY.Chr2p46908331") +
  labs(caption = myCaption)
ggsave("Figures_DTF_Sask_2017_04_01.png", mp, width = 6, height = 4)

mp <- gg_Marker_Box(myG = myG, myY = myY, trait = "DTF_Sask_2017",
                    markers = "Lcu.2RBY.Chr6p2528817") +
  labs(caption = myCaption)
ggsave("Figures_DTF_Sask_2017_04_02.png", mp, width = 6, height = 4)

Chr 6 Association

mp <- gg_Manhattan_Zoom(folder = "GWAS_Results_CV_b/", 
                        trait = "DTF_Sask_2017", 
                        title = "DTF_Sask_2017 | CV = b", 
                        chr = 6,
                        markers = "Lcu.2RBY.Chr6p2528817", 
                        labels = "2528817", 
                        vline.color = "blue",
                        start = 2528817 - 2000000, 
                        end = 2528817 + 2000000) +
  labs(caption = myCaption)
ggsave("Figures_DTF_Sask_2017_05_01.png", mp, width = 6, height = 10)

Volcano Plot

mp <- gg_Volcano(folder = "GWAS_Results/", trait = "DTF_Sask_2017",
                 markers = "Lcu.2RBY.Chr6p2528817", 
                 labels = "Chr6p2528817") +
  labs(caption = myCaption)
ggsave("Figures_DTF_Sask_2017_06_01.png", mp, width = 12, height = 4)

mp <- gg_Volcano(folder = "GWAS_Results_CV_b/", trait = "DTF_Sask_2017",
                 markers = "Lcu.2RBY.Chr6p2528817", 
                 labels = "Chr6p2528817",
                 title = "DTF_Sask_2017 | CV = b") +
  labs(caption = myCaption)
ggsave("Figures_DTF_Sask_2017_06_02.png", mp, width = 12, height = 4)

Heterozygosity

# For each marker
myGD <- myG
myGD$Major <- substr(myG$alleles, 1, 1)
myGD$Minor <- substr(myG$alleles, 3, 3)
#
myG_MAF <- function(x) { sum(x %in% x[1], na.rm = T)-1 }
myG_Het <- function(x) { sum(x %in% c("R","Y","S","W","K","M"), na.rm = T) }
myG_Homo <- function(x) { sum(x %in% c("A", "C", "G", "T"), na.rm = T) }
myG_N <- function(x) { sum(x %in% "N", na.rm = T) }
myGD <- myGD %>%
  mutate(numMajor = apply(myGD[,c("Major", as.character(myY$Name))], 1, myG_MAF),
         numMinor = apply(myGD[,c("Minor", as.character(myY$Name))], 1, myG_MAF),
         numN     = apply(myGD[,as.character(myY$Name)], 1, myG_N),
         numHet   = apply(myGD[,as.character(myY$Name)], 1, myG_Het),
         numHomo  = apply(myGD[,as.character(myY$Name)], 1, myG_Homo),
         #
         Het = numHet / (numHet + numHomo),
         MAF = (numMinor*2 + numHet) / (numMinor*2 + numMajor*2 + numHet*2),
         MCF = numN / (numHet + numHomo))
myGD <- myGD %>% select(rs, alleles, Major, Minor, chrom, pos, numMajor, numMinor,
                        numHet, numN, numHomo, Het, MAF, MCF)
write.csv(myGD, "myG_Details.csv", row.names = F)
# Plot
mp1 <- ggplot(myGD, aes(x = Het * 100)) + 
  geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
  theme_gwaspr() +
  labs(title = "Marker Details", 
       subtitle = "Heterozygosity", x = NULL, y = NULL)
mp2 <- ggplot(myGD, aes(x = MAF * 100)) + 
  geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
  theme_gwaspr() +
  labs(subtitle = "Minor Allele Frequency", x = "Percent", y = NULL)
mp3 <- ggplot(myGD, aes(x = MCF * 100)) + 
  geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
  theme_gwaspr() +
  labs(subtitle = "Missing Call Frequency", x = NULL, y = NULL)
mp <- ggpubr::ggarrange(mp1, mp2, mp3, ncol = 3, align = "h") %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_03_myG_Details.png", mp, width = 12, height = 4)

# For each genotype
myYD <- myY %>% mutate(Het = NA, MAF = NA, MCF = NA)
#i<-12
for(i in (12:ncol(myG))) {
  xx <- myG[,i]
  numHet <- sum(xx %in% c("R","Y","S","W","K","M"), na.rm = T)
  numHomo <- sum(xx %in% c("A", "C", "G", "T"), na.rm = T)
  numMiss <- sum(xx %in% "N", na.rm = T)
  numMinor <- sum(xx == myGD$Minor, na.rm = T)
  numMajor <- sum(xx == myGD$Major, na.rm = T)
  #
  myYD$Het[myYD$Name == colnames(myG)[i]] <- numHet / (numHet + numHomo)
  myYD$MAF[myYD$Name == colnames(myG)[i]] <- (numMinor*2 + numHet) / (numMinor*2 + numMajor*2 + numHet*2)
  myYD$MCF[myYD$Name == colnames(myG)[i]] <- numMiss / (numHet + numHomo + numMiss)
}
write.csv(myYD, "myY_Details.csv", row.names = F)
# Plot
mp1 <- ggplot(myYD, aes(x = Het * 100)) + 
  geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
  theme_gwaspr() +
  labs(title = "Genotype Details", 
       subtitle = "Heterozygosity", x = NULL, y = NULL)
mp2 <- ggplot(myYD, aes(x = MAF * 100)) + 
  geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
  theme_gwaspr() +
  labs(subtitle = "Minor Allele Frequency", x = "Percent", y = NULL)
mp3 <- ggplot(myYD, aes(x = MCF * 100)) + 
  geom_histogram(color = "black", fill = "darkgreen", alpha = 0.7) +
  theme_gwaspr() +
  labs(subtitle = "Missing Call Frequency", x = NULL, y = NULL)
mp <- ggpubr::ggarrange(mp1, mp2, mp3, ncol = 3, align = "h") %>%
  annotate_figure(fig.lab.pos = "bottom.right", fig.lab.size = 6,
                  fig.lab = myCaption)
ggsave("Figures_04_myY_Details.png", mp, width = 12, height = 4)

Minor Allele Frequency

xx <- read.csv("GWAS_Results/GAPIT.Association.GWAS_Results.FarmCPU.Testa_Pattern.csv")
mp <- ggplot(xx, aes(x = MAF, y = -log10(P.value))) + 
  geom_hline(yintercept = -log10(0.05 / (nrow(myG)-1)), 
             color = "red", alpha = 0.7) +
  geom_point(pch = 21, color = "black", fill = "darkgreen", alpha = 0.5) + 
  facet_grid("FarmCPU" ~ .) +
  theme_gwaspr() +
  labs(title = "Testa_Pattern", y = "-log10(p)", caption = myCaption)
ggsave("Figures_05_MAF_Testa_Pattern.png", mp, width = 6, height = 4)

xx <- read.csv("GWAS_Results/GAPIT.Association.GWAS_Results.FarmCPU.DTF_Nepal_2017.csv")
mp <- ggplot(xx, aes(x = MAF, y = -log10(P.value))) + 
  geom_hline(yintercept = -log10(0.05 / (nrow(myG)-1)), 
             color = "red", alpha = 0.7) +
  geom_point(pch = 21, color = "black", fill = "darkgreen", alpha = 0.5) + 
  facet_grid("FarmCPU" ~ .) +
  theme_gwaspr() +
  labs(title = "DTF_Nepal_2017", y = "-log10(p)", caption = myCaption)
ggsave("Figures_05_MAF_DTF_Nepal_2017.png", mp, width = 6, height = 4)

xx <- read.csv("GWAS_Results_CV_b/GAPIT.Association.GWAS_Results.MLM.DTF_Sask_2017.csv")
mp <- ggplot(xx, aes(x = MAF, y = -log10(P.value))) + 
  geom_hline(yintercept = -log10(0.05 / (nrow(myG)-1)), 
             color = "red", alpha = 0.7) +
  geom_point(pch = 21, color = "black", fill = "darkgreen", alpha = 0.5) + 
  facet_grid("MLM" ~ .) +
  theme_gwaspr() +
  labs(title = "DTF_Sask_2017", y = "-log10(p)", caption = myCaption)
ggsave("Figures_05_MAF_DTF_Sask_2017.png", mp, width = 6, height = 4)

xx <- read.csv("GWAS_Results_CV_b/GAPIT.Association.GWAS_Results.MLM.DTF_Sask_2017.csv")
mp <- ggplot(xx, aes(x = MAF, y = -log10(P.value))) + 
  geom_hline(yintercept = -log10(0.05 / (nrow(myG)-1)), 
             color = "red", alpha = 0.7) +
  geom_point(pch = 21, color = "black", fill = "darkgreen", alpha = 0.5) + 
  facet_grid("MLMM" ~ .) +
  theme_gwaspr() +
  labs(title = "DTF_Sask_2017_CV_b", y = "-log10(p)", caption = myCaption)
ggsave("Figures_05_MAF_DTF_Sask_2017_CV_b.png", mp, width = 6, height = 4)

PCA

library(plotly)
library(htmlwidgets)
xx <- read.csv("GWAS_Results/GAPIT.Genotype.PCA.csv") %>%
  left_join(myY, by = c("taxa"="Name"))

Testa Pattern

https://derekmichaelwright.github.io/dblogr/academic/gwas_tutorial/Figures_PCA_Testa_Pattern.html

myColors <- c("darkgoldenrod3", "darkblue")
mp <- plot_ly(xx, x = ~PC1, y = ~PC2, z = ~PC3, text = ~taxa,
              color = ~Testa_Pattern, colors = myColors) %>%
  add_markers() %>%
  layout(scene = list(title = "Testa Pattern"))
saveWidget(as_widget(mp), "Figures_PCA_Testa_Pattern.html")
mp

DTF Nepal 2017

https://derekmichaelwright.github.io/dblogr/academic/gwas_tutorial/Figures_PCA_DTF_Nepal_2017.html

myColors <- c("darkred", "darkgoldenrod3", "darkblue")
mp <- plot_ly(xx, x = ~PC1, y = ~PC2, z = ~PC3, text = ~taxa,
              color = ~DTF_Nepal_2017, colors = myColors) %>%
  add_markers() %>%
  layout(scene = list(title = "DTF Nepal 2017"))
saveWidget(as_widget(mp), "Figures_PCA_DTF_Nepal_2017.html")
mp

DTF Sask 2017

https://derekmichaelwright.github.io/dblogr/academic/gwas_tutorial/Figures_PCA_DTF_Sask_2017.html

mp <- plot_ly(xx, x = ~PC1, y = ~PC2, z = ~PC3, text = ~taxa,
              color = ~DTF_Sask_2017, colors = myColors) %>%
  add_markers() %>%
  layout(scene = list(title = "DTF Sask 2017"))
saveWidget(as_widget(mp), "Figures_PCA_DTF_Sask_2017.html")
mp

SNP Density Plot

Package Source: https://github.com/YinLiLin/R-CMplot

library(CMplot)
xx <- myG %>% select(rs, chrom, pos) 
CMplot(xx, plot.type="d", bin.size=1e7, 
       chr.den.col=c("darkgreen", "darkgoldenrod3", "darkred"), 
       file="jpg", file.output=T, verbose=F, width=9, height=6,
       file.name = "LDP")


dblogr.com/


© Derek Michael Wright