This vignette shows an example of how to retrieve gene lists around QTL of interest. It’s still in progress and simply has some ideas of how this might be done.

At the moment there’s no method implemented in MagicHelpR to define what a QTL peak is (or its confidence interval). In the original MAGIC paper they have a method to define confidence intervals for the QTL, but I could not get my head around it yet.

More naively, one could define a QTL peak as a region within a certain Kb distance from the top SNP. This is the starting point in this vignette (perhaps in the future this will be improved).

Prepare data

I use the same data shown in the introduction vignette.

I will use flowering time as an example.

The code to reproduce the analysis shown in that vignette is here for convenience:

library(MagicHelpR)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: tidyr
# Download genotype and phenotype example data
#downloadArabMagic('~/temp/magic_intro/', tidy = TRUE, example_data = TRUE)

# Read phenotype data
pheno <- read.table('~/temp/magic_intro/magic_phenotype_example.txt', header = T)

# Create the MagicGenPhen object containing genotypes and phenotype
magic_phen <- magicFounderReconstruct(snp_dir = "~/temp/magic_intro/",
                                                                            phenotypes = pheno, id = "SUBJECT.NAME")
## mindist: 1e-05
## datafile /home/hugo/temp/magic_intro//all_chr.MAGIC.happy.data allelesfile /home/hugo/temp/magic_intro//all_chr.MAGIC.happy.alleles gen 7
## genotype phase: unknown
## Reading phenotype and genotype data from ped file /home/hugo/temp/magic_intro//all_chr.MAGIC.happy.data
## Number of individuals: 703  
## Number of markers:     1259 
## Number of strains:     19   
## Use Parents:           no
## Number of subjects with two parents: 0    
## Number of parents in nuclear families: 0    
## null model mean nan var nan
## assuming haploid(inbred) genotypes
## datafile /home/hugo/temp/magic_intro//all_chr.MAGIC.happy.data allelesfile /home/hugo/temp/magic_intro//all_chr.MAGIC.happy.alleles gen 7
# QTL scan for days.to.bolt trait
bolt_qtl <- scanQtl(magic_phen, "days.to.bolt")
## F-test comparing the models:
##   H1: days.to.bolt ~ GEN
##   H0: days.to.bolt ~ 1
## Elapsed time: 31 seconds

This is our QTL plot:

# Load the library and change the default theme
library(ggplot2); theme_set(theme_bw())

# Make the plot, with an horizontal line at 3.5, which is suggested in Kover et al. (2009)
ggplot(bolt_qtl, aes(bp/1e6, -log10(p))) +
  geom_line() +
  facet_grid(~ chromosome, scales = "free_x", space = "free_x") +
  geom_hline(yintercept = 3.5, linetype = "dotted")
## Warning: Removed 2 rows containing missing values (geom_path).

Retrieving genes around a QTL

Let’s say we were interested in which genes are around the peak on Chr4.

# Filter the data to retain only the peak marker on Chromosome 4
chr4_peak <- bolt_qtl %>% 
    filter(chromosome == 4) %>% 
    arrange(p) %>% 
    slice(1) %>% 
    select(marker, bp, chromosome)

chr4_peak
## # A tibble: 1 x 3
##   marker       bp chromosome
##   <I(chr)>  <dbl> <I(chr)>  
## 1 FRI_2343 270862 4

As I said above, at the moment I don’t know what the best way is to define where a peak starts and ends.

However, as a starting point, let’s consider all genes within 50Kb of the top SNP on Chr4.

# Add positions for window start and end
chr4_peak <- chr4_peak %>% 
    mutate(window_start = bp - 50e3,
                 window_end = bp + 50e3)

chr4_peak
## # A tibble: 1 x 5
##   marker       bp chromosome window_start window_end
##   <I(chr)>  <dbl> <I(chr)>          <dbl>      <dbl>
## 1 FRI_2343 270862 4                220862     320862

I will use the package biomaRt to retrieve gene information from BioMart.

library("biomaRt")

# List of BioMart plant databases
listMarts(host="plants.ensembl.org")
##             biomart                      version
## 1       plants_mart      Ensembl Plants Genes 38
## 2 plants_variations Ensembl Plants Variations 38
# Specify the plants_mart
m <- useMart("plants_mart", host="plants.ensembl.org")

# Check the Arabidopsis dataset name
listDatasets(m) %>% filter(grepl("Arabidopsis", description))
##             dataset                         description version
## 1 athaliana_eg_gene Arabidopsis thaliana genes (TAIR10)  TAIR10
## 2   alyrata_eg_gene    Arabidopsis lyrata genes (v.1.0)   v.1.0
# Define dataset
m <- useMart("plants_mart", host="plants.ensembl.org", dataset="athaliana_eg_gene")

The function getBM can then be used to retrieve genes from the BioMart database.

# Download gene list with certain attributes
## attribute list can be obtained with listAttributes(m)
biomart_genes <- getBM(attributes=c("tair_locus", "tair_symbol", "external_gene_name",
                                    "chromosome_name", "start_position", "end_position",
                                    "description"),
                       mart=m) %>% 
  filter(tair_locus != "")  # some attributes have no "tair_locus" ID!

head(biomart_genes)
##   tair_locus tair_symbol external_gene_name chromosome_name start_position
## 1  AT5G16970      AT-AER                AER               5        5575973
## 2  AT4G32100                                              4       15511757
## 3  AT2G43120                                              2       17927181
## 4  AT1G30814                                              1       10942648
## 5  AT3G18710       PUB29              PUB29               3        6434082
## 6  AT4G25880       APUM6              APUM6               4       13154708
##   end_position
## 1      5578086
## 2     15512218
## 3     17929138
## 4     10944727
## 5      6436046
## 6     13159382
##                                                                                                                                                                                                                                 description
## 1                                                                                                                                               NADPH-dependent oxidoreductase 2-alkenal reductase [Source:UniProtKB/Swiss-Prot;Acc:Q39172]
## 2                                                                                                                                              Beta-1,3-N-Acetylglucosaminyltransferase family protein [Source:UniProtKB/TrEMBL;Acc:F4JTI5]
## 3                                                                                                                                                                          RmlC-like cupins superfamily protein [Source:TAIR;Acc:AT2G43120]
## 4 unknown protein; Has 35333 Blast hits to 34131 proteins in 2444 species: Archae - 798; Bacteria - 22429; Metazoa - 974; Fungi - 991; Plants - 531; Viruses - 0; Other Eukaryotes - 9610 (source: NCBI BLink). [Source:TAIR;Acc:AT1G30814]
## 5                                                                                                                                                               RING-type E3 ubiquitin transferase [Source:UniProtKB/TrEMBL;Acc:A0A178VJJ8]
## 6                                                                                                                                                                                                     pumilio 6 [Source:TAIR;Acc:AT4G25880]

This table should contain all Arabidopsis genes from BioMart.

We can then filter it based on our peak QTL:

chr4_genes <- biomart_genes %>% 
  filter(chromosome_name == chr4_peak$chromosome &
         start_position > chr4_peak$window_start & 
           end_position < chr4_peak$window_end)

There’s now 29 in this list. We can further arrange them by their closeness to our top SNP:

# Get the middle position of each gene and get how far it is from the top SNP
# arrange by distance and get the first few rows of the table
chr4_genes %>% 
    mutate(mid_position = round((start_position + end_position)/2),
           dist_snp = abs(mid_position - chr4_peak$bp)) %>% 
    arrange(dist_snp) %>%
    head()
##   tair_locus tair_symbol external_gene_name chromosome_name start_position
## 1  AT4G00650         FRI                FRI               4         268901
## 2  AT4G00651                                              4         271487
## 3  AT4G00660         RH8                RH8               4         274252
## 4  AT4G00630        KEA2               KEA2               4         261246
## 5  AT4G00670                                              4         278316
## 6  AT4G00680        ADF8               ADF8               4         279596
##   end_position
## 1       271503
## 2       271879
## 3       278737
## 4       268097
## 5       279197
## 6       281072
##                                                                                                                                                                                                                                                                                                                                                                                                        description
## 1                                                                                                                                                                                                                                                                                                                                                Inactive protein FRIGIDA [Source:UniProtKB/Swiss-Prot;Acc:Q67Z93]
## 2 FUNCTIONS IN: molecular_function unknown; INVOLVED IN: biological_process unknown; LOCATED IN: endomembrane system; BEST Arabidopsis thaliana protein match is: Fe superoxide dismutase 2 (TAIR:AT5G51100.1); Has 34 Blast hits to 34 proteins in 11 spec /.../rchae - 0; Bacteria - 0; Metazoa - 0; Fungi - 0; Plants - 34; Viruses - 0; Other Eukaryotes - 0 (source: NCBI BLink). [Source:TAIR;Acc:AT4G00651]
## 3                                                                                                                                                                                                                                                                                                                                   DEAD-box ATP-dependent RNA helicase 8 [Source:UniProtKB/Swiss-Prot;Acc:Q8RXK6]
## 4                                                                                                                                                                                                                                                                                                                                                      K+ efflux antiporter 2 [Source:UniProtKB/TrEMBL;Acc:F4JHE9]
## 5                                                                                                                                                                                                                                                                                                                                                                   At4g00670 [Source:UniProtKB/TrEMBL;Acc:Q6IDB4]
## 6                                                                                                                                                                                                                                                                                                                                                        actin depolymerizing factor 8 [Source:TAIR;Acc:AT4G00680]
##   mid_position dist_snp
## 1       270202      660
## 2       271683      821
## 3       276494     5632
## 4       264672     6190
## 5       278756     7894
## 6       280334     9472

In this case, the gene FRI is the closest to our top SNP, and it is likely to be a good candidate (as it is involved in flowering).