• Connecting to ENSEMBL database
  • Querying individual database tables
  • Join tables and collect results

This document exemplifies how the dbplyr package can be used to interact with a remote (public) database.

We will access data from ENSEMBL as an example. ENSEMBL hosts a variety of data on many organisms such as genome assemblies, gene annotations, variants, and comparative genomics.

Connecting to ENSEMBL database

Before we connect to a database we need to know:

  • what type of database system it is (e.g. SQLite, MySQL, PostgreSQL, …)
  • any credentials needed to connect to that database (a username, password, etc)

A web-search for “public ENSEMBL datase” takes us to this page which gives all this information:

  • ENSEMBL uses MySQL or MariaDB databases to store its information (these two are kind of like “cousins” of each other… sorry, it’s confusing, but for our purposes they are identical)
  • There are several public servers referring to different versions of the databases and different types of information, we pick the top one in the table at the bottom of that page

We start by loading the necessary packages, which include the RMariaDB package, which allows us to connect to either MySQL or MariaDB databases:

library(tidyverse)
library(dbplyr)
library(RMariaDB)  # note this replaces RMySQL package which is being discontinued

Then, we use the information from the ENSEMBL page to make our connection:

ensembl_con <- dbConnect(MariaDB(),
                         host = "ensembldb.ensembl.org", 
                         user = "anonymous",
                         port = 5306,
                         password = "")

In reality, ENSEMBL hosts many databases, for different types of information. And each of those databases itself has many many tables. Details can be found on this page.

To see which databases are available for us to pick one and connect to, we can use this command:

dbGetQuery(ensembl_con, "SHOW DATABASES")

Wow, there are almost 9k databases that we can pick from! (ENSEMBL is a big infrastructure after all…). Let’s focus on the Comparative Genomics set of databases (they are called “compara”):

dbGetQuery(ensembl_con, "SHOW DATABASES") %>% 
  # filter the table for cases where the the string "compara" is present
  filter(str_detect(Database, "compara")) %>% 
  tail()
ABCDEFGHIJ0123456789
 
 
Database
<chr>
43ensembl_compara_90
44ensembl_compara_91
45ensembl_compara_92
46ensembl_compara_93
47ensembl_compara_94
48ensembl_compara_95

We still get many versions of this type of databases - these refer to the different versions of the database. Let’s pick the most recent one and connect to that one:

# This is similar to before, except now we are specifying the dbname
compara_con <- dbConnect(MariaDB(),
                         dbname = "ensembl_compara_95",
                         host = "ensembldb.ensembl.org", 
                         user = "anonymous",
                         port = 5306,
                         password = "")

Querying individual database tables

From here on, we can start using dbplyr syntax to explore this database. Let’s start by seeing the table names we have in this database:

# List table names in the database
src_dbi(compara_con)
## src:  mysql 5.6.33 [anonymous@ensembldb.ensembl.org:/ensembl_compara_95]
## tbls: CAFE_gene_family, CAFE_species_gene, conservation_score,
##   constrained_element, dnafrag, dnafrag_region, exon_boundaries,
##   external_db, family, family_member, gene_align, gene_align_member,
##   gene_member, gene_member_hom_stats, gene_member_qc, gene_tree_node,
##   gene_tree_node_attr, gene_tree_node_tag, gene_tree_object_store,
##   gene_tree_root, gene_tree_root_attr, gene_tree_root_tag, genome_db,
##   genomic_align, genomic_align_block, genomic_align_tree, hmm_annot,
##   hmm_curated_annot, hmm_profile, homology, homology_member,
##   mapping_session, member_xref, meta, method_link,
##   method_link_species_set, method_link_species_set_attr,
##   method_link_species_set_tag, ncbi_taxa_name, ncbi_taxa_node,
##   other_member_sequence, peptide_align_feature, seq_member,
##   seq_member_projection, seq_member_projection_stable_id, sequence,
##   species_set, species_set_header, species_set_tag, species_tree_node,
##   species_tree_node_attr, species_tree_node_tag, species_tree_root,
##   stable_id_history, synteny_region

Compare these with what is found on the respective information page.

Let’s say we want to know what the homology is between Humans and some other animals. One of the tables in the database is called “homology”, let’s look at it:

compara_homology <- tbl(compara_con, "homology")
head(compara_homology)
ABCDEFGHIJ0123456789
homology_id
<int>
method_link_species_set_id
<int>
description
<chr>
is_tree_compliant
<int>
dn
<dbl>
1109995other_paralog10.7404
2109995other_paralog10.6628
3109995other_paralog10.6180
4109995other_paralog10.6611
5109995other_paralog10.6661
6109995other_paralog10.7385

Well, we can see there’s some useful information there, like dn and ds metrics (non-synonymous and synonymous substitutions), but we don’t know to which species comparison each of these correspond to.

There is, however, a column called method_link_species_set_id, which allows us to match these entries to the entries in yet another table of our database:

compara_linksp <- tbl(compara_con, "method_link_species_set")
head(compara_linksp)
ABCDEFGHIJ0123456789
method_link_species_set_id
<int>
method_link_id
<int>
species_set_id
<int>
443132353
5441633947
5551634144
5691634350
5871634701
5921634706

And there we see that we have this same column, along with a column called “name”, which contains the shortened name of the species compared.

Let’s say we’re interested in ortholog genes between humans and either chimps or mice, and filter our table accordingly:

sapiens_ortho <- compara_linksp %>% 
  filter(name == "H.sap-P.tro orthologues" | name == "H.sap-M.mus orthologues")

sapiens_ortho
ABCDEFGHIJ0123456789
method_link_species_set_id
<int>
method_link_id
<int>
species_set_id
<int>
5097620135673
10389020139100

Join tables and collect results

So, now we can use this new table and join it with the homology table, using the method_link_species_set_id variable. Because this takes a while to run and it is the final result we are interested in, we also collect the results (i.e. import them into R as a tibble):

compara_result <- sapiens_ortho %>% 
  left_join(compara_homology, by = "method_link_species_set_id") %>% 
  collect()

head(compara_result)
ABCDEFGHIJ0123456789
method_link_species_set_id
<int>
method_link_id
<int>
species_set_id
<int>
5097620135673
5097620135673
5097620135673
5097620135673
5097620135673
5097620135673

So now, we can for example count how many orthologues there are for each species comparison:

compara_result %>% 
  count(name, description)
ABCDEFGHIJ0123456789
name
<chr>
description
<chr>
nn
<int>
H.sap-M.mus orthologuesortholog_many2many7259
H.sap-M.mus orthologuesortholog_one2many4047
H.sap-M.mus orthologuesortholog_one2one16678
H.sap-P.tro orthologuesortholog_many2many719
H.sap-P.tro orthologuesortholog_one2many3556
H.sap-P.tro orthologuesortholog_one2one21749

Or make a graph showing the dS/dN ratio distributions:

compara_result %>% 
  ggplot(aes(name, ds/dn, fill = description)) + 
  geom_boxplot() + scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 15402 rows containing non-finite values (stat_boxplot).