This document exemplifies how the dbplyr package can be used to interact with a remote (public) database.

We will access data from ENSEMBL as an example. ENSEMBL hosts a variety of data on many organisms such as genome assemblies, gene annotations, variants, and comparative genomics.

Connecting to ENSEMBL database

Before we connect to a database we need to know:

  • what type of database system it is (e.g. SQLite, MySQL, PostgreSQL, …)
  • any credentials needed to connect to that database (a username, password, etc)

A web-search for “public ENSEMBL datase” takes us to this page which gives all this information:

  • ENSEMBL uses MySQL or MariaDB databases to store its information (these two are kind of like “cousins” of each other… sorry, it’s confusing, but for our purposes they are identical)
  • There are several public servers referring to different versions of the databases and different types of information, we pick the top one in the table at the bottom of that page

We start by loading the necessary packages, which include the RMariaDB package, which allows us to connect to either MySQL or MariaDB databases:

library(tidyverse)
library(dbplyr)
library(RMariaDB)  # note this replaces RMySQL package which is being discontinued

Then, we use the information from the ENSEMBL page to make our connection:

ensembl_con <- dbConnect(MariaDB(),
                         host = "ensembldb.ensembl.org", 
                         user = "anonymous",
                         port = 5306,
                         password = "")

In reality, ENSEMBL hosts many databases, for different types of information. And each of those databases itself has many many tables. Details can be found on this page.

To see which databases are available for us to pick one and connect to, we can use this command:

dbGetQuery(ensembl_con, "SHOW DATABASES")

Wow, there are almost 9k databases that we can pick from! (ENSEMBL is a big infrastructure after all…). Let’s focus on the Comparative Genomics set of databases (they are called “compara”):

dbGetQuery(ensembl_con, "SHOW DATABASES") %>% 
  # filter the table for cases where the the string "compara" is present
  filter(str_detect(Database, "compara")) %>% 
  tail()

We still get many versions of this type of databases - these refer to the different versions of the database. Let’s pick the most recent one and connect to that one:

# This is similar to before, except now we are specifying the dbname
compara_con <- dbConnect(MariaDB(),
                         dbname = "ensembl_compara_95",
                         host = "ensembldb.ensembl.org", 
                         user = "anonymous",
                         port = 5306,
                         password = "")

Querying individual database tables

From here on, we can start using dbplyr syntax to explore this database. Let’s start by seeing the table names we have in this database:

# List table names in the database
src_dbi(compara_con)
## src:  mysql 5.6.33 [anonymous@ensembldb.ensembl.org:/ensembl_compara_95]
## tbls: CAFE_gene_family, CAFE_species_gene, conservation_score,
##   constrained_element, dnafrag, dnafrag_region, exon_boundaries,
##   external_db, family, family_member, gene_align, gene_align_member,
##   gene_member, gene_member_hom_stats, gene_member_qc, gene_tree_node,
##   gene_tree_node_attr, gene_tree_node_tag, gene_tree_object_store,
##   gene_tree_root, gene_tree_root_attr, gene_tree_root_tag, genome_db,
##   genomic_align, genomic_align_block, genomic_align_tree, hmm_annot,
##   hmm_curated_annot, hmm_profile, homology, homology_member,
##   mapping_session, member_xref, meta, method_link,
##   method_link_species_set, method_link_species_set_attr,
##   method_link_species_set_tag, ncbi_taxa_name, ncbi_taxa_node,
##   other_member_sequence, peptide_align_feature, seq_member,
##   seq_member_projection, seq_member_projection_stable_id, sequence,
##   species_set, species_set_header, species_set_tag, species_tree_node,
##   species_tree_node_attr, species_tree_node_tag, species_tree_root,
##   stable_id_history, synteny_region

Compare these with what is found on the respective information page.

Let’s say we want to know what the homology is between Humans and some other animals. One of the tables in the database is called “homology”, let’s look at it:

compara_homology <- tbl(compara_con, "homology")
head(compara_homology)

Well, we can see there’s some useful information there, like dn and ds metrics (non-synonymous and synonymous substitutions), but we don’t know to which species comparison each of these correspond to.

There is, however, a column called method_link_species_set_id, which allows us to match these entries to the entries in yet another table of our database:

compara_linksp <- tbl(compara_con, "method_link_species_set")
head(compara_linksp)

And there we see that we have this same column, along with a column called “name”, which contains the shortened name of the species compared.

Let’s say we’re interested in ortholog genes between humans and either chimps or mice, and filter our table accordingly:

sapiens_ortho <- compara_linksp %>% 
  filter(name == "H.sap-P.tro orthologues" | name == "H.sap-M.mus orthologues")

sapiens_ortho

Join tables and collect results

So, now we can use this new table and join it with the homology table, using the method_link_species_set_id variable. Because this takes a while to run and it is the final result we are interested in, we also collect the results (i.e. import them into R as a tibble):

compara_result <- sapiens_ortho %>% 
  left_join(compara_homology, by = "method_link_species_set_id") %>% 
  collect()

head(compara_result)

So now, we can for example count how many orthologues there are for each species comparison:

compara_result %>% 
  count(name, description)

Or make a graph showing the dS/dN ratio distributions:

compara_result %>% 
  ggplot(aes(name, ds/dn, fill = description)) + 
  geom_boxplot() + scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 15402 rows containing non-finite values (stat_boxplot).