dbplyr
This document exemplifies how the dbplyr
package can be used to interact with a remote (public) database.
We will access data from ENSEMBL as an example. ENSEMBL hosts a variety of data on many organisms such as genome assemblies, gene annotations, variants, and comparative genomics.
Before we connect to a database we need to know:
A web-search for “public ENSEMBL datase” takes us to this page which gives all this information:
We start by loading the necessary packages, which include the RMariaDB
package, which allows us to connect to either MySQL or MariaDB databases:
library(tidyverse)
library(dbplyr)
library(RMariaDB) # note this replaces RMySQL package which is being discontinued
Then, we use the information from the ENSEMBL page to make our connection:
ensembl_con <- dbConnect(MariaDB(),
host = "ensembldb.ensembl.org",
user = "anonymous",
port = 5306,
password = "")
In reality, ENSEMBL hosts many databases, for different types of information. And each of those databases itself has many many tables. Details can be found on this page.
To see which databases are available for us to pick one and connect to, we can use this command:
dbGetQuery(ensembl_con, "SHOW DATABASES")
Wow, there are almost 9k databases that we can pick from! (ENSEMBL is a big infrastructure after all…). Let’s focus on the Comparative Genomics set of databases (they are called “compara”):
dbGetQuery(ensembl_con, "SHOW DATABASES") %>%
# filter the table for cases where the the string "compara" is present
filter(str_detect(Database, "compara")) %>%
tail()
We still get many versions of this type of databases - these refer to the different versions of the database. Let’s pick the most recent one and connect to that one:
# This is similar to before, except now we are specifying the dbname
compara_con <- dbConnect(MariaDB(),
dbname = "ensembl_compara_95",
host = "ensembldb.ensembl.org",
user = "anonymous",
port = 5306,
password = "")
From here on, we can start using dbplyr
syntax to explore this database. Let’s start by seeing the table names we have in this database:
# List table names in the database
src_dbi(compara_con)
## src: mysql 5.6.33 [anonymous@ensembldb.ensembl.org:/ensembl_compara_95]
## tbls: CAFE_gene_family, CAFE_species_gene, conservation_score,
## constrained_element, dnafrag, dnafrag_region, exon_boundaries,
## external_db, family, family_member, gene_align, gene_align_member,
## gene_member, gene_member_hom_stats, gene_member_qc, gene_tree_node,
## gene_tree_node_attr, gene_tree_node_tag, gene_tree_object_store,
## gene_tree_root, gene_tree_root_attr, gene_tree_root_tag, genome_db,
## genomic_align, genomic_align_block, genomic_align_tree, hmm_annot,
## hmm_curated_annot, hmm_profile, homology, homology_member,
## mapping_session, member_xref, meta, method_link,
## method_link_species_set, method_link_species_set_attr,
## method_link_species_set_tag, ncbi_taxa_name, ncbi_taxa_node,
## other_member_sequence, peptide_align_feature, seq_member,
## seq_member_projection, seq_member_projection_stable_id, sequence,
## species_set, species_set_header, species_set_tag, species_tree_node,
## species_tree_node_attr, species_tree_node_tag, species_tree_root,
## stable_id_history, synteny_region
Compare these with what is found on the respective information page.
Let’s say we want to know what the homology is between Humans and some other animals. One of the tables in the database is called “homology”, let’s look at it:
compara_homology <- tbl(compara_con, "homology")
head(compara_homology)
Well, we can see there’s some useful information there, like dn
and ds
metrics (non-synonymous and synonymous substitutions), but we don’t know to which species comparison each of these correspond to.
There is, however, a column called method_link_species_set_id
, which allows us to match these entries to the entries in yet another table of our database:
compara_linksp <- tbl(compara_con, "method_link_species_set")
head(compara_linksp)
And there we see that we have this same column, along with a column called “name”, which contains the shortened name of the species compared.
Let’s say we’re interested in ortholog genes between humans and either chimps or mice, and filter our table accordingly:
sapiens_ortho <- compara_linksp %>%
filter(name == "H.sap-P.tro orthologues" | name == "H.sap-M.mus orthologues")
sapiens_ortho
So, now we can use this new table and join it with the homology
table, using the method_link_species_set_id
variable. Because this takes a while to run and it is the final result we are interested in, we also collect the results (i.e. import them into R as a tibble
):
compara_result <- sapiens_ortho %>%
left_join(compara_homology, by = "method_link_species_set_id") %>%
collect()
head(compara_result)
So now, we can for example count how many orthologues there are for each species comparison:
compara_result %>%
count(name, description)
Or make a graph showing the dS/dN ratio distributions:
compara_result %>%
ggplot(aes(name, ds/dn, fill = description)) +
geom_boxplot() + scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 15402 rows containing non-finite values (stat_boxplot).