Question

Different annotation packages (org.Mm.eg.db, EnsDb.Mmusculus.v79, BiomaRt) returning drastically different numbers of transcripts

0

Entering edit mode

Nicholas • 0

@3611f731

Last seen 19 hours ago

United States

I am currently in the process importing Kallisto data for a mouse RNAseq experiment using the tximport package and am creating tx2gene objects to use for mapping. I decided to try three different annotation packages: (1) org.Mm.eg.db, (2) EnsDb.Mmusculus.v79, and (3) biomaRt. However, when I ran the code (shown below) to extract the transcript to gene mappings, I only returned ~27k transcripts when using org.Mm.eg.db, but returned ~104k when using EnsDb.Mmusculus.v79 and ~280k when using biomaRt. Why are there such drastic differences?

I understand that biomaRt will always be the most up to date since it is pulled directly from Ensembl, but I was curious why org.Mm.eg.db had so few transcripts returned. Although the code below pertains to creating tx2gene objects, my question is more broadly related to annotation packages in general.

Code for getting transcripts with org.Mm.eg.db:

# Using org.Mm.eg.db
keys <- keys(org.Mm.eg.db, keytype = 'ENSEMBLTRANS')

tx2gene.org <- select(org.Mm.eg.db, keys, columns = c('ENSEMBLTRANS', 'SYMBOL'), keytype = 'ENSEMBLTRANS') %>%
  dplyr::rename(target_id = ENSEMBLTRANS, gene_name = SYMBOL)

length(keys)
dim(tx2gene.org)

> length(keys)
[1] 26481
> dim(tx2gene.org)
[1] 26909     2

Code for getting transcripts with EnsDb.Mmusculus.v79:

# Using EnsDb.Mmusculus.v79

tx2gene.ensdb <- transcripts(EnsDb.Mmusculus.v79, columns=c("tx_id", "gene_name")) %>%
  dplyr::as_tibble() %>%
  dplyr::rename(target_id = tx_id) %>%
  dplyr::select(target_id, gene_name)

dim(tx2gene.ensdb)

> dim(tx2gene.ensdb)
[1] 104129      2
`

Code for getting transcripts with biomaRt:

# Using biomaRt 
myMart <- useEnsembl(biomart = 'genes', dataset = 'mmusculus_gene_ensembl', mirror = 'useast')

tx2gene.bm <- getBM(mart = myMart, attributes = c('ensembl_transcript_id_version', 'external_gene_name')) %>%
  as_tibble() %>%
  dplyr::rename(target_id = ensembl_transcript_id_version, 
                gene_name = external_gene_name)

dim(tx2gene.bm)

> dim(tx2gene.bm)
[1] 278375      2

> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomaRt_2.62.0             EnsDb.Mmusculus.v79_2.99.0 org.Mm.eg.db_3.20.0        enrichplot_1.26.5         
 [5] msigdbr_7.5.1              clusterProfiler_4.14.4     gprofiler2_0.2.3           GSVA_2.0.4                
 [9] GSEABase_1.68.0            graph_1.84.0               annotate_1.84.0            XML_3.99-0.18             
[13] gplots_3.2.0               RColorBrewer_1.1-3         d3heatmap_0.9.0            DT_0.33                   
[17] gt_0.11.1                  plotly_4.10.4              cowplot_1.1.3              edgeR_4.4.1               
[21] limma_3.62.1               datapasta_3.1.0            beepr_2.0                  EnsDb.Hsapiens.v86_2.99.0 
[25] ensembldb_2.30.0           AnnotationFilter_1.30.0    GenomicFeatures_1.58.0     AnnotationDbi_1.68.0      
[29] Biobase_2.66.0             GenomicRanges_1.58.0       GenomeInfoDb_1.42.1        IRanges_2.40.1            
[33] S4Vectors_0.44.0           BiocGenerics_0.52.0        tximport_1.34.0            lubridate_1.9.4           
[37] forcats_1.0.0              stringr_1.5.1              dplyr_1.1.4                purrr_1.0.2               
[41] readr_2.1.5                tidyr_1.3.1                tibble_3.2.1               ggplot2_3.5.1             
[45] tidyverse_2.0.0            rhdf5_2.50.1              

loaded via a namespace (and not attached):
  [1] splines_4.4.2               later_1.4.1                 BiocIO_1.16.0               filelock_1.0.3             
  [5] ggplotify_0.1.2             bitops_1.0-9                R.oo_1.27.0                 httr2_1.0.7                
  [9] lifecycle_1.0.4             lattice_0.22-6              magrittr_2.0.3              yaml_2.3.10                
 [13] ggtangle_0.0.6              httpuv_1.6.15               DBI_1.2.3                   abind_1.4-8                
 [17] pkgload_1.4.0               zlibbioc_1.52.0             audio_0.1-11                R.utils_2.12.3             
 [21] RCurl_1.98-1.16             yulab.utils_0.1.9           rappdirs_0.3.3              GenomeInfoDbData_1.2.13    
 [25] ggrepel_0.9.6               irlba_2.3.5.1               tidytree_0.4.6              codetools_0.2-20           
 [29] DelayedArray_0.32.0         DOSE_4.0.0                  xml2_1.3.6                  tidyselect_1.2.1           
 [33] aplot_0.2.4                 farver_2.1.2                UCSC.utils_1.2.0            ScaledMatrix_1.14.0        
 [37] BiocFileCache_2.14.0        matrixStats_1.5.0           base64enc_0.1-3             GenomicAlignments_1.42.0   
 [41] jsonlite_1.8.9              progress_1.2.3              tools_4.4.2                 treeio_1.30.0              
 [45] Rcpp_1.0.13-1               glue_1.8.0                  SparseArray_1.6.0           qvalue_2.38.0              
 [49] MatrixGenerics_1.18.0       HDF5Array_1.34.0            withr_3.0.2                 BiocManager_1.30.25        
 [53] fastmap_1.2.0               rhdf5filters_1.18.0         caTools_1.18.3              digest_0.6.37              
 [57] rsvd_1.0.5                  gridGraphics_0.5-1          timechange_0.3.0            R6_2.5.1                   
 [61] mime_0.12                   colorspace_2.1-2            GO.db_3.20.0                gtools_3.9.5               
 [65] RSQLite_2.3.9               R.methodsS3_1.8.2           utf8_1.2.4                  generics_0.1.3             
 [69] data.table_1.16.4           rtracklayer_1.66.0          prettyunits_1.2.0           httr_1.4.7                 
 [73] htmlwidgets_1.6.4           S4Arrays_1.6.0              pkgconfig_2.0.3             gtable_0.3.6               
 [77] blob_1.2.4                  SingleCellExperiment_1.28.1 XVector_0.46.0              htmltools_0.5.8.1          
 [81] fgsea_1.32.0                ProtGenerics_1.38.0         scales_1.3.0                png_0.1-8                  
 [85] SpatialExperiment_1.16.0    ggfun_0.1.8                 rstudioapi_0.17.1           tzdb_0.4.0                 
 [89] reshape2_1.4.4              rjson_0.2.23                nlme_3.1-166                curl_6.1.0                 
 [93] cachem_1.1.0                KernSmooth_2.23-26          parallel_4.4.2              restfulr_0.0.15            
 [97] pillar_1.10.1               grid_4.4.2                  vctrs_0.6.5                 promises_1.3.2             
[101] BiocSingular_1.22.0         dbplyr_2.5.0                beachmat_2.22.0             xtable_1.8-6               
[105] magick_2.8.5                cli_3.6.3                   locfit_1.5-9.10             compiler_4.4.2             
[109] Rsamtools_2.22.0            rlang_1.1.4                 crayon_1.5.3                plyr_1.8.9                 
[113] fs_1.6.5                    stringi_1.8.4               viridisLite_0.4.2           BiocParallel_1.40.0        
[117] babelgene_22.9              munsell_0.5.1               Biostrings_2.74.1           lazyeval_0.2.2             
[121] GOSemSim_2.32.0             Matrix_1.7-1                patchwork_1.3.0             hms_1.1.3                  
[125] sparseMatrixStats_1.18.0    bit64_4.5.2                 Rhdf5lib_1.28.0             KEGGREST_1.46.0            
[129] statmod_1.5.0               shiny_1.10.0                SummarizedExperiment_1.36.0 igraph_2.1.2               
[133] memoise_2.0.1               ggtree_3.14.0               fastmatch_1.1-6             bit_4.5.0.1                
[137] gson_0.1.0                  ape_5.8-1

AnnotationDbi EnsDb.Mmusculus.v79 biomaRt org.Mm.eg.db • 276 views

ADD COMMENT • link updated 3 hours ago by James W. MacDonald 67k • written 1 day ago by Nicholas • 0

0

Entering edit mode

Just an update on this: I tried tximport with tx2gene objects from the three annotation packages and from the GTF file itself. The best performance was using the GTF file itself, thank you ATpoint for the suggestion. This begs the question: why use annotation packages at all, when the GTF file gives the best performance?

It makes sense that the GTF file would yield the best results as you create the kallisto index from the fasta file that is linked to that GTF. Should this be the standard for annotating kallisto inputs with tximport?

Code creating tx2gene object from GTF file:

# Getting annotations directly from gtf file ----
gtf.mm.ensembl <- rtracklayer::import('Mus_musculus.GRCm39.113.gtf') 

tx2gene.gtf <- mcols(gtf.mm.ensembl)[,c(7,10)] %>%
  as_tibble() %>%
  dplyr::rename(gene_name = gene_name, target_id = transcript_id) %>%
  dplyr::select(target_id, gene_name) %>%
  na.omit() %>%
  distinct(target_id, .keep_all = T)

ADD REPLY • link 1 day ago Nicholas • 0

0

Entering edit mode

Annotation packages have their value. They are a convenient and stable way to make annotations accessible inside R, they are version-controlled etc. YOu have a corner case now in which, together with how you preprocessed your data, got better performance with another annotation source, but that's not the annotation package fault. If you had built your kallisto index with transcripts retrieved in some way via an annotation package then this should give same performance as you experience now with the gtf.

ADD REPLY • link 8 hours ago ATpoint ★ 4.6k

0

Entering edit mode

It's not the version control in R/Bioconductor though. It's the version of the genome. OP used Ensembl build 113 to align and then noted that using an EnsDb based on Ensembl build 79, which is ten years old and based on GRCm38 instead of GRCm39 didn't match well. I can't get the Biomart server to load, but I'd bet a dollar it's still on version 112, which will result in some mismatching as well.

There are more recent vintage EnsDb packages on the AnnotationHub that would be more useful (up to Ensembl build 112).

Given that the genome is a moving target, it's critical to ensure that whatever is used for alignment and whatever is used for downstream annotation are from the same release.

ADD REPLY • link 3 hours ago James W. MacDonald 67k

score 1 · Answer 1 · 2025-01-09

1

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 2 hours ago

United States

The reason you get fewer mappings using the org.Mm.eg.db package is because that package is based on annotations from NCBI, but you are using Ensembl IDs. The two annotation services use different methods to determine where the genes are, how many transcripts each have, etc. Because of that, there isn't very good consistency between NCBI and Ensembl IDs

There is an ongoing project called MANE where NCBI and Ensembl/Genecode are collaborating to find a single transcript for each human gene that they can agree is the same thing, which indicates how difficult it is to map between annotation services.

It has been my longstanding contention around these parts that you should stick with whatever IDs you started with, because cross-mapping is a fool's errand. If you have Ensembl transcripts, stick with an EnsDb or biomaRt.

ADD COMMENT • link 1 day ago James W. MacDonald 67k

1

Entering edit mode

that you should stick with whatever IDs you started with,

Meaning, you got your fasta file for the kallisto index from somewhere, e.g. Ensembl or GENCODE, so I recommend to take the GTF file from there and make the tx2gene from this GTF, ensuring that the annotation version is the same. Eg transcript fasta from GENCODE vM23, then take the vM23 GTF, and none of these annotation packages. This only creates trouble, unless you obtained the transcript sequences exactly from the same package.

ADD REPLY • link 1 day ago ATpoint ★ 4.6k

0

Entering edit mode

I agree that it's difficult to match an EnsDb object to the build, but I think that is orthogonal to OP's issue, which is mapping transcript IDs to gene symbols, and for which an EnsDb or biomaRt will be required.

ADD REPLY • link 1 day ago James W. MacDonald 67k

0

Entering edit mode

Got it, so using the GTF directly rather than any of the annotation packages will yield more accurate results than any of the packages? I got my fasta file from Ensembl.

ADD REPLY • link 1 day ago Nicholas • 0

0

Entering edit mode

Given that the annotation versions were the same the results should be identical, but for such a simple task as making a tx2gene why not simply poulling the gtf. rtracklayer::import can directly download a URL into R giving you a GRanges representation that you can filter easily. I am not saying that the annotation packages were no good or inaccurate, I just say that in some cases there are simpler alternatives.

ADD REPLY • link 9 hours ago ATpoint ★ 4.6k