Abstract
Motivation
Accurate determination and quantification of the taxonomic composition of microbial communities, especially at the species level, is one of the major issues in metagenomics. This is primarily due to the limitations of commonly used 16S rRNA reference databases, which either contain a lot of redundancy or a high percentage of sequences with missing taxonomic information. This may lead to erroneous identifications and, thus, to inaccurate conclusions regarding the ecological role and importance of those microorganisms in the ecosystem.
Results
The current study presents MIMt, a new 16S rRNA database for the identification of archaea and bacteria, encompassing 47 001 sequences, all precisely identified at species level. In addition, a MIMt2.0 version was created containing only curated sequences from RefSeq Targeted loci, with 32 086 sequences. MIMt is intended to be updated twice a year to include all newly sequenced species. We evaluated MIMt against Greengenes, RDP, GTDB and SILVA in terms of sequence distribution and accuracy of taxonomic assignments. Our results showed that MIMt contains less redundancy and, despite being 20 to 500 times smaller than existing databases, outperforms them in completeness and taxonomic accuracy, enabling more precise assignments at lower taxonomic ranks and thus significantly improving species-level identification.
Introduction
Microorganisms are the most diverse and abundant group of living organisms on Earth. All Bacteria, Archaea, and most lineages of the Eukarya domain belong to this group [35]. They live in almost every type of habitat (aquatic, terrestrial, atmospheric, or living host), and some of them thrive in extremely harsh conditions [29, 35]. Although some species are harmful to certain plants and animals and may cause severe diseases (e.g., tuberculosis, cholera, diphtheria, etc.), most microorganisms provide beneficial services. They are essential for the proper functioning of ecosystems, playing an important role in biogeochemical cycles, aiding in the decomposition process or food fermentation, and making up the human microbiota, which is crucial for the proper functioning of the digestive and immunological systems [35, 39, 41]. However, despite their importance, most species have yet to be characterised, mainly due to the widespread lack of suitable culturing methods [14].
Over the past decade, the development of high-throughput sequencing technologies has expanded and altered microbiologists’ view of microorganisms and how to study them. In particular, metagenomics, the study of genetic material recovered directly from environmental samples [17], has been successfully used to identify previously unrecognized species [9], quantify the taxonomic composition of a microbiome [21, 32], and unravel their interactions and functions in the ecosystem [3], thus improving our knowledge and understanding of microbial diversity without prior cultivation [17, 18].
For bacteria and archaea, the 16S small-subunit ribosomal RNA (16S rRNA) gene has been established as the ‘gold standard’ for identifying and characterizing the diversity of metagenomic samples, as it is present in almost all species, its function is highly conserved, and its size (1550 bp) is large enough for bioinformatic purposes [6, 19]. The presence of hypervariable regions flanked by conserved ones that universal primers can target allows discrimination between different species [6, 19]. However, accurate species-level classification of 16S rRNA gene sequences remains a challenge for microbiome researchers because of the quality and completeness of the most used reference databases [2, 5, 34], namely Greengenes [12], the Ribosomal Database Project (RDP) [11], SILVA [31] and the Genome Taxonomy Database (GTDB) [27]. These databases have several shortcomings, namely gaps in taxonomic coverage, mislabeled sequences, excessive redundancy, and incomplete annotation (especially at low taxonomic levels). In addition, these databases have different underlying taxonomies, and differ in size, taxonomic composition, alignment and classification approaches [1, 5, 13]. For instance, Greengenes provides bacterial and archaeal taxonomy based on automatic de novo tree construction of quality-filtered sequences. Still, despite being one of the historical 16S reference databases, it went 10 years without an update. Moreover, although most sequences have taxonomic information up to the order level, only just over half of them are annotated at family and genus level [23, 26], and less than 15 percent of them have a species taxonomy assigned. The RDP database contains bacterial and archaeal small subunit (SSU) rRNA gene sequences and fungal large subunit (LSU) rRNA gene sequences [11, 22], most of them obtained from the International Nucleotide Sequence Database Collaboration (INSDC; 27). However, this database has not been updated since September 2016.
Taxonomy assignments are obtained using the Naïve Bayesian Classifier [42] and named according to Bergey’s taxonomy [43]. Although most sequences have complete taxonomy up to species level, most are annotated as ‘uncultured’ or ‘unidentified’ taxa [11, 22]. On the other hand, SILVA contains taxonomic information for all three domains of life [30, 31, 44]. It is the only manually curated database, but it has not been updated since 2020. Taxonomic classifications follow Bergey's taxonomy [43] and the List of Prokaryotic Names with Standing in Nomenclature [28]. Initially, this database was designed to store all 16S rRNA sequences present in publicly available works and not to be used as a reference database for microbial identification. For this reason, its distribution was biased. However, the latest version incorporates a non-redundant dataset (Ref NR 99) from which highly identical sequences have been removed [31]. Despite this, most sequences are identified as ‘uncultured’, thus providing no clue about the identity of the existing microorganisms [25]. Finally, GTDB was recently developed to provide a standardized bacterial and archaeal taxonomy based on genome phylogeny [26, 27]. It has been kept up to date until now and most sequences are identified to species level. However, it contains huge redundancy, and the taxonomic classification in some instances employs non-standard definitions (e.g., Pseudomonas_A gstutzeri_C and Pseudomonas_A gstutzeri_H or Luteimonas gsp001717465) that inflate the database's species counts and potentially cause errors in estimation methods (e.g., of the diversity of a sample).
Accurate and reliable taxonomic identification is a critical first step in metagenomic analyses, and it is highly affected by the choice of database. The use of incomplete or biased databases, with the issues previously mentioned, can lead to erroneous interpretations of the community composition and thus to incorrect conclusions regarding the ecological role and importance of those microorganisms in that ecosystem [5, 9, 25]. The aim of the present study was to develop a novel, compact 16S rRNA database for accurate and reliable taxonomic classification of bacteria and archaea. This database excludes sequences not identified at the species level or with a vague taxonomy description, and it is intended to be used as a reference collection in any kind of microbiome study. The performance of this new database, named MIMt (Mass Identification in Metagenomic tests), was compared against state-of-the-art databases (Greengenes, RDP, SILVA and GTDB) in terms of sequence distribution and accuracy of taxonomic assignments.
Material and methods
MIMt construction
To construct MIMt, all representative and reference genomes, as well as the “latest assembly” of every genome belonging to archaea or bacteria, were downloaded from the National Center for Biotechnology Information (NCBI) FTP site (ftp.ncbi.nlm.nih.gov/genomes/). When more than one genome from the same species was available, only the most recent was kept, thus avoiding within-species redundancy.
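The rule of keeping only the most recent genome per species can be illustrated with a minimal Python sketch. The record fields (`species`, `accession`, `released`) and the accession strings are hypothetical placeholders chosen for this example; they do not reflect the actual NCBI file layout or the authors' scripts.

```python
from datetime import date

def latest_assembly_per_species(assemblies):
    """Keep only the most recently released assembly for each species.

    `assemblies` is an iterable of dicts with the hypothetical keys
    'species', 'accession' and 'released' (a datetime.date).
    """
    latest = {}
    for record in assemblies:
        current = latest.get(record["species"])
        # Replace the stored record when this one is newer.
        if current is None or record["released"] > current["released"]:
            latest[record["species"]] = record
    return latest

# Made-up example records (accessions are placeholders, not real assemblies).
assemblies = [
    {"species": "Escherichia coli", "accession": "ASM_A", "released": date(2019, 5, 1)},
    {"species": "Escherichia coli", "accession": "ASM_B", "released": date(2022, 3, 9)},
    {"species": "Bacillus subtilis", "accession": "ASM_C", "released": date(2020, 1, 15)},
]

kept = latest_assembly_per_species(assemblies)
for species, record in sorted(kept.items()):
    print(species, record["accession"])
```

With these toy records, one assembly per species is retained, and the later of the two Escherichia coli entries is the one kept.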
For each retrieved genome, the exact location (start and end position) of 16S rRNA sequences was identified using RNAmmer 1.2 [20], which relies on Hidden Markov Models (HMMs) for both speed and accuracy (Fig. 1). Then, the corresponding sequences were extracted and written to a file to build the 16S MIMt database. The NCBI Taxonomy database [16, 37] was used to assign the complete taxonomy to each sequence. In this database, each taxon is identified with a stable and unique numerical identifier (taxid), linked to its full taxonomic classification [16]. Taxids were obtained by entering the taxon name of each sequence in the NCBI Taxonomy Browser. The full taxonomic information of each 16S microbial sequence was then obtained from the NCBI taxdump file, which comprises the list of all taxids available at NCBI and their corresponding classification. Taxonomy was subsequently formatted by adding the appropriate taxonomic rank prefixes (K__ for Kingdom, P__ for Phylum, C__ for Class, O__ for Order, F__ for Family, G__ for Genus, and S__ for Species), to be comparable with the most used reference databases. Sequences from uncultured or unidentified organisms and those not identified up to species level were removed from the database.
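As an illustration of the formatting and filtering step, the following Python sketch builds a prefixed lineage string and discards entries without a usable species name. It is only a sketch under stated assumptions: the dictionary keys, the ";" separator and the handling of missing intermediate ranks (left empty) are choices made for this example, not details given in the text.

```python
# Rank prefixes as described in the text; the key names are assumptions.
RANK_PREFIXES = [
    ("kingdom", "K__"),
    ("phylum", "P__"),
    ("class", "C__"),
    ("order", "O__"),
    ("family", "F__"),
    ("genus", "G__"),
    ("species", "S__"),
]

EXCLUDED_TERMS = ("uncultured", "unidentified")

def format_lineage(lineage):
    """Return a prefixed lineage string, or None if the entry is excluded.

    Entries are excluded when the species is missing or when its name
    contains 'uncultured' or 'unidentified'. Missing intermediate ranks
    are kept as empty fields (an assumption of this sketch).
    """
    species = lineage.get("species", "").strip()
    if not species:
        return None
    if any(term in species.lower() for term in EXCLUDED_TERMS):
        return None
    return ";".join(prefix + lineage.get(rank, "") for rank, prefix in RANK_PREFIXES)

ecoli = {
    "kingdom": "Bacteria",
    "phylum": "Pseudomonadota",
    "class": "Gammaproteobacteria",
    "order": "Enterobacterales",
    "family": "Enterobacteriaceae",
    "genus": "Escherichia",
    "species": "Escherichia coli",
}
print(format_lineage(ecoli))
print(format_lineage({"kingdom": "Bacteria", "species": "uncultured bacterium"}))
```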
Recently, MIMt2.0 was built following a different strategy to ensure that all sequences, in addition to having complete taxonomy up to the species level, are checked and manually curated at all taxonomic levels. MIMt2.0 contains all the sequences from the RefSeq Targeted loci 16S ribosomal RNA project belonging to both Archaea and Bacteria, and it is complemented with 16S sequences predicted with RNAmmer from complete genomes deposited at RefSeq for the species missing from Targeted loci.
The lists of species composing the final versions of MIMt and MIMt2.0 were compared, and all those present in MIMt2.0 (curated) but absent from MIMt were included in the non-curated version. Thus, in the end, MIMt contains the 16S sequences from all known species, while MIMt2.0 lacks 16S sequences from some species (not yet curated), but all of its sequences are manually curated at all taxonomic levels by RefSeq. The complete set of sequences for all species in MIMt was used to generate an alignment with MAFFT and a phylogenetic tree with FastTree through the function align-to-tree-mafft-fasttree in QIIME2 (version 2023.5). The phylogenetic tree was finally rendered using Empress [8] and its QIIME2 plugin (Additional file 1).
Benchmarked reference databases for 16S sequence classification
The most widely used reference databases (Greengenes, RDP, SILVA and GTDB) were used to compare their performance in classifying 16S sequences against both MIMt versions. These databases are built in two standard formats: (1) including the taxonomy in the sequence file, such as RDP, SILVA and GTDB; or (2) with the taxonomy in a different file associated with the sequence ID, such as Greengenes. The recent Greengenes2 version [24] was used. This version has been designed mainly to be used through a QIIME2 plugin, but Fasta and taxonomy files are also available. From the RDP database, two files from release 11, one containing archaea (“current_Archaea_unaligned.fa.gz”) and another containing bacteria (“current_Bacteria_unaligned.fa.gz”), were downloaded and compiled into a single file. From the SILVA database, the non-redundant version of the SSU Ref dataset (“SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz”), recommended as the reference dataset for rRNA classification [31], was downloaded. Finally, for the GTDB database, the file “ssu_all_r214.fna”, which includes both the sequences and the taxonomy, was used.
Species distribution in the databases
Based on the taxonomic classification of every sequence in the databases, the number of species was estimated. In addition, sequences with valid taxonomic information at each level were counted to estimate how annotation efficiency changes as we move down the taxonomy. Thus, databases populated with sequences identified only by the environment from which they were isolated will fail to classify sequences at the lower levels, such as genus and species.
The redundancy of the databases was also calculated based on the number of sequences belonging to each species they contain. Considering that a bacterial genome contains from one to 15 16S rRNA molecules, with an average of 4.2 molecules per genome [40], some redundancy is expected simply because of the nature of the sequence itself. Nevertheless, any number higher than that indicates that the database contains repeated sequences from different isolates or cultures that do not provide extra information and reduce efficiency in classifying species. To verify this bias, all sequences belonging to the same species were grouped and the species were ordered from the most to the least represented. Then, following this order, the percentage of the total number of sequences that each species represents was calculated.
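The ordering described above can be turned into a simple cumulative measure: how many of the most represented species are needed to reach a given share of the sequences (the Results report this for a 75% share). The Python sketch below shows one way to compute it. The function names and the toy labels are assumptions for illustration only and are not taken from the authors' code.

```python
from collections import Counter

def species_needed_for_share(species_labels, share=0.75):
    """Number of top-ranked species needed to cover `share` of all sequences.

    `species_labels` holds one species name per database sequence.
    """
    counts = Counter(species_labels)
    total = sum(counts.values())
    running = 0
    # most_common() yields species from the most to the least represented.
    for n_species, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running / total >= share:
            return n_species
    return len(counts)

def mean_sequences_per_species(species_labels):
    """Average number of sequences per species in the database."""
    counts = Counter(species_labels)
    return sum(counts.values()) / len(counts)

# Toy database: 10 sequences, heavily skewed towards species "A".
labels = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
print(species_needed_for_share(labels))      # species needed for 75% of sequences
print(mean_sequences_per_species(labels))    # sequences per species on average
```

A database with little redundancy needs a large fraction of its species to reach the chosen share, whereas a highly redundant one reaches it with only a few species.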
The list of species present in each database was used to evaluate the similarity between the different databases through a Venn diagram using the library VennDiagram [10] in RStudio v. 2022.12.0 + 353 [33]. The species shared by all or some of the databases give an idea of the specificity of each database when assigning the taxonomy.
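The comparison behind such a Venn diagram reduces to set operations on the species lists. The sketch below uses plain Python sets rather than the R VennDiagram package named in the text, and the database names and species lists are toy values chosen for illustration.

```python
def compare_species_sets(species_by_db):
    """Return (species shared by all databases, species exclusive to each).

    `species_by_db` maps a database name to an iterable of species names.
    """
    sets = {name: set(species) for name, species in species_by_db.items()}
    shared = set.intersection(*sets.values()) if sets else set()
    exclusive = {}
    for name, own in sets.items():
        # Union of the species found in every other database.
        others = set().union(*(s for other, s in sets.items() if other != name))
        exclusive[name] = own - others
    return shared, exclusive

# Toy species lists (not the real database contents).
toy = {
    "db_a": ["Escherichia coli", "Bacillus subtilis", "Tumebacillus avium"],
    "db_b": ["Escherichia coli", "Bacillus subtilis"],
    "db_c": ["Escherichia coli", "Faecalibacterium prausnitzii"],
}
shared, exclusive = compare_species_sets(toy)
print(sorted(shared))
print({name: sorted(species) for name, species in exclusive.items()})
```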
Analysis of redundancy in MIMt database
A redundancy analysis was carried out on the sequences included in the database to estimate the reliability of taxonomic assignments performed by MIMt when using the variable regions most commonly used in 16S metagenomic analyses. To this end, the regions V1-V3, V3-V5, V4 and V6-V9 were extracted from the MIMt database, creating four different databases containing only these regions for all sequences contained in MIMt. Every sequence in each of these databases was compared with all the other sequences in it, excluding those comparisons that did not cover at least the full length of the shorter sequence of the pair. For each pair of sequences compared, the percentage of identity between them and their taxonomic classification at all levels were obtained.
Taking similarity values of 98, 95 and 90% as the reference thresholds for delimiting species, genera and families, respectively, each comparison was marked as true positive (TP), false positive (FP), true negative (TN) or false negative (FN), depending on whether the percentage of identity was consistent with the taxonomic classification at each level.
Thus, a pair of sequences with 96% similarity that share the same genus represents a TP at the genus level, whereas if the two sequences also share the same species annotation, the pair represents an FN at the species level.
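The labelling rule can be written compactly. The following Python sketch uses the 98, 95 and 90% thresholds given in the text. Whether a pair exactly at the threshold counts as above it is not stated, so the inclusive comparison here is an assumption, as are the function and variable names.

```python
# Identity thresholds (in %) used as reference for each taxonomic level.
THRESHOLDS = {"species": 98.0, "genus": 95.0, "family": 90.0}

def label_pair(identity, same_taxon, rank):
    """Label one pairwise comparison as TP, FP, TN or FN at a given rank.

    identity   -- percentage of identity between the two sequences
    same_taxon -- True if both sequences share the annotation at `rank`
    rank       -- 'species', 'genus' or 'family'
    """
    above = identity >= THRESHOLDS[rank]  # inclusive threshold: an assumption
    if above and same_taxon:
        return "TP"
    if above:
        return "FP"
    if same_taxon:
        return "FN"
    return "TN"

# The example from the text: 96% identity, same genus and same species.
print(label_pair(96.0, True, "genus"))    # consistent at the genus level
print(label_pair(96.0, True, "species"))  # below the species threshold
```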
Accuracy test on taxonomy assignment
The accuracy of the taxonomic assignment using each database was assessed using the MBARC-26 dataset, which is composed of 16S sequences from 23 species of bacteria and 3 of archaea, with different size ranges, GC content and number of copies of 16S rRNA molecules [38]. This mock community is composed of prokaryotes isolated from heterogeneous environments such as soil and water, or derived from host interaction with human, bovine and frog. The specific datasets used in this experiment are composed of long reads from PacBio SMRT (accession SRR3656744) and paired-end short reads from Illumina (accession SRR3656745). Since the Illumina dataset is composed of almost 174 million reads, a subsample of 30 thousand reads was extracted for the experiment with the tool seqtk (https://github.com/lh3/seqtk/). To minimize interference in the final result, sequences from the mock were not pre-processed. They were used directly as input for the Mothur algorithm classify.seqs, thereby eliminating bias introduced by the algorithm or by data processing. The performance of the RDP database could not be tested since the size of the database required more than 512 GB of memory. Greengenes2, being seven times bigger than RDP, required more than 1 TB of memory. Therefore, for Greengenes2, only the sequences belonging to the backbone composed of full-length 16S were used in this study, excluding the V4 short sequences [24].
Classification performance test
To assess MIMt’s performance against existing databases, we replicated the analyses of two previously published studies [15, 21] based on two completely different scenarios. The first included samples from colon, cloaca, and magnum from 34-week-old laying chickens (accession number: PRJNA604381), and the second focused on microorganism biodiversity in compost from different sources (accession number: MG-RAST mgp94523). Only one replicate per condition was used: 1) cloaca sample (SRR10998526), magnum sample (SRR10998535), and colon sample (SRR10998554) from the chicken gut microbiota study, and 2) vegetable waste (sample 3 rep. 1), “alpeorujo” (sample 2 rep. 1), sewage sludge (sample 3 rep. 2), agro waste (sample 1 rep. 3) and urban solid waste (sample 2 rep. 3) from the compost microbiota study.
These datasets were processed using three popular algorithms in metagenomic analysis: QIIME2 [4], DADA2 [7], and Mothur [36]. When using QIIME2, sequences were denoised using DADA2, quality filtered, clustered into operational taxonomic units (OTUs), and subsequently assigned to a reference taxonomy. In the case of DADA2, the protocol consisted of filtering and trimming of the sequences, learning the error rates, sample inference, merging of paired reads (not applicable to the chicken study, which used single-end sequencing), and taxonomy assignment of the amplicon sequence variants (ASVs) obtained. Finally, the protocol followed when using Mothur consisted of transforming the Fastq files to Fasta and then assigning the taxonomy to each sequence using the function classify.seqs (Fig. 2). All the analyses were carried out on a server with 64 cores and 256 GB of RAM.
The analysis of the results considered the sequences annotated using each database, the number of distinct species identified, and the time spent on taxonomic identification. In some cases, the taxonomic assignment was performed on OTUs or ASVs, while in the case of Mothur it was performed directly on each of the sequences of the original dataset. Therefore, the results had to be normalized to the final number of sequences to be annotated after the pre-processing.
For the final count of classified sequences, all values showing simply a vague description of the sequence or internal codes were excluded (e.g., marine bacterium, unclassified Pseudomonadales, uncultured archaeon, and sp22811).
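A filter of this kind could be approximated with a few regular expressions. The sketch below is only a heuristic built around the four examples given in the text (marine bacterium, unclassified Pseudomonadales, uncultured archaeon, sp22811). The patterns and the function name are assumptions and do not claim to reproduce the exact exclusion criteria used in the study.

```python
import re

# Heuristic patterns for uninformative labels (assumed, not from the study).
VAGUE_PATTERNS = [
    re.compile(r"\b(uncultured|unclassified|unidentified)\b", re.IGNORECASE),
    re.compile(r"\b(bacterium|archaeon)\b", re.IGNORECASE),
    re.compile(r"^sp\.?\s*\d+$", re.IGNORECASE),  # internal codes such as sp22811
]

def is_informative(label):
    """Return True if a species-level label looks like a real species name."""
    label = label.strip()
    if not label:
        return False
    return not any(pattern.search(label) for pattern in VAGUE_PATTERNS)

labels = [
    "Escherichia coli",
    "marine bacterium",
    "unclassified Pseudomonadales",
    "uncultured archaeon",
    "sp22811",
    "Agrobacterium tumefaciens",
]
counted = [label for label in labels if is_informative(label)]
print(counted)
```

The word-boundary anchors keep genus names that merely contain "bacterium", such as Agrobacterium, from being discarded.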
Results
Database composition
To construct the new 16S reference database for microorganism identification, a total of 23 780 genomes, 851 belonging to archaea and 22 929 from bacteria, were downloaded from NCBI. A total of 47 001 16S rRNA molecules were identified along the genomes, which became part of the complete MIMt database (named MIMt_24_4 to differentiate from the only curated version named MIMt2.0; Table 1). The number of 16S molecules across species varies between 1 and 37 in Tumebacillus avium, with an average of 2.18, which indicates that MIMt lacks redundancy.
Every single sequence in the MIMt database was identified at species level (Fig. 3) through the NCBI Taxonomy. Moreover, most sequences have taxonomic information at all classification levels (Fig. 3); class and order are the levels with the lowest percentage of sequences assigned (99.25 and 99.39%, respectively), because some species lack taxonomy at intermediate levels.
The results show that the other databases are also consistent at higher taxonomic levels (kingdom, phylum, and class), with around 70% of the sequences taxonomically annotated (Fig. 3). However, in Greengenes2, RDP and SILVA, less than 60% of the sequences have valid taxonomic information at genus level, and the percentage drops below 20% at the species-level (Fig. 3). GTDB is the only database that maintains high levels of sequences with valid taxonomy at genus and species levels, with almost 96 and 90% of sequences annotated, respectively (Fig. 3; Additional file 2). It is worth mentioning that RDP and SILVA databases contain most of their sequences annotated at species level. However, many of them contain terms like “Taxon”, “genomosp.”, “endosymbiont”, thus being considered as uninformative. Greengenes2 is the database where the proportion of taxonomically annotated sequences falls most drastically below 70% even at the class level. In addition, since Greengenes2 has collected sequences from GTDB, it has also inherited their particular taxonomy.
On the other hand, the analysis of sequence distribution at species level revealed that MIMt, as expected, contains minimal redundancy, which is a direct result of the number of 16S molecules in each genome (Fig. 4, Additional file 3). The results show that GTDB and Greengenes2 are the databases with the highest redundancy (Fig. 4). These two databases show a similar tendency, reaching 75% of the sequences annotated at species level with only 203 and 693 unique species, respectively. Escherichia coli was the most represented species in GTDB with 87 428 sequences, and Faecalibacterium prausnitzii was the most frequent one in Greengenes2 with 71 126 sequences (percentages in the graph were normalised to the total sequences annotated at species level). The RDP database contains somewhat less redundancy (Fig. 4), but 75% of its sequences belong to 735 species out of the 11 999 it contains. Finally, for the SILVA database, 75% of the sequences belong to 2 065 species out of the 15 644 it contains in total.
Another essential aspect to analyze is the representation of species in the different databases, which allows us to infer whether all of them share the same species or whether some species are exclusive to one of them. Our results showed that most of the species (7 006) are shared by all databases (Fig. 5). The species not present in MIMt are explained by the fact that most of their sequences come from 16S targeted sequencing, while the whole genome is not yet available.
Redundancy test on MIMt database
The redundancy of the MIMt database for each of the variable regions under study was estimated by calculating sensitivity and specificity separately for each taxonomic level (species, genus and family).
Matches between pairs that had the same taxonomic classification at the level considered and a similarity higher than the threshold established for that level were considered true positives.
False positives were those matches that had a similarity higher than the threshold established for that level but did not share the same taxonomic annotation.
True negatives were those comparisons that did not share the taxonomic annotation and had a similarity below the threshold established for the level studied.
False negatives were those matches that shared the same taxonomic annotation but had a similarity lower than the threshold established for that taxonomic level.
Under these assumptions and following the formulas of sensitivity and specificity, the results shown in Table 2 were obtained.
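For reference, the standard definitions are sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The short Python sketch below computes both from pair counts. The counts used in the example are invented for illustration and are not the values reported in Table 2.

```python
def sensitivity(tp, fn):
    """Fraction of same-taxon pairs found above the identity threshold."""
    return tp / (tp + fn) if (tp + fn) else float("nan")

def specificity(tn, fp):
    """Fraction of different-taxon pairs found below the identity threshold."""
    return tn / (tn + fp) if (tn + fp) else float("nan")

# Invented counts for one region and one taxonomic level (not from Table 2).
counts = {"TP": 90, "FN": 10, "TN": 80, "FP": 20}
sens = sensitivity(counts["TP"], counts["FN"])
spec = specificity(counts["TN"], counts["FP"])
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```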
Accuracy on mock sample
To evaluate the accuracy of each database in the taxonomic assignment, a classification analysis was performed on a controlled sample with well-identified species. Both long-reads from PacBio sequencing and short reads from Illumina were used to test the 16S rRNA sequence classification. The mock community used in this analysis was MBARC-26 (SRR3656744 and SRR3656745) and the classification software was Mothur, with the option classify.seqs. This method was chosen because it requires less sequence manipulation to directly match the input sequence against the provided database.
The evaluation of the results consisted of counting how many of the species in the mock were correctly identified at the genus and/or species level (Fig. 6).
Taxonomic classification test
To establish the actual taxonomic allocation capacity of MIMt, 16S rRNA data from two studies already published were analyzed using QIIME2 (Fig. 7), DADA2 (Fig. 8) and Mothur (Fig. 9). The datasets include data from different sources such as poultry intestine, differentiating between cloaca (A), magnum (B), and colon (C); and compost microbiota from vegetable waste (D), “alpeorujo” (E), sewage sludge (F), agricultural waste (G) and urban solid waste (H). In each case, the time it took to classify the datasets, the number of sequences annotated at the species level and the number of species identified were measured.
The results show that both versions of the MIMt database yielded the fastest classification times, followed by GTDB, Greengenes2, SILVA and RDP (Figs. 7, 8 and 9). The time spent on the analyses was proportional to the database's size and was also influenced by each database's annotation structure. For instance, GTDB and SILVA have a similar number of sequences, but since GTDB sequences are better annotated at all levels, classification with GTDB took less time. MIMt showed an exceptional balance between the time spent on classifying the datasets and the percentage of sequences annotated at the species level, becoming the database with the largest number of sequences classified and the highest number of species identified (Fig. 10; Additional files 4–6).
Discussion
MIMt is a compact, non-redundant 16S rRNA database with all its sequences annotated to the species level. Despite its smaller size compared to other databases, MIMt has demonstrated very good performance in metagenomic sequence classification. In the mock test under controlled conditions, MIMt showed a high level of accuracy, classifying the majority of species present in the sample and failing to annotate only those species that were present in a very low proportion. On the other hand, in the case studies considered, MIMt successfully classified over 50% more sequences than any other database (Fig. 10). The taxonomic classification tests carried out in this study show MIMt's good performance in all conditions, making it particularly helpful for the species-level characterization of a metagenomic sample. This is particularly useful in detecting pathogenic microorganisms or in diversity tests (alpha or beta), where incomplete or ambiguous descriptions in the taxonomy can lead to inaccurate conclusions. When a pathogenic microorganism or, on the contrary, a beneficial agent (as in fecal transplants) needs to be identified in a sample, it is of great importance that, apart from obtaining a good match against a sequence in the database, that sequence is properly annotated; otherwise, uncertainty about the nature of the microorganism would invalidate the result.
Each sequence in MIMt represents a complete 16S molecule that encompasses both variable and constant regions. The full version of MIMt includes only molecules from each representative species, which ensures that sequencing of any part of the molecule can be covered while avoiding redundant sequences; this entails a substantial decrease in the time required to classify a dataset. The curated version of MIMt contains only curated sequences coming from the Targeted loci and genomes sections of RefSeq, which makes it quite different from the non-curated MIMt, not only in the way the sequences are obtained but also in the results. The full version of MIMt has all 16S molecules of every single species, whereas MIMt2.0 contains only manually curated sequences, but in many cases not all the 16S molecules of every species are represented (e.g., E. coli is known to have 7 copies of 16S; however, only three copies are found in Targeted loci, belonging to different strains, which makes it difficult to determine whether they belong to the same or to different loci).
It should be highlighted that MIMt contains more than 1 600 species of bacteria and archaea that are not present in any of the other databases analysed (Fig. 5). Furthermore, every single sequence in MIMt comes from a sequenced organism identified and deposited in the NCBI database. This approach avoids the inclusion of isolated fragments corresponding to 16S molecules that may be taxonomically misallocated due to contamination.
The results obtained by the two versions of MIMt show a good performance in terms of time invested in classifying and the number of sequences identified compared to the other databases. If we compare the performance between them we can see how in some cases the results are better on the curated version while in other cases the full version gets more sequences identified a priori.
When taxonomically classifying a sequence, better results are obtained when the database contains a sequence that, although less similar to the query, is well annotated at all levels, than when it contains exactly the same sequence with a vague and incomplete annotation. Traditionally, great effort has been put into making databases grow, without much emphasis on an aspect as crucial as quality. When a database used to assign taxonomy grows in number of sequences, but these lack information at low levels or carry merely descriptive information without real taxonomy, the search time increases with the database size, and the only benefit is that a few more sequences are classified with just a vague description. This prevents these sequences from being annotated with another sequence in the database that, although less similar, does have taxonomic information. Moreover, as the number of sequences in the database increases, it becomes easier to find cross-species identities, in which case no taxonomic assignment is made due to ambiguity. A larger database does increase the probability of finding the same sequence, or one from the same species. Still, the likelihood of finding a cross-species match increases significantly more, resulting in a loss of classification efficiency.
Based on the good results demonstrated by MIMt in this work, we present this new database as a powerful tool in identifying metagenomic samples that consumes very few computational resources for a fast and efficient taxonomic classification.
Availability of data and materials
MIMt is freely available for non-commercial purposes at https://mimt.bu.biopolis.pt. Additional files are available at BioStudies under accession S-BSST1244 (https://www.ebi.ac.uk/biostudies/studies?query=S-BSST1244).
References
Balvociute M, Huson DH. SILVA, RDP, greengenes, NCBI and OTT—how do these taxonomies compare? BMC Genom. 2017;18(Suppl 2):114.
Bengtsson-Palme J, et al. Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics. 2016;16:2454–60.
Bohan DA, et al. Next-generation global biomonitoring: large-scale, automated reconstruction of ecological networks. Trends Ecol Evol. 2017;32:477–87.
Bolyen E, et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol. 2019;37:852–7.
Breitwieser FP, Lu J, Salzberg SL. A review of methods and databases for metagenomic classification and assembly. Brief Bioinf. 2019;20:1125–39.
Boughner LA, Singh P. Microbial ecology: where are we now? Postdoc J. 2016;4:3e17.
Callahan BJ, et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods. 2016;13(7):581–3.
Cantrell K, et al. EMPress enables tree-guided, interactive, and exploratory analyses of multi-omic data sets. Microb Ecol. 2021;6:2.
Chalita M, et al. Improved metagenomic taxonomic profiling using a curated core gene-based bacterial database reveals unrecognized species in the genus Streptococcus. Pathogens. 2020;9:204.
Chen H, Boutros PC. VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R. BMC Bioinf. 2011;12:35.
Cole JR, et al. Ribosomal database project: data and tools for high throughput rRNA analysis. Nucleic Acids Res. 2014;42:D633–42.
DeSantis TZ, et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol. 2006;72:5069–72.
Dueholm MS, et al. Generation of comprehensive ecosystems-specific reference databases with species-level resolution by high-throughput full-length 16S rRNA gene sequencing and automated taxonomy assignment (AutoTax). mBio. 2020;11(5):e01557-20.
Epstein SS. The phenomenon of microbial uncultivability. Curr Opin Microbiol. 2013;16:636–42.
Estrella-González MJ, et al. Uncovering new indicators to predict stability, maturity and biodiversity of compost on an industrial scale. Bioresour Technol. 2020;313: 123557.
Federhen S. The NCBI taxonomy database. Nucleic Acids Res. 2012;40:D136–43.
Ghosh A, Mehta A, Khan AM. Metagenomic analysis and its applications. J Bioinf Comput Biol. 2019;3:184–93.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68:669–85.
Janda JM, Abbott SL. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls. J Clin Microbiol. 2007;45:2761–4.
Lagesen K, et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35:3100–8.
Lee SJ, et al. Comparison of microbiota in the cloaca, colon, and magnum of layer chicken. PLoS ONE. 2020;15(8): e0237108.
Maidak BL, et al. The RDP-II (Ribosomal Database Project). Nucleic Acids Res. 2001;29:173–4.
McDonald D, et al. An improved greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J. 2012;6:610–8.
McDonald D, et al. Greengenes2 unifies microbial data in a single reference tree. Nat Biotechnol. 2024;42:715–8.
Park SC, Won S. Evaluation of 16S rRNA databases for taxonomic assignments using mock community. Genomics Inf. 2018;16: e24.
Parks DH, et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36:996–1004.
Parks DH, et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 2021;50(D1):D785–94.
Parte AC. LPSN—list of prokaryotic names with standing in nomenclature. Nucleic Acids Res. 2014;42(D1):613–6.
Poli A, et al. Microbial diversity in extreme marine habitats and their biomolecules. Microorganisms. 2017;5(2):25.
Pruesse E, et al. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007;35:7188–96.
Quast C, et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013;41:D590–6.
Rosado D, et al. Disruption of the skin, gill, and gut mucosae microbiome of gilthead seabream fingerlings after bacterial infection and antibiotic treatment. FEMS Microbes. 2023;4:xtad011.
RStudio Team. RStudio: Integrated Development for R. RStudio, PBC, Boston, MA. http://www.rstudio.com/ (2020).
Santamaria M, et al. Reference databases for taxonomic assignment in metagenomics. Brief Bioinf. 2012;13:682–95.
Sattley WM, Madigan MT. Microbiology. eLS 2015; 1–10.
Schloss PD, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75:7537–41.
Schoch CL, et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database. 2020; baaa062.
Singer E, et al. Next generation sequencing data of a defined microbial mock community. Sci Data. 2016;3(1): 160081.
Vaitilingom M, et al. Contribution of microbial activity to carbon chemistry in clouds. Appl Environ Microbiol. 2010;76:23–9.
Větrovský T, Baldrian P. The variability of the 16S rRNA gene in bacterial genomes and its consequences for bacterial community analyses. PLoS ONE. 2013;8(2): e57923.
Wang B, et al. The human microbiota in health and disease. Engineering. 2017;3:71–82.
Wang QG, et al. Naïve bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007;73:5261–7.
Whitman WB. Bergey’s Manual of Systematics of Archaea and Bacteria. New Jersey, USA: Wiley Online Library, 2015.
Yilmaz P, et al. The SILVA and “All-species living tree project (LTP)” taxonomic frameworks. Nucleic Acids Res. 2014;42:D643–8.
Acknowledgements
This work was supported by the European Union's Horizon 2020 Research and Innovation Programme under the Grant Agreement Number 857251.
Funding
Work co-funded by the project NORTE-01-0246-FEDER-000063, supported by Norte Portugal Regional Operational Programme (NORTE2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF). This study was supported by the "Contrato-Programa" (UIDB/04050/2020) funded by national funds through the FCT I.P. (https://doi.org/10.54499/UIDB/04050/2020).
Author information
Contributions
M.P.C. created the database and performed the analysis. N.A.F. supervised the work and helped in the analysis methodology. A.M.M. conceived the idea, performed the analysis and created the database webpage. All authors wrote the main manuscript.
Ethics declarations
Ethics approval and consent to participate
All authors are aware of the work developed in this manuscript and consent to participate.
Consent for publication
All authors give their consent to publish all the information that appears in this manuscript.
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Cabezas, M.P., Fonseca, N.A. & Muñoz-Mérida, A. MIMt: a curated 16S rRNA reference database with less redundancy and higher accuracy at species-level identification. Environmental Microbiome 19, 88 (2024). https://doi.org/10.1186/s40793-024-00634-w