Some thoughts on the database cleaning project

After reading up on the main 16S databases and how they are constructed, I have several ideas and questions about our project. First of all, we should agree on the general scope of the project. As I understand it, our aim is to complement the existing tools rather than create a database from scratch. Even so, we still need to decide which features to implement and what to include in our project.

Each database team has implemented a few unique features that we might want to consider ourselves. For example, once the taxonomic names have been cleaned up, are we going to perform further cleaning steps? Both Greengenes and RDP are chimera checked, but chimera checking appears to be difficult to implement in practice.[1] The SILVA team does not perform chimera checking, out of concern that it would lead to genuine biological sequences being thrown out.[2]

Each database is aligned, and I assume that we will align ours after the cleaning stage. An alignment might also help us evaluate the quality of our database. This raises several questions: which aligner(s) should we use, and should we follow the method used by the original database or the method that we judge to be the most appropriate? Aligning 16S sequences can be tricky, as the V6 region is more variable than the other hypervariable regions. One approach to this problem is to align the V6 region of each sequence pairwise against a reference.[3]
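To make the pairwise idea concrete, here is a minimal sketch of a global (Needleman-Wunsch) alignment of one sequence fragment against a reference, written in plain Python. It is only an illustration and not the method used by any of the databases: the function name global_align, the scoring values (match, mismatch, gap) and the short example sequences are all assumptions chosen for readability, and a real pipeline would more likely rely on an established aligner.

```python
def global_align(query, reference, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_query, aligned_reference).
    The scoring values are illustrative assumptions, not tuned for 16S data.
    """
    n, m = len(query), len(reference)

    # score[i][j] = best score for aligning query[:i] with reference[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if query[i - 1] == reference[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two bases
                score[i - 1][j] + gap,      # gap in the reference
                score[i][j - 1] + gap,      # gap in the query
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_q, aligned_r = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if query[i - 1] == reference[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_q.append(query[i - 1])
                aligned_r.append(reference[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_r.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_r.append(reference[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(aligned_q)), "".join(reversed(aligned_r))


if __name__ == "__main__":
    # Toy sequences for illustration only; these are not real V6 fragments.
    s, q, r = global_align("ACGT", "AGT")
    print(s)
    print(q)
    print(r)
```

In practice the reference would be a trusted V6 sequence and the query the corresponding region extracted from each database entry; the alignment score (or the number of gaps) could then serve as a rough quality signal for that entry.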

Here is a brief list of some other considerations that have come to mind:

TL;DR and Summary

