Impact of Training Sets · Kylepedia

Impact of Training Sets


Title: Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys

Citation: Werner, Jeffrey J., et al. “Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys.” The ISME Journal 6.1 (2012): 94-103.


This paper examines how 16S training sets affect classifier performance. Specifically, the authors evaluated the effect of the number of reference sequences, the taxonomy used, and reference trimming on the classification of sequences from 5 separate 16S sequencing projects. Unfortunately, none of the datasets had a known composition (e.g., simulated or spiked samples), so classification accuracy could not be measured directly.

Background Notes

A past study demonstrated that the RDP classifier, Simrank, and DNADIST performed best at classifying sequences against the Greengenes database [1]. Tools such as the RDP classifier don’t use positional information, so trimming the reference sequences was expected to improve performance.
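To illustrate what "no positional information" means, here is a minimal Python sketch. The RDP classifier is a naive Bayesian classifier that works on overlapping 8-base words; the sketch below only reproduces that word decomposition, not the classifier itself. The sequences are made up for illustration, and the helper names (`kmer_set`, `shared_fraction`) are hypothetical. It shows that a read matches a full-length and a trimmed reference equally well by shared words, while the full-length reference also carries words from regions the read never covers.

```python
def kmer_set(seq, k=8):
    """Return the set of overlapping k-mers (words) in seq, ignoring position."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(read, reference, k=8):
    """Fraction of the read's k-mers that occur anywhere in the reference."""
    read_words = kmer_set(read, k)
    if not read_words:
        return 0.0
    return len(read_words & kmer_set(reference, k)) / len(read_words)


# Toy sequences (invented for illustration, not real 16S data).
region = "ACGTTGCAAGGCTTAACCGGATCC"            # stands in for the amplified region
full_reference = "TTTTGGGGCCCCAAAA" + region + "GATTACAGATTACAGA"
trimmed_reference = region                     # reference cut down to that region
read = region[2:20]                            # a short read from inside the region

# The read's words are found in both references, wherever they sit.
print(shared_fraction(read, full_reference))
print(shared_fraction(read, trimmed_reference))

# The untrimmed reference holds extra words that no read from this region can contain.
print(len(kmer_set(full_reference)) - len(kmer_set(trimmed_reference)))
```

Because the words are stored as a set, where a word occurs in the reference is discarded, which is presumably why extra flanking sequence in an untrimmed reference was expected to affect the word statistics the classifier is trained on.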

Technical Details



The authors recommended using larger and more diverse reference databases. Whether to trim the reference sequences, in contrast, was left to the researcher's discretion. Potential downsides of trimming are that some regions may be harder to trim than others, and that data from samples with different amplified regions become harder to manage.
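Since the paper does not describe how the references were trimmed, the following Python sketch shows only one plausible approach and is not the authors' method: locate the forward primer and the reverse complement of the reverse primer in each reference, then keep the sequence between them. The primers and the reference below are toy values invented for the example, and the function names are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a primer with IUPAC codes into a regex pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def trim_reference(reference, forward_primer, reverse_primer):
    """Return the region between the primers, or None if either is not found."""
    reference = reference.upper()
    fwd = re.search(primer_regex(forward_primer), reference)
    if fwd is None:
        return None
    downstream = reference[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(reverse_primer)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]


# Toy reference and primers (hypothetical, not real 16S primers).
toy_reference = "TTTT" + "GATTACA" + "CCCCGGGGAAAA" + "TTAACCGG" + "ACAC"
print(trim_reference(toy_reference, "GATTACM", "CCGGTTAA"))
```

Returning `None` when a primer site is missing reflects the first downside noted above: references that lack a recognizable primer site, or that are truncated before it, cannot be trimmed this way and would need separate handling. A separate trimmed reference set would also be needed for each amplified region.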


1: Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers

2: The methods section gave little detail on the tools and procedure used to trim the reference sequences. The results mention that QIIME was used to process sequences, but the exact commands were not given. Knowing how this was done would have made it possible to try the same approach.