Kylepedia · Electronic Lab Notebook

Type Material in the NCBI Taxonomy Database

Details

Title: Type Material in the NCBI Taxonomy Database

Overview

Overview of type material in the NCBI Taxonomy database: the metadata stored, how type material is identified from other resources, how it can be used for quality control, etc.

Background Notes

  • Sequence from type
    • Should have highly reliable taxonomic identifications
    • Feasible to keep type material information current
  • NCBI curation:
    • NCBI provides synonymy updates, but relies on submitters for original identification and annotation
      • User annotation leads to errors
        • NCBI suppresses or flags “egregious” errors
        • Type material can help find some of these errors
          • “Backbone” of reliability
      • A major problem is taxa that are described prior to receiving a valid name
        • Original entries are rarely updated

Technical Details

The paper is mainly an overview of the resource, but the authors do describe the results of a few quality control measures. These include:

Different Codes

  • Zoological code
    • Does not regulate names above the family level
    • Fungi dual nomenclature system:
      • Different names were used for asexual and sexual forms of the same species
      • No longer in use, but it will take time to correct existing resources
  • Prokaryotes
    • Valid names published in IJSEM
      • New species require a type strain to be deposited in two separate culture collections
        • These strains are frequently exchanged between different collections, and are referred to as being “co-identical”
    • Names not in IJSEM:
      • “Effectively” published, not considered valid
    • Names lacking cultured specimens:
      • Candidatus
      • May still have genomes in GenBank (from single-cell sequencing, metagenome assembly)
  • Botanical code:
    • Traditionally governs cyanobacteria, even though they are bacteria
  • Viruses:
    • No code of nomenclature
    • Annual list of approved species names
      • Names are proposed to a committee rather than published in literature
    • No type material

Species in GenBank

  • Variety of sources
    • Cultures:
      • Some of these provide species level names
    • Others:
      • May be given informal names
        • 66% of sequences in the Taxonomy database
  • Unpublished names appear in search and retrieval but not on public pages

Type Material in the Taxonomy Database

  • Type material is tracked and flagged at species or subspecies level nodes
  • Source for type data:
    • For prokaryotes:
      • IJSEM and Bergey’s manual are frequently used
        • Not in computer readable form
      • StrainInfo is the largest collection of type data
      • LPSN is more reliable than StrainInfo, and is the main reference
      • NamesForLife is smaller than LPSN, but is considered the most accurate
        • Not used by NCBI due to licensing costs
      • Prokaryotic Nomenclature Up-to-date provides the most comprehensive machine-readable data
    • For fungi:
      • Mycobank
      • Index Fungorum
      • Other sources include culture collections and museums
    • Other eukaryote microbes:
      • Fewer organisms characterized
      • Resources such as ZooBank and IPNI may be used in the future
  • Well-characterized organisms not meeting type requirements:
    • May be designated as “reference” specimens
  • Type material can be used for quality control

Misc

  • RefSeq Targeted Locus Reference Sets
    • Curated sets of full-length reference sequences from type material for ribosomal RNAs
  • Prokaryote phylogeny:
    • NCBI provides two trees:
      • ‘K-mer tree’ based on 28-mers
      • ‘Marker tree’ based on protein BLAST scores for ribosomal proteins
      • Type strains can be used to help identify taxa with incorrect identifications:
        • E.g. Look for taxa clustering in the wrong place
      • For species without type strains, a closely related species (based on sequence similarity) can be used as a “proxytype”
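
As a reminder to myself of how the tree-based check might look in practice, here is a rough Python sketch (my own illustration, not NCBI’s pipeline): given pairwise distances and a mapping from species to type-strain sequence IDs, flag sequences whose nearest type strain belongs to a different species than their label.

# Hypothetical sketch (not NCBI's actual pipeline): flag sequences whose
# nearest type strain belongs to a different species than their label.
# `distances` is a dict-of-dicts of pairwise distances; `type_strains` maps
# a species name to its type-strain sequence ID; `labels` maps sequence ID
# to its submitted species name.
def flag_suspect_labels(distances, labels, type_strains):
    suspects = []
    for seq_id, species in labels.items():
        if seq_id in type_strains.values():
            continue  # skip the type strains themselves
        nearest_species = min(type_strains,
                              key=lambda sp: distances[seq_id][type_strains[sp]])
        if nearest_species != species:
            suspects.append((seq_id, species, nearest_species))
    return suspects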

Results

  • Cross-validation of type strains in trees revealed the presence of mislabelled sequences
    • These were “suppressed” in the taxonomy
  • The author posted a post-publication comment describing further average nucleotide identity (ANI) comparisons between co-identical type strains1
    • Surprisingly low ANI was observed for some species
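
To try something similar myself, here is a minimal sketch of an ANI comparison between two assemblies of supposedly co-identical type strains, using fastANI as a stand-in tool (the comment doesn’t say which software was used; this assumes fastANI is on the PATH and the file names are placeholders).

import subprocess

def pairwise_ani(query_fna, ref_fna, out_path="ani.txt"):
    # fastANI writes: query, reference, ANI, mapped fragments, total fragments
    subprocess.run(["fastANI", "-q", query_fna, "-r", ref_fna, "-o", out_path],
                   check=True)
    with open(out_path) as handle:
        fields = handle.readline().split()
    return float(fields[2]) if len(fields) >= 3 else None

# e.g. flag co-identical pairs falling below the ~95-96% ANI species boundary
# ani = pairwise_ani("strain_A.fna", "strain_B.fna")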

Discussion

This article provided good examples of how type strains can be used to validate reference sequences present in a taxonomy. I would like to explore this for our project.

Sources

1: Author’s Note


RIM-DB: A Taxonomic Framework for Methanogenic Archaea

Details

Title: RIM-DB: a taxonomic framework for community structure analysis of methanogenic archaea from the rumen and other intestinal environments

Overview

The paper describes the creation of a 16S database specialized for methanogenic archaea.

Background Notes

  • Certain methanogens are underrepresented in 16S reference databases
  • Other sources of 16S reference sequences:
    • “Full-length PCR-amplified sequences from cultivation-independent studies”
      • E.g. clone library-based investigations
    • Sequences may be of lower quality

Technical Details

Overview of the database

  • Rumen and Intestinal Methanogen Database (RIM-DB)
    • Specialised taxonomic framework
    • 16S rRNA gene sequences:
      • Ruminal methanogens
      • ”… other intestinal environments where methanogens are … important”
    • Quality-Control:
      • > 1200 bp archaeal sequences
      • Chimeras detected and removed
    • SINA alignment, with manual editing for poorly aligned sequences

Selection of Sequences

  • Primary source:
    • Archaea from SILVA (v. 111)
  • Secondary sources:
    • 28 sequences found in literature not in SILVA (source not specified)
    • 20 from this study (how were these generated?)

Quality control

  • Chimeras
    • Detection performed with Qiime
    • UCHIME and BLAST-based algorithms in reference mode
      • Greengenes (05_13) used as the reference set
    • Removal:
      • Manual inspection of 5’ and 3’ ends using BLAST
      • Duplicate sequences were removed unless they were from isolates or stable cultures
  • Alignments were corrected manually if required
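
For my own reference, a minimal sketch of reference-mode chimera screening: the paper used UCHIME and a BLAST-based method within Qiime, but here I use vsearch’s --uchime_ref as a stand-in because I know its flags; file names are placeholders.

import subprocess

subprocess.run([
    "vsearch",
    "--uchime_ref", "rim_candidates.fasta",  # candidate 16S sequences
    "--db", "gg_13_5.fasta",                 # reference set (e.g. Greengenes)
    "--chimeras", "chimeras.fasta",
    "--nonchimeras", "nonchimeras.fasta",
], check=True)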

Sequence Alignment and Phylogenetic Tree Creation

  • Alignment
    • SINA aligner
    • Manual changes made for problem areas
  • Phylogenetic Tree
    • Tree created using RAxML
    • Maximum likelihood method
  • Taxonomy
    • Based on Greengenes
    • Strain level if possible
    • Most sequences assigned to the species level
  • Clustering:
    • ”… poorly-resolved clades were binned into defined groups to improve the accuracy and detail of taxonomic assignments …”
      • Based mainly on sequence identity and bootstrap support
  • Other Analyses:
    • seqinR:1
      • Information content at each alignment position (see the sketch after this list)
      • Provides estimate of taxonomic signal
  • Benchmarking:
    • Databases:
      • RIM-DB, SILVA, Greengenes
    • Isolate dataset:
      • 24 methanogen sequences from SILVA, or published in this study
      • V6–V8 regions (>1000bp)
    • Classifier:
      • Qiime, BLAST-based approach
        • Does this use lowest common ancestor? Not clear on Qiime page
    • Amplicon dataset
      • Combination of other sequencing datasets:
      • Partial 16S rRNA gene sequences (nucleotide positions 935–1385; E. coli numbering)
      • 24 accession numbers provided in the supplemental material, but most of the details were not given
        • E.g. Sequencing platform, sample type, study, etc.
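
The information-content analysis mentioned above was done with seqinR in R; as a note to myself, the same idea (per-column Shannon entropy over an aligned FASTA) can be sketched in a few lines of Python. This is my own illustration, not the authors’ method.

from collections import Counter
import math

def column_entropies(aligned_seqs):
    # Shannon entropy (bits) at each alignment column; lower = more conserved
    entropies = []
    for column in zip(*aligned_seqs):
        counts = Counter(base for base in column if base not in "-.")
        total = sum(counts.values())
        if total == 0:
            entropies.append(None)  # all-gap column
            continue
        entropies.append(-sum((n / total) * math.log2(n / total)
                              for n in counts.values()))
    return entropies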

Results

  • Genus level classification:
    • Few differences between databases or datasets
  • Species level classification:
    • RIM-DB provided the most assignments at this level
    • SILVA performed similarly to RIM-DB
  • Strain
    • Only RIM-DB contained strain level assignments

Discussion

The authors stressed that sequencing errors and sequence similarity make it difficult to delineate certain groups. Therefore, they grouped “difficult-to-delineate” sequences together.

Some Notes:

I grabbed this paper to see how other teams (other than SILVA, RDP, and Greengenes) are constructing specialized 16S databases. Not being familiar with methanogens, I can’t evaluate how comprehensive their database is. I think that the benchmarking methods could have been improved: the isolate dataset doesn’t reflect a realistic dataset, and few details were provided on the amplicon dataset. Does their database provide superior performance over a range of datasets that methanogen researchers would be interested in? How does it perform on more complex communities? Furthermore, I would be hesitant to use this database for diversity calculations, as it is limited to a small number of organisms and doesn’t include bacteria.

I liked the inclusion of the analysis of the information content contained in the region of interest of the 16S rRNA gene. I would have liked them to expand on this. For example, how does the information content differ between different taxonomic groups? Nonetheless, I’m glad that they mentioned the R tool that they used, as it could be useful for our work. I’m not familiar with clone library-based technologies, but they could be worth exploring for inclusion in our database.

Sources

1: seqinR


Pat Schloss on rRNA Reference Database Sequence Alignment

Source Information

Source: Mothur blog

PD Schloss

Aug 4, 2015

Post: “No, Greengenes hasn’t improved”

Overview

This blog post offers some food for thought on the alignment of rRNA reference databases. Pat Schloss is pretty candid here, and provides some strong arguments in favour of using a “reference-based” alignment approach, rather than approaches that model the secondary structure of the 16S rRNA gene.

Notes

  • Why do reference alignments matter?
    • Aims to ensure positional homology across all sequences, i.e. that we are comparing the same regions of the sequences (important for diversity calculations)
    • Provides a “whole dataset alignment” at a fraction of the computational cost of multiple sequence alignments
    • Intermediate between pairwise alignments and MSAs with respect to the resulting number of OTUs and distance measurements
  • Common methods used for aligning rRNA sequences, while preserving secondary structure
    • Reference-based (e.g. NAST, Mothur’s align.seqs; a small example follows these notes)
      1. Generate reference alignment
      2. Align sequences to reference alignment
    • Model-based (e.g. Infernal)
      • rRNA secondary structure is modelled
  • rRNA reference database alignments
    • Reference-based:
      • SILVA
    • Model-based
      • Greengenes
      • RDP
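
To remind myself what the reference-based route looks like in practice, here is a small example calling mothur’s align.seqs (which the post names) from Python. Assumes mothur is installed; the FASTA and reference alignment file names are placeholders.

import subprocess

# Align candidate sequences against a pre-made reference alignment
subprocess.run([
    "mothur",
    "#align.seqs(fasta=candidates.fasta, reference=silva.seed.align, flip=T)"
], check=True)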

Comments

Dr. Schloss’ results indicate that the model-based approaches result in the loss of a significant number of bases in the hypervariable regions. For example, the Greengenes alignments resulted in 18 sequences being discarded on average compared to SILVA’s. This surprised me, as I was under the impression that Infernal was the go-to tool for rRNA alignments. This post could also prove to be helpful in the future, when we perform alignments on our database, as the source code is provided.


Scaling metagenomic classification

Details

Title: Scalable metagenomic taxonomy classification using a reference genome database

Citation:
Ames, Sasha K., et al. “Scalable metagenomic taxonomy classification using a reference genome database.” Bioinformatics 29.18 (2013): 2253-2260.

Overview

Presents a new taxonomic classification algorithm, with the aim of scaling well to analyze large metagenomic (typically shotgun) datasets. This method has very high memory requirements (0.5-1 terabytes). The authors’ aim was to classify to the species level when possible.

Tool is a part of the Livermore Metagenomics Analysis Toolkit (LMAT).1

Background Notes

  • Existing bioinformatic approaches for addressing scalability
    • Query size reduction:
      • Read assembly/clustering
        • Can improve taxonomic signal
        • Error prone
    • Reference database size reduction
      • Only store markers from more informative sequences
      • Improves scalability, but discards potentially useful seqs
    • Faster database search approaches
      • Larger search seeds
        • Parameter choice crucial

Technical Details

  • Reference database:
    • Contains genome sequences, and associated taxonomic identifiers
    • Input:
      1. NCBI taxonomy tree
      2. Ref genome seq db (partial and complete microbial seqs, incl. plasmids, viruses, protists, etc.)
      3. Mappings between 1 and 2
  • Scoring Read Taxonomic IDs:
    • Score = proportion of the read’s k-mers belonging to a given taxon
    • Normalized by the proportion of k-mers of a random read that would also belong to that taxonomy node
  • Assigning Taxonomic Ranks
    • Combines LCA selection with read-label score evaluation (a rough sketch follows this list)
      • Most specific label used, such that no other label has similar score (+/- 1 S.D.)
      • Conflicts resolved by going to less specific taxonomic ranks
  • Test Data
    • Reference data:
      • “Full” k-mer/taxonomy db
      • Smaller db limited to marker records
    • 3 simulated datasets:
      • Metasim, 100bp PE reads, Illumina error model, 1 M reads
      • Virus, prokaryote, fungi, protists ref. seqs.
      • Bacterial:
        • “Equal concentrations” of the 100 bacterial strains
        • 75 distinct species
        • NOTE: What does “equal concentrations” mean?
          • Number of seqs?, genome coverage?
    • 3 real datasets from SRA:
      • viral, human microbiome, single species “raw read metagenome”
      • Used to evaluate runtime
    • 150 giga-base Tyrolean Iceman dataset
      • Used to evaluate runtime
  • Test software
    • Genometa, PhymmBL, MetaPhlAn
      • Existing database used for each tool
        • Genometa and PhymmBL dbs were too hard to adapt for this study
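
A rough Python paraphrase of the scoring and rank-assignment idea described above (my own reading of it, not the LMAT code; the normalization against a random read is omitted here):

import statistics

def kmers(seq, k=20):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def score_read(read, taxon_kmer_sets, k=20):
    # score = fraction of the read's k-mers found in each taxon's k-mer set
    read_kmers = kmers(read, k)
    return {taxon: len(read_kmers & kset) / len(read_kmers)
            for taxon, kset in taxon_kmer_sets.items()}

def assign_label(scores):
    # keep the top label only if no other label scores within 1 S.D. of it;
    # a fuller version would then retry at a less specific rank
    if len(scores) < 2:
        return next(iter(scores), None)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    sd = statistics.stdev(scores.values())
    return ranked[0][0] if ranked[0][1] - ranked[1][1] > sd else None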

Results

  • LMAT showed highest species-level accuracy
  • Significantly fewer species-level assignments with marker reference db (40.4%) vs full db (74.2%)
  • LMAT was faster than the other tools on all datasets

Some Thoughts

This approach could prove beneficial for centers that analyze large shotgun metagenomic datasets on an ongoing basis. In practice, the memory requirements could be prohibitive for most labs. It also would have been nice to see the comparison to other tools based on the same reference database.

Here are a few things that I found noteworthy:

  • Using the full reference database led to far more reads being classified than when using only the marker-based database
    • This should be a consideration if we benchmark our tool on shotgun metagenome data following the 16S benchmarks
  • Their approach for assigning a unique rank to a read (no other rank is given a score within 1 S.D. of the rank in question) looks like something we could adapt for our classifier
    • It might be worth digging into the supplemental info. to verify that the score was normalized using a single random read
      • Would a better normalizing approach use multiple reads?
      • What about bootstrapping?
      • Is 1 S.D. the best threshold? Might be worth trying a range of cutoffs

Sources

1: LMAT


Downloading from the SRA

How to download SRA fastq files and metadata

I find the documentation for the SRA fairly confusing, and performing simple tasks such as downloading files from the Linux CLI can be tricky.

Here’s how to download SRA fastq files and metadata from the CLI. This requires the SRA Toolkit for downloading FastQ files. http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

For PE reads, use the --split-files option. Otherwise, the forward and reverse reads will be concatenated together.

fastq-dump --split-files SRR1561863

To download the metadata for SRR1561863, use wget. The following commands will download the metadata as CSV or XML.

wget -O ./SRR1561863_info.csv 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRR1561863'

wget -O ./SRR1561863_info.xml 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=xml&term=SRR1561863'

SRR1561863 can be replaced with the desired accession number.
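
When there are several runs to fetch, I wrap the same commands in a small loop; a sketch (assumes fastq-dump and wget are on the PATH; the accession list is a placeholder):

import subprocess

runs = ["SRR1561863"]  # add further run accessions here
base = ("http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi"
        "?save=efetch&db=sra&rettype={fmt}&term={run}")

for run in runs:
    subprocess.run(["fastq-dump", "--split-files", run], check=True)
    for fmt, ext in [("runinfo", "csv"), ("xml", "xml")]:
        subprocess.run(["wget", "-O", f"{run}_info.{ext}",
                        base.format(fmt=fmt, run=run)], check=True)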

Sources:

https://nsaunders.wordpress.com/2011/12/22/sequencing-for-relics-from-the-sanger-era-part-1-getting-the-raw-data/


Impact of Training Sets

Details

Title: Impact of training sets on classification of high-throughput bacterial 16S rRNA gene surveys

Citation: Werner, Jeffrey J., et al. “Impact of training sets on classification of high-throughput bacterial 16S rRNA gene surveys.” The ISME Journal 6.1 (2012): 94-103.

Overview

This paper looks at the impact of 16S training sets on classifier performance. Specifically, they evaluated the effect of different numbers of reference sequences, different taxonomy types, and reference trimming on classifying sequences from 5 separate 16S sequencing projects. Unfortunately, they didn’t use any datasets that could be used to determine accuracy (i.e. simulated or spiked samples).

Background Notes

A past study demonstrated that the RDP classifier, Simrank, and DNADIST performed the best on classifying sequences using the Greengenes database.1 Tools such as the RDP classifier don’t use positional information, so reference sequence trimming was expected to improve performance.

Technical Details

  • Datasets:
    • 16S sequencing data (454 platform; V1-V2 hypervariable regions):
      • Human microbiome: multiple different organs
      • Mouse gut
      • Python gut (time-series)
      • Mouse gut
      • Soil
    • Reference data: Greengenes
  • Software:
    • Sequence processing: Qiime2
    • Classification: RDP (naive Bayes) using Mothur

Results

  • Training set size:
    • Improved taxonomic depth for larger training sets
    • Fewer unclassified sequences for larger training sets
  • Taxonomy:
    • Similar performance for RDP and Greengenes taxonomies when used on same input sequences
    • They mentioned that the taxonomy used might affect abundance-based metrics
  • Trimming:
    • Trimming reference data to amplified region improved classification
    • Increased the confidence scores obtained for classifications
    • Note: I think this could be particularly useful when looking for pathogenic organisms
  • Clustering:
    • No major differences observed when clustering input sequences at 97% or 99%
    • Unclassified sequences typically clustered together
    • This suggests that they were underrepresented in the reference databases

Discussion

The authors recommended using larger and more diverse reference databases. In contrast, they felt that whether or not reference trimming should be done should be left to the researcher. Potential downsides of trimming are that certain regions may be more difficult to trim, and that it is harder to manage data from samples with different amplified regions.
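
If we try reference trimming ourselves, one simple way is to slice an aligned reference set to the alignment columns spanned by the primers. A minimal sketch, assuming those column coordinates are already known (the numbers below are placeholders, not values from the paper):

AMPLICON_START, AMPLICON_END = 100, 1500  # placeholder alignment columns

def trim_alignment(records, start=AMPLICON_START, end=AMPLICON_END):
    # records: iterable of (seq_id, aligned_sequence) tuples
    for seq_id, aligned in records:
        yield seq_id, aligned[start:end]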

Sources

1: Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers

2: The tools and methods used to trim the reference sequences were not covered in much detail in the methods. They mentioned in the results that Qiime was used to process sequences, but the exact commands and methods were not covered. It would have been useful to know how this was done, so that we could try the same methods.


All Species Living Tree Project

Details

Title: Chapter 3 - The All-Species Living Tree Project

Citation: Yarza, Pablo, and Raul Munoz. “The all-species living tree project.” Methods in Microbiology 41 (2014): 45-59.

Overview

This paper is the third chapter from a book series on prokaryote systematics. It summarizes how the database was created and curated, and describes some of the resources that the team provides that are based on the database. The team describes the project as follows:

The aim of the project is to reconstruct separate and curated 16S and 23S rRNA datasets and trees spanning all sequenced type strains of the hitherto classified species of Archaea and Bacteria.

The project is guided by the editors of the journal Systematic and Applied Microbiology. The ARB/SILVA and List of Prokaryotic names with Standing in Nomenclature (LPSN) teams take care of the technical details.

Technical Details

Data Collection

  • Seqs mainly from SILVA dbs (type strains only), which are mainly from INSDC (ENA specifically)
  • Manually updated using Bergey’s Manual, EMBL, IJSEM for missing strains

Background

Prior to describing the project itself, the paper describes the existing database resources for microbiology, such as sequence databases (INSDC), 16S-specific databases, nomenclature resources, and type strain information. The type strain part is noteworthy, as type strain repositories are expected to have higher standards for the information provided to them by the teams depositing a new strain (although the authors noted that problems still exist).

They also describe the main sources of error for the taxonomic data entered into biological databases: (1) different repositories containing different names for the same type strains, and (2) the large number of strains (> 1 million) and repositories (~600).

Taxonomy

  • Merged existing SILVA taxonomy with The Taxonomic Outline of Bacteria and Archaea (TOBA) 1, NCBI, and Prokaryotic names with Standing in Nomenclature (LPSN)2
  • “Suggestions” from LTP team members also used

Filtering

  • Only one sequence per type strain retained if multiple seqs available for a type species
    • The sequence judged to be of highest quality retained
  • Usually only sequences judged to be “high-quality” were retained, but lower-quality seqs were permitted for under-represented groups
    • Depends on the specific database

Notes:

Many of the curation steps are performed manually by the team members. The paper doesn’t describe what these manual steps actually are. It would be good to see if they are documented somewhere.

The SILVA databases are separate from the LTP tree. The LTP tree is limited to type strains, while SILVA is more inclusive.

1: TOBA

2: LPSN


RDP DB

Details

Title: Ribosomal Database Project: data and tools for high throughput rRNA analysis

Citation: Cole, James R., et al. “Ribosomal Database Project: data and tools for high throughput rRNA analysis.” Nucleic acids research (2013): gkt1244.

Overview

The paper summarizes the RDP project, its associated tools, database curation, etc. The paper was written for someone unfamiliar with the project, and provides a good overview.

Technical Details

Data Collection

  • Seqs mainly from INSDC (ENA specifically)
    • >= 500 bp 1
    • Majority of seqs are incomplete
      • 85% and 97% of bacterial and archaeal sequences, respectively, in RDP were directly isolated from environmental samples

Taxonomy

  • Primary: Bergey’s
    • Assignment based on db_xref from INSDC
    • Due to poor quality of annotation:
      • Updated using List of Prokaryotic names with Standing in Nomenclature (LPSN)2
    • Synonyms are obtained from Bacterial Nomenclature Up-to-Date3
  • Taxonomy compared with the All Species Living Tree Project (phylogenetic assessment)4
  • Discrepancies
    • RDP creates a tree
    • Accepts best supported clades

Filtering

  • RDP seqmatch k-nearest neighbor
    • Accepts seqs clustered with archaea, bacteria, or fungi
  • Poor alignments discarded
  • Chimera screening performed with UCHIME

Alignment

  • Uses Infernal
    • Incorporates secondary structure
    • New seqs can be added to existing alignment
    • Can align V6 region
  • RDP corrects start stop positions for rRNA genes
  • Fungi
    • 5.8S and 28S aligned with LSU aligner
    • ITS “evolves too rapidly for global alignment”
      • RDP treats ITS as an insert
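
As a note to myself, the model-based route can be tried with Infernal’s cmalign; a minimal sketch (assumes Infernal is installed, the covariance model and FASTA names are placeholders, and RDP’s actual pipeline may differ):

import subprocess

subprocess.run(["cmalign", "-o", "new_seqs.sto",
                "ssu_model.cm", "new_seqs.fasta"], check=True)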

RDP Browser

  • RDP provides NCBI LinkOut data (corresponding NCBI records)
  • User specified filtering:
    • Type strain, cultured, full sequences, quality
  • Metadata:
    • Associated projects, publications
    • CNVs

Other RDP Tools of Interest

  • SeqMatch: k-nearest neighbor
  • Classifier: Naive Bayes
    • The training tools can test the consistency of training sets and flag possible taxonomy errors5
  • Probe match: queries database for matches to nucleotide seq (primers)
  • Aligner
    • Based off of Infernal

1: I think having partial sequences would have a negative impact on classification if the reference sequence doesn’t contain the hypervariable region of the sequence to be classified. To the best of my knowledge, RDP does not exclude inapplicable reference sequences during training.

2: LPSN

3: Up-to-date

4: living tree

5: Might be of use for us for evaluating taxonomies?


Resolving Taxonomic Names

Details

Title: The taxonomic name resolution service: an online tool for automated standardization of plant names

Citation: Boyle, Brad, et al. “The taxonomic name resolution service: an online tool for automated standardization of plant names.” BMC bioinformatics 14.1 (2013): 16.

Web interface: http://www.iplantcollaborative.org/ci/tnrs

Source code: https://github.com/iPlantCollaborativeOpenSource/TNRS/

Overview

This tool looks to be very promising for resolving the incorrect taxonomic names in the 16S databases that we are using. It was created to resolve plant names, but the source code is available on Github. The authors stress repeatedly that it could be adapted for non-plant taxonomies. Furthermore, the algorithm looks to be quite robust, and uses multiple methods of identifying and correcting errors.

Description

Types of resolved problems

  • misspellings
  • lexical variants (different ways of writing the same name)
  • homonyms (identical names for different taxa)
  • homotypic synonyms (name changes due to changes in classification)

Supports the combining of different external taxonomic sources. We might consider doing this to resolve differences between rRNA dbs (eg: SILVA, Greengenes), but the authors advise against this. They recommend assigning a priority to one source over another.

NOTE: The authors advise against using the NCBI taxonomy due to poor quality1

Technical Details

Database

  • periodically updated
  • MySQL
  • supports multiple taxonomies

Algorithm

Pre-Processing

  • Strips family names that are pre-pended to the species name
  • Corrects all-caps names
  • Identifies names with complete matches to database, and excludes these from further processing

Name parsing

  • uses a series of regular expressions
  • uses a “relaxed mode” that allows special characters (eg: ?)

Fuzzy Matching

  • substitutes specific characters or character pairs for others
  • takes into account nomenclature conventions
  • calculates estimated distance based on the # of insertions, deletions, and substitutions (very similar to bioinformatics!)
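
A minimal sketch of the fuzzy-matching idea, not the TNRS implementation itself: edit distance over insertions, deletions, and substitutions, turned into a length-normalized score.

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match_score(query, candidate):
    # 1.0 = exact match; longer names tolerate more edits for the same score
    dist = edit_distance(query.lower(), candidate.lower())
    return 1.0 - dist / max(len(query), len(candidate))

# e.g. match_score("Heliobacter pylori", "Helicobacter pylori") is about 0.95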

Post-Processing

  • Calculates match scores
    • longer strings are scored higher
    • based off of estimated distance
    • compares transformed name score to the original name
  • returns JSON formatted results

Interface

  • Accessed via RESTful API
  • provided R script

Other Tools

Tropicos

Catalog of Life

Global Names Resolver

Software Requirements

  • OS: Linux

  • Programming languages: PHP, MySQL, Ruby, Java

Some Thoughts

This looks very promising for our purposes. I think we could adapt this to be used for bacteria. A potential problem could be the different languages used. I haven’t used Java recently, and am unfamiliar with PHP and Ruby. I took a database course in college, but we only covered basic SQL commands, so I haven’t actually created a database myself.

1: iPhylo

2: sourcecode


Some thoughts on the database cleaning project

Click here for TL;DR

After reading up on the main 16S databases and their construction, several ideas and questions come to mind regarding our project. First of all, I think that we should try to get a general idea of the scope of the project. I think that our aim is to complement the existing tools, rather than create a database from scratch. Nonetheless, we would still need to decide what to implement, and what we should include in our project.

Each database team has implemented a few unique features that we might want to consider ourselves. For example, after cleaning up the taxonomic names, are we going to perform further cleaning steps? Both Greengenes and RDP are chimera checked, but it seems that chimera checking is difficult to implement in practice.1 The SILVA team is concerned that chimera checking would lead to actual biological sequences being thrown out, and does not perform chimera checking.2

Alignments are performed on each database. I assume that we will align our database following the cleaning stage. An alignment might also help us evaluate the quality of our databases. Several questions come to mind, such as which aligner(s) to use, and whether we should use the same method as the original database or the method that we judge to be the most appropriate. Aligning 16S sequences can be tricky, as the V6 region has higher variability than the other hypervariable regions. One approach for addressing this problem is to perform a pairwise alignment to a reference for the V6 region of each sequence to be used.3

Here is a brief list of some other considerations that have come to mind:

  • Which database? All? One?
  • 16S only? Others such as ITS or 28S?
  • How often would we update?
    • Projects such as SILVA can update as often as monthly; Greengenes was last updated in 2013
  • How much of the pipeline are we going to implement?
    • Eg: just clean up the taxonomy, and align?
    • Silva and RDP allow the user to look at specific regions
  • I’ve seen some users request databases that contain specific hypervariable regions in order to improve classifier training
    • I’m not aware of this being implemented anywhere. Should we?
  • Which taxonomy?
    • Should we stick to one, or provide several options?
    • Are we going to enforce having monophyletic groups (such as Greengenes)?
      • Some common bacteria (e.g. E. coli) won’t be classified if this is enforced
  • What output formats will we provide?
    • Both Mothur and Qiime have specific formats for their taxonomy files
    • Seeing that these tools are used extensively, I think we should provide Mothur and Qiime formatted taxonomies
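
As an illustration of the two output flavours, here is how I’d write one taxonomy line per sequence in Mothur-style and QIIME/Greengenes-style formats; these are the conventions as I understand them, so they are worth double-checking against each tool’s documentation.

RANK_PREFIXES = ["k__", "p__", "c__", "o__", "f__", "g__", "s__"]

def mothur_line(seq_id, lineage):
    # e.g. "SEQ1<TAB>Bacteria;Firmicutes;...;"
    return f"{seq_id}\t{';'.join(lineage)};"

def qiime_line(seq_id, lineage):
    # e.g. "SEQ1<TAB>k__Bacteria; p__Firmicutes; ..."
    ranks = [prefix + name for prefix, name in zip(RANK_PREFIXES, lineage)]
    return f"{seq_id}\t{'; '.join(ranks)}"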

TL;DR and Summary

  • Which database? All? One?
  • How much of the pipeline are we going to implement?
  • Alignment isn’t trivial
  • What is the scope of the project?
    • i.e. are we implementing more than taxonomic name resolution and other cleaning steps?
  • Which taxonomy/taxonomies to support? Enforce monophyly?
  • Output format? Mothur and Qiime as well?

1: Source

2: Source

3: Source


Website Todos

Stuff to do for this website:

Near term

  • [x] Create task list layout, and list the task lists associated with each project
  • [x] Automate the creation of posts in project directories; have script call RStudio to edit the file
  • [x] Add ability to create drafts using script
  • [ ] Link to repositories used in each project
  • [ ] Add support for knitr and R Markdown content 1
  • [ ] Change markdown parser to Pandoc 2
  • [x] create basic site layout

Project Pages

  • [x] Show list of posts in project subfolders (categories)
  • [ ] Show task lists and progress on index page for each project

Medium Term

  • [ ] Show publications 3
  • [ ] Add math symbol support
  • [ ] Add search ability
  • [ ] Look into using JSON for basic database functionality4
    • Index for quick notes (posts with short notes for articles I’ve read either partially or completely)
    • Load data from CLI todo list for tasks, progress, etc.
    • Load data from doiresolver program
  • [ ] Read up on variable inheritance (https://github.com/jekyll/jekyll/issues/3307) (http://stackoverflow.com/questions/23174258/can-yaml-front-matter-be-inherited-in-jekyll)

Possibly Useful

  • [ ] Page ordering 5
  • [ ] Symlinks for organization and editing 5

1: carlboettiger; carlboettiger; chepec; bryer; yihui; ropensci; jonzelner; jfisher

2: Pandoc

3: publications

4: http://jekyllrb.com/docs/datafiles/; Link; Link

5: bruth