Recent advances in NCBI's Eukaryotic Genome Annotation Pipeline and expansion to process RNA-seq data.
Terence D. Murphy, Alexander Souvorov, Francoise Thibaud-Nissen, Eyal Mozes, Wratko Hlavina, Eric Engelson, Olga Ermolaeva, Alex Astashyn, Craig Wallin, David Managadze, Kim Pruitt, Paul Kitts, Michael DiCuccio.
NCBI, NIH, Bethesda, MD.
Recent advances in sequencing technology are producing an explosion of genome sequences for a wide variety of taxa. Making use of this sequence barrage will require accurate and efficient methods for genome annotation, especially for protein-coding and non-coding genes. The NCBI eukaryotic genome annotation pipeline has been substantially redesigned in the last few years to help meet this need. The pipeline produces evidence-based models using transcript and protein alignments combined with ab initio prediction, and is being extended to include use of RNA-seq data. It is largely automated, from the initial retrieval of genome, transcript and protein sequences from NCBI archival databases, through calculating and interpreting sequence alignments and generating validation and QA reports, to delivering final genome annotation results that are integrated with the RefSeq and Gene databases. The pipeline uses assembly-assembly alignments to track gene annotations from one annotation run to the next, thus maintaining identifiers even when the assembly is updated. The final annotation product can include transcripts and proteins whose sequence has been modified relative to the draft genome assembly in order to correct a truncating mismatch or frameshift and represent a more complete protein. The pipeline has been designed to annotate multiple organisms in parallel, with run times in the range of 1-5 days. This presentation will provide an overview of the annotation pipeline, our approach to integrating RNA-seq data, and an analysis of annotation results for test cases including D. melanogaster and human. We will be working with FlyBase to help re-annotate the genomes of 11 Drosophila species, which should greatly improve the cross-species analyses possible in this important genus.
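
As an illustration of the identifier-tracking idea described above, the following is a minimal Python sketch of how gene identifiers might be carried from one annotation run to the next using assembly-assembly alignments. It is not the pipeline's actual implementation. The `Block` and `Gene` records, the `lift_over` and `carry_over_ids` functions, the 0.8 overlap threshold, the `NEW` identifier prefix, and all sequence and gene names are hypothetical, and the sketch assumes ungapped, same-strand alignment blocks for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Block:
    """One ungapped, same-strand aligned block between assemblies (hypothetical format)."""
    old_seq: str
    old_start: int  # 0-based, half-open coordinates
    old_end: int
    new_seq: str
    new_start: int


@dataclass(frozen=True)
class Gene:
    gene_id: str
    seq: str
    start: int
    end: int


def lift_over(gene: Gene, blocks: List[Block]) -> Optional[Tuple[str, int, int]]:
    """Project an old-assembly gene onto the new assembly.

    Returns None unless the gene lies entirely inside a single block.
    """
    for b in blocks:
        if b.old_seq == gene.seq and b.old_start <= gene.start and gene.end <= b.old_end:
            shift = b.new_start - b.old_start
            return b.new_seq, gene.start + shift, gene.end + shift
    return None


def overlap_fraction(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Intersection over union of two half-open intervals on the same sequence."""
    inter = max(0, min(a_end, b_end) - max(a_start, b_start))
    union = max(a_end, b_end) - min(a_start, b_start)
    return inter / union if inter > 0 else 0.0


def carry_over_ids(old_genes: List[Gene], new_models: List[Gene],
                   blocks: List[Block], min_overlap: float = 0.8) -> Dict[str, str]:
    """Map each new model to a previous gene ID when the lifted gene overlaps it well.

    Models with no sufficiently overlapping predecessor receive a fresh ID.
    The threshold is an assumed value chosen only for illustration.
    """
    lifted = []
    for g in old_genes:
        pos = lift_over(g, blocks)
        if pos is not None:
            lifted.append((g.gene_id, pos))

    assigned: Dict[str, str] = {}
    used = set()  # each old identifier is reused at most once
    next_new = 1
    for m in new_models:
        best_id, best_score = None, 0.0
        for gid, (seq, s, e) in lifted:
            if seq != m.seq or gid in used:
                continue
            score = overlap_fraction(s, e, m.start, m.end)
            if score > best_score:
                best_id, best_score = gid, score
        if best_id is not None and best_score >= min_overlap:
            assigned[m.gene_id] = best_id
            used.add(best_id)
        else:
            assigned[m.gene_id] = f"NEW{next_new}"
            next_new += 1
    return assigned


# Toy example: the new assembly gains 500 bases upstream of the aligned region.
blocks = [Block("chr1_v1", 0, 10000, "chr1_v2", 500)]
old_genes = [Gene("OLD_GENE_1", "chr1_v1", 1000, 2000),
             Gene("OLD_GENE_2", "chr1_v1", 5000, 6000)]
new_models = [Gene("m1", "chr1_v2", 1500, 2500),
              Gene("m2", "chr1_v2", 5550, 6450),
              Gene("m3", "chr1_v2", 9000, 9500)]
result = carry_over_ids(old_genes, new_models, blocks)
```

In this toy case, `m1` and `m2` inherit `OLD_GENE_1` and `OLD_GENE_2`, while `m3`, which has no predecessor at its location, is given a new identifier. A production system would also need to handle gapped and reverse-strand alignments, genes spanning several blocks, and gene splits and merges, none of which this sketch attempts.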