Leveraging a knowledge base of Drosophila cis-regulatory modules for regulatory element discovery in diverged insect species. Kushal Suryamohan1, Majid Kazemian2, Jia-Yu Chen2, Yinan Zhang2, Marc Halfon1,3,4, Saurabh Sinha2. 1) Department of Biochemistry and Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo; 2) Department of Computer Science, University of Illinois Urbana-Champaign, IL; 3) Department of Biological Sciences, State University of New York at Buffalo; 4) Molecular and Cellular Biology Department, Roswell Park Cancer Institute, Buffalo, NY.

   Although growing numbers of insect genomes are being sequenced, defining the sequences involved in transcriptional regulation within these genomes remains a challenge. Most effective methods for cis-regulatory module (CRM) discovery rely either on empirical assays or on computational models that require sequence alignment to closely related species and knowledge of CRMs or transcription factor binding sites for the organism being studied. The lack of well-annotated databases of regulatory DNA for insect species outside the well-studied Drosophila genus makes such approaches intractable. We previously demonstrated successful computational CRM discovery in Drosophila using a supervised learning approach in which experimentally validated CRMs are used to train a CRM prediction algorithm, with false-positive rates as low as 10%. We demonstrate here that these same Drosophila CRM training data can be leveraged to identify CRMs in diverged species such as the emerging model insects Nasonia vitripennis, Tribolium castaneum, Anopheles gambiae, and Apis mellifera. When 16 predicted CRMs were tested for regulatory activity in vivo in transgenic Drosophila, 12 showed positive regulatory activity, with 75% clearly associated with the expected gene and about 50% regulating gene expression in the expected pattern. Our results indicate that the extensive experimental CRM data that exist for Drosophila can be used to facilitate CRM discovery in distant insect species with sequenced genomes but little functional data, and suggest that core regulatory strategies have been conserved despite the lack of any clear non-coding sequence alignment.