Accounting for systematic error in RNA-seq based analysis of allele-specific expression. Rita M. Graze1,4, Luis G. Léon-Novelo2, George Casella (posthumous)3,4, Justin M. Fear1,4, Lauren M. McIntyre1,4. 1) MGM, UF, Gainesville, FL; 2) Mathematics, UL-LFT, Lafayette, LA; 3) Statistics, UF, Gainesville, FL; 4) Genetics Institute, UF, Gainesville, FL.

   Genetic differences in transcript regulation can arise from sequence variation in the regulatory regions of a gene itself (cis), in regulatory or coding regions of trans-acting factors, or through indirect or epistatic effects. In diploid organisms, expression from two potentially different copies of each gene can contribute to transcript abundance and to subsequent protein production. Alleles expressed in a common cellular environment can differ in cis-regulatory sequence but share a common pool of trans-acting factors. For this reason, a common method of identifying cis-regulatory differences is to perform an analysis of allele-specific expression (ASE) and identify cases where alleles are expressed at different steady-state transcript levels within an outbred or F1 genotype (allelic imbalance, AI). RNA-seq is the technology most frequently used to analyze ASE. Error variance and systematic error are important issues in RNA-seq-based analysis of ASE, and the binomial test is insufficient to address them. DNA controls are an excellent solution to these issues but are not practical in large-scale experiments. We show that regions of sequence similarity in the genome result in mapping ambiguity and explain the map bias found in simulations. We find that these regions are associated with detection of AI. As a result, estimates of the prevalence of cis variation are inflated if no control is used. Ambiguity is not the only source of bias identified by DNA controls. We propose a flexible Bayesian model that is applicable to a wide variety of experimental designs. The model can use information from different sources, such as DNA controls or simulations, to correct for systematic error. The proposed model performs well compared to the standard binomial test. We use the performance of the improved model, together with our increased understanding of the role of genome ambiguity, to optimize analysis plans for ASE studies.
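   To illustrate why an uncorrected binomial test can inflate calls of AI, the following minimal Python sketch compares a test against the usual null of 0.5 with a test against a null proportion estimated from a control (for example DNA reads or simulated reads). This is not the Bayesian model proposed in the abstract; it is only a simplified stand-in that shows how systematic bias shifts the null. The function names and all read counts are hypothetical and chosen for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k under the null proportion p.
    """
    observed = binom_pmf(k, n, p)
    threshold = observed * (1 + 1e-7)  # small tolerance for float ties
    total = sum(
        prob
        for prob in (binom_pmf(i, n, p) for i in range(n + 1))
        if prob <= threshold
    )
    return min(1.0, total)


def null_from_control(ctrl_a, ctrl_b):
    """Null proportion for allele A estimated from control read counts.

    Assumption: the control (DNA or simulated reads) has a true 1:1
    allele ratio, so any departure from 0.5 reflects systematic bias.
    """
    return ctrl_a / (ctrl_a + ctrl_b)


if __name__ == "__main__":
    # Hypothetical counts for one gene; not data from the study.
    rna_a, rna_b = 60, 40      # RNA-seq reads assigned to alleles A and B
    ctrl_a, ctrl_b = 580, 420  # control reads assigned to alleles A and B

    n = rna_a + rna_b
    p_naive = binom_two_sided_p(rna_a, n, 0.5)
    p_adjusted = binom_two_sided_p(rna_a, n, null_from_control(ctrl_a, ctrl_b))

    print(f"null = 0.5:          p = {p_naive:.4f}")
    print(f"null from control:   p = {p_adjusted:.4f}")
```

   In this toy case the apparent imbalance in the RNA reads is close to the bias already present in the control, so the evidence for AI weakens once the null is adjusted. A plug-in null like this ignores the uncertainty in the control estimate and the error variance between replicates, which is one reason a fuller model is needed.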