Abstract Detail

Comparative Genomics/Transcriptomics

Sutherland, Brittany [1], Tiley, George [2], Barker, Michael [1].

Characterization of ancient whole-genome duplications from genomic and transcriptomic data: a machine learning approach.

Whole-genome duplication (WGD, also known as polyploidy) is often a cyclic process, with extensive gene loss or divergence following the WGD and eventually resulting in a return to diploidy. In such diploidized taxa, polyploid ancestry can be difficult to infer. The most common method for inferring ancient WGDs, the Ks plot, relies on visualizing the distribution of synonymous divergence (Ks) between duplicate gene pairs across the genome. A characteristic peak can be found in these distributions, resulting from the burst of gene duplications that occurred with the WGD. Although conceptually simple, interpretation of Ks plots is an inexact science. Peak height and prominence can be affected by gene birth and death rates, the age of the WGD event, and data quality. Peaks are often inferred by eye and may be prone to observer bias. Quantitative methods that identify peaks by fitting normal distributions to Ks plots frequently report multiple significant peaks, making it difficult to assess which of them result from WGD. Despite these difficulties in analysis and interpretation, Ks plots may reveal a wealth of information about WGDs, including the number and age of such events and whether a WGD resulted from auto- or allopolyploidy. Here, we present a new machine learning approach for the inference and characterization of WGDs. Using hundreds of thousands of simulated gene age distributions reflecting the presence or absence of WGD, varying ages of duplication, and different types of duplication, we trained a set of machine learning models for WGD inference. With these models, we have achieved 100% accuracy against an empirical dataset of taxa previously analyzed by syntenic methods, and 94% congruence with classifications made with existing methods for a larger (~1,400-specimen) dataset representing a wide sampling of green plants. Overall, our new machine learning approach provides a robust, repeatable, and objective method to infer ancient WGDs in genomic data.
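
The general idea of training a classifier on simulated gene age distributions can be illustrated with a small toy sketch. It is not the authors' simulator or models, which the abstract does not describe. Everything here is an assumption made for illustration: the exponential background of small-scale duplicates, the normal-shaped WGD burst, the rate, peak width, WGD fraction, Ks cutoff, and bin count, all function names, and the use of a 1-nearest-neighbor classifier as a stand-in for "a set of machine learning models".

```python
import random

MAX_KS = 3.0   # assumed upper Ks cutoff (illustrative, not from the abstract)
N_BINS = 30    # assumed histogram resolution (illustrative)


def simulate_ks(rng, n_pairs=1000, wgd_age=None, wgd_fraction=0.3):
    """Simulate Ks values for duplicate gene pairs.

    Background duplicates follow an exponential decay (a crude stand-in for
    steady gene birth and death); an optional WGD adds a normal-shaped burst
    of duplicates centered on wgd_age. All parameters are assumptions.
    """
    values = []
    while len(values) < n_pairs:
        if wgd_age is not None and rng.random() < wgd_fraction:
            ks = rng.gauss(wgd_age, 0.1)      # WGD-derived pair
        else:
            ks = rng.expovariate(1.5)         # background pair
        if 0.0 < ks < MAX_KS:                 # keep values inside the plot range
            values.append(ks)
    return values


def ks_histogram(values):
    """Convert Ks values into a normalized histogram (the Ks plot as a vector)."""
    counts = [0] * N_BINS
    width = MAX_KS / N_BINS
    for ks in values:
        counts[min(int(ks / width), N_BINS - 1)] += 1
    return [c / len(values) for c in counts]


def make_dataset(rng, n_per_class):
    """Build labeled histograms: half without WGD, half with a WGD of random age."""
    data = []
    for _ in range(n_per_class):
        data.append((ks_histogram(simulate_ks(rng)), "no_wgd"))
        age = rng.uniform(0.5, 2.0)           # assumed range of WGD ages in Ks units
        data.append((ks_histogram(simulate_ks(rng, wgd_age=age)), "wgd"))
    return data


def classify(train, features):
    """Label a histogram with the class of its nearest training histogram (1-NN)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda item: sq_dist(item[0], features))[1]


def accuracy(train, test):
    """Fraction of test histograms whose predicted label matches the true label."""
    hits = sum(classify(train, feats) == label for feats, label in test)
    return hits / len(test)
```

A call such as `accuracy(make_dataset(rng, 100), make_dataset(rng, 20))` with `rng = random.Random(42)` trains and evaluates on independent simulations. Performance on such clean toy data says nothing about performance on empirical Ks plots, where peak shape, data quality, and multiple overlapping events complicate the picture.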

1 - University of Arizona, Department of Ecology & Evolutionary Biology, P.O. Box 210088, Tucson, AZ, 85721, USA
2 - Duke University, Department of Biology, Campus Box 90338, Durham, NC, 27708-0338, USA

Keywords:
none specified

Presentation Type: Oral Paper
Session: CG2, Comparative Genetics/Genomics II
Location: Tucson I/Starr Pass
Date: Tuesday, July 30th, 2019
Time: 3:45 PM
Number: CG2008
Abstract ID: 701
Candidate for Awards: Margaret Menzel Award