*Corresponding author email: phillip.morin@noaa.gov
Running page head: Genomic mutation rate in cetaceans
Genomic mutation is the fundamental process driving evolutionary processes that result in contemporary species, subspecies, and populations. Thus, genomic studies of divergence, speciation, and evolutionary demographics rely on accurate estimates of mutation rate. Historically, mutation rates have been estimated using a small portion of the genome and/or averaged among multiple species within a family or order, but growing evidence indicates potentially high variability in mutation rate both across the genome and among species. Here, we examine variability in mutation rate within cetaceans by mapping reference-quality genomes generated from short-read or HiC sequencing data by the Cetacean Genomes Project and other published sources. We estimate mutation rate across a range of parameters, including choice of input reference genome, coding vs. non-coding regions, chromosome-specific, species-specific, and using a variety of read depths, divergence date estimate, and methods to calculate mutation rate, in order to determine which factors have the greatest effect on variability in mutation rate or its estimation. We find that mutation rate is most variable among species, and consequently most affected by the choice of reference genome used in the alignment pipeline. Other parameters had trivial effects on mutation rate estimates in comparison. These results indicate the importance of thoughtful selection of the reference genome used in future cetacean genomic studies, in order to ensure accurate downstream analyses of evolutionary processes.
As the fundamental process underlying evolution, mutation drives processes such as divergence among species and evolutionary adaptation. The rate at which mutations accrue will ultimately determine the rate at which species diverge from each other or adapt to changes in their environment.
Contemporary estimates of species divergence and adaptation that are central to evolutionary and population genetics rely on accurate estimates of mutation rate. Biases in this estimate may be reflected in many downstream analyses that shape our understanding of evolutionary biology, e.g. the timing of splits in genetic phylogenies, patterns of selective divergence as they relate to neutral variation, or the accuracy of evolutionary demographic analyses. They may also affect conservation priorities and management actions. For example, a recent analysis of baleen whale mutation rates based on wild pedigrees indicates that mutation rates in these species are higher than previously estimated based on phylogenetic divergence; correcting the mitochondrial mutation rate previously used in genetic estimates of pre-exploitation North Atlantic humpback whale abundance reduces those estimates by 86% (Suárez-Menéndez et al. 2023) and considerably shifts our understanding of the recovery of that population.
Evolutionary demographic analysis relates genetic diversity, or the number of mutations in a given genetic lineage, to abundance via the coalescent theory. In these types of analyses especially, the mutation rate estimate used may have a drastic impact on our understanding of historical abundance of a population - and changes in the mutation rate over time could also affect the observed trends in abundance over time. Our understanding of the relationship between population demographics and genetic diversity may shift with increased resolution into mutation rate variability, both within and among species. For example, early studies of mutation rate suggest a lack of direct relationship between mtDNA genetic diversity and population size (Nabholz, Glémin, and Galtier 2008, REFS 18 from Nabholz). Further, a growing body of evidence indicates that mutation rate varies across the genome, and that site-specific variability is only partially context-dependent (Ellegren, Smith, and Webster 2003; Hodgkinson and Eyre-Walker 2011). Finally, mutation rate has been shown to vary over evolutionary timescales (REF).
In addition to varying throughout the genome and through time, mutation rates have been shown to vary widely across species [e.g. Nabholz, Glémin, and Galtier (2009); Bergeron et al. (2023); REFS 14, 43, 39 from Nabholz]. Three broad hypotheses have been proposed to explain potential drivers of mutation rate variability among species. (1) The longevity hypothesis posits that mutation rate decreases with increased lifespan, allowing individuals to live for longer periods of time without adverse effects of the accumulation of harmful mutations over their lifetimes. (2) The generation time hypothesis, which posits that mutation rate XX. (3) The metabolic rate hypothesis posits that mutation rate increases with increasing metabolic rate (often proxied by body length) due to the oxidative effects of metabolism on the cell.
The effects of demography, phylogeny, and genome location on mutation rate highlight the need to quantify mutation rate variability within taxonomic groups in order to inform downstream evolutionary and conservation genetic analyses. In addition to these factors, there are multiple decision points along the bioinformatic pipeline from sequence data to aligned genome that may affect the accurate tally of genomic mutations and downstream estimation of mutation rate. However, there is limited research on how the selection of these input parameters affects mutation rate estimates.
The increasing accessibility of genomic sequencing technologies and high-quality reference genomes for marine mammals [Morin et al. in prep; Morin et al. (2020)] has precipitated a rise in the use of genomic techniques to address previously unanswered questions about the ecology and conservation of these species. Because these species are rare, remote, elusive, and protected, it is often difficult to obtain the sample size needed to accurately detect and describe ecological processes and patterns. In cases such as this, whole genome analyses increase sensitivity and power to describe genetic patterns using a smaller sample set, thus making it a potentially powerful tool for studies of evolutionary ecology in marine mammals. These genomic studies in marine mammals, e.g. descriptions of evolutionary phylogenies or historical demographics of species, currently rely on estimates of mutation rate that are averaged across a family of species and often based on only a small portion of the genome.
Recently, the development of a method to estimate mutation rate by quantifying the number of de novo germline mutations passed down through multi-generational family pedigrees has produced estimates that are considered more accurate and robust than estimates based on phylogenetic relationships (e.g. Bergeron et al. 2023; Suárez-Menéndez et al. 2023). While this method remains subject to some of the same sources of error (e.g. sampling bias and postmortem DNA damage, genotyping error), pedigree-based mutation rate estimates allow for direct quantification of inherited mutations without the need for the estimation of parameters such as divergence date or generation time. Although the pedigree approach is a more direct way of estimating mutation rate, for most taxa - particularly elusive marine species - parent and offspring genomes are not available for direct comparison. Further, while using this method with a large sample size can produce an accurate estimate of generational mutation rate, translating that to an annual mutation rate estimate would require knowledge of the age of parents, which is often difficult or impossible to obtain. Instead, researchers frequently make estimates of mutation rate based on the count of substitutions between two species relative to their divergence time. Pedigree-based estimates will reflect contemporary population demographics and the age of the parents, and may differ significantly from historical mutation rates; however, recent studies comparing the two methods indicate generally agreement when comparisons are drawn between generational mutation rates only (Campbell et al. 2021).
Here, we explore variability in phylogeny-based mutation rate within cetaceans, and describe the major drivers affecting that variability with the aim of providing insight and context for the estimation of an accurate and appropriate mutation rate in cetacean genomics studies. We examine process driven variability by estimating taxon- and chromosome-specific mutation rates, and comparing species-specific mutation rates with key demographic traits considered to affect mutation rate. We also examine potential drivers of observation error by varying key parameters in the pipeline used to align genomes and estimate a mutation rate. These input parameters include the choice of reference genome used for alignment, minimum and maximum read depth threshold for genotyping a locus, inclusion of coding or non-coding regions in the mutation rate estimate, divergence date estimate used in the mutation rate calculation, and whether or not ancestral heterozygosity is included in the mutation rate calculation.
We use publicly available whole genome sequence (WGS) data from 14 cetacean species representing the breadth of cetacean phylogenetic diversity. Shotgun WGS or Dovetail Hi-C chromatin-linked short-read data were obtained from NCBI Genbank using SRAToolkit (v. 3.0.7, https://github.com/ncbi/sra-tools), or from the European Nucleotide Archive (Table 1). Sequence files were quality filtered prior to formatted assemblyread mapping using BBduk in BBtools (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/REF).
In order to test for the effect of reference genome on the resulting mutation rate, we selected four reference genomes to represent a deeply diverged taxon, a recently diverged taxon, and an intermediate taxon (Table 2). The deeply diverged taxon is represented by the hippopotamus (Hippopotamus amphibius), and the intermediate taxon is represented by the sperm whale (Physeter macrocephaluS). We selected two species to represent the recently diverged taxon: the killer whale (Orcinus orca) is used for all odontocetes, and the North Atlantic right whale (Eubalaena glacialis) is used for all mysticetes.
We downloaded all reference genomes as fastq files from NCBI Genbank, and created index files using bwa (REF), bwa-mem2, and samtools (REFS). We mapped repeats with RepeatMasker (version and ref) and bedtools (REF). All steps described here, making up the pre-alignment pipeline, are outlined at: https://github.com/UW-WADE-lab/Whole-Genome-Alignment
Species | Abbreviation | Database | Accession | Recent reference | Generation Time | Length (m) | Lifespan |
---|---|---|---|---|---|---|---|
killer whale | Oorc | NCBI | SRR574970, SRR574978, SRR574980, SRR574981 | N/A | 25.7 | 5.55 | 90 |
North Atlantic right whale | Egla | NCBI | SRR11097130 | N/A | 35.7 | 15.5 | 67 |
Amazon River dolphin | Igeo | NCBI | SRR11430609 | N/A | 10.2 | 1.998 | 37.93 |
pygmy sperm whale | Kbre | NCBI | SRR11430611 | N/A | 12.1 | 3.125 | 32.43 |
Blainville’s beaked whale | Mden | NCBI | SRR13167975 | N/A | 23.4 | 4.11 | 60 |
vaquita | Psin | NCBI | SRR15435906 | N/A | 11.4 | 1.37 | 21 |
northern bottlenose whale | Hamp | ENA | ERR10908646 | N/A | 17.8 | 9.25 | 37 |
Rice’s whale | Bric | NCBI | SRR10331559 | Egla | 18.4 | 12 | 71.56 |
blue whale | Bmus | NCBI | SRR5665644 | Egla | 30.8 | 25.75 | 131.7 |
gray whale | Erob | NCBI | SRR12437599 | Egla | 22.9 | 12.5 | 77 |
Melon-headed whale | Pele | NCBI | SRR11097149 | Oorc | 19.6 | 2.75 | 47 |
minke whale | Bacu | ENA | ERR11040176 | Egla | 22.1 | 8.65 | 50 |
striped dolphin | Scoe | ENA | ERR11042965 | Oorc | 22.5 | 1.8 | 57.5 |
short-beaked common dolphin | Ddel | ENA | ERR11040185 | Oorc | 14.8 | 1.97 | 30 |
Species | Abbreviation | Database | Accession | Contig.N50 | Scaffold.N50 | Coverage |
---|---|---|---|---|---|---|
sperm whale | Pmac | NCBI | GCF_002837175.3 | 42.9 kB | 122.2 Mb | 248x |
hippotamus | Hamb | NCBI | GCA_030028045.1 | 50.3 Mb | 149.2 Mb | 33.0x |
killer whale | Oorc | NCBI | GCA_937001465.1 | 45.6 Mb | 114.2 Mb | 34.0x |
North Atlantic right whale | Egla | NCBI | GCA_028564815.2 | 37.5 Mb | 129.6 Mb | 32.0x |
All 14 genome read datasets were first aligned to the intermediate reference genome, the sperm whale. The alignment pipeline, regardless of sequencing approach, comprised six steps: 1) align sample genome to reference, 2) sort aligned genome, 3) remove duplicate sequences, 4) mask repeat regions, 5) index genome, 6) call genotypes and filter variants by read depth. All alignment scripts can be found at https://github.com/UW-WADE-lab/Whole-Genome-Alignment.
Step 1 differed for Illumina short-read and Arima Hi-C (Arima Genomics, Inc) sample data. We aligned forward and reverse reads from Illumina short-read data using bwa (REF). Arima Hi-C genomes, which comprise longer reads that contain chimeric reads from chromatin cross-linking prior to DNA extraction (Lieberman-Aiden et al. 2009), require an additional filtering step post-alignment. Therefore, we aligned Arima Hi-C forward and reverse reads separately using bwa-mem2, and filtered forward and reverse read files separately using custom perl scripts generated by Arima to identify and remove chimeras, then combined forward and reverse reads (https://github.com/ArimaGenomics/mapping_pipeline).
Steps 2-6 were identical for both types of sample genome. We sorted genomes using samtools (version, REF) and removed duplicates using Picard (version, REF). We removed previously identified repeat regions with bedtools (version, REF). We indexed the sample genome using samtools, called genotypes using bcftools (version, REF), and filtered variants to remove those with a read depth less than 10 or greater than 500, also using bcftools. Note that this read depth filter is less conservative than is normally used in an alignment pipeline; this allowed us to directly examine the effect of read depth on mutation rate later in the study.
We selected seven of the 14 sample genomes for a comparison of mutation rate estimates among reference genomes. These samples were additionally aligned to the hippopotamus and either the killer whale (odontocetes) or the North Atlantic right whale (mysticetes) using the pipeline described above (Table 1).
Mutation rate (\(\mu\)) can be estimated as genomic divergence (i.e. genomic heterozygosity, \(d_{xy}\)) over divergence time (\(t\)). Following Robinson et al. (2022), we use the following equation:
\[\begin{equation} \mu=d_{xy}/2t \end{equation}\]
Genomic divergence (\(d_{xy}\)), in turn, is estimated as:
\[\begin{equation} d_{xy}=(h_1+2h_2)/2g \end{equation}\]
where \(h_1\) is the number of sites in which the sample genome is heterozygous, \(h_2\) is the number of sites in which the sample genome is homozygous for the alternate allele, and \(g\) is the number of sites in which both the sample and reference genomes had called genotypes.
This calculation may result in an overestimate of mutation rate because ancestral (pre-divergence) heterozygosity is not taken into account, which we can account for in the following:
\[\begin{equation} \mu=(d_{xy}-\pi_{anc})/2t \end{equation}\]
The above equations produce a per-year mutation rate estimate. Mutation rates are often reported per generation in the literature, so to consider the effect of including generation time in mutation rate estimates we converted our per-year mutation rates to an estimate of mutation rate per generation. For most species, generation time was available from Taylor et al. (2007, Table 1).
All data analyses were completed in R (R Core Team 2016) using Rstudio (REF). We subset our mutation rate estimates according to each test described below. Unless otherwise noted below, for consistency among analyses we ensured that all analysis datasets included genomes from all 14 species considered in the study, using estimates only from reads aligned to the sperm whale reference genome, divergence date estimates from McGowen et al. (2020), and without correcting for ancestral heterozygosity.
We first consider the effect of the following sources of potential variability in the genome alignment pipeline on downstream estimates of mutation rate:
We also consider the effect of variability in two estimated parameters that are used in the calculation of mutation rate:
Reference species: As described above, we aligned seven of the fourteen species to a deeply diverged reference (hippopotamus) and a recently diverged reference (North Atlantic right whale or killer whale, see Tables 1 and 2) in addition to the sperm whale, which represents the intermediately diverged reference. We then estimated a mutation rate (\(\mu\)) for each of these alignments following the approach outlined above, resulting in a total of 21 estimates of mutation rate (3 per species). We used a Shapiro-Wilk normality test (Royston 1982a, 1982b) to determine that this dataset of 21 mutation rates is normally distributed (p = 0.32). We then grouped whole genome mutation rates from these seven species into four categories by reference species and three categories by divergence time: deep divergence, intermediate divergence, and recent divergence, and used an ANOVA (Hollander, Wolfe, and Chicken 2015) analysis to determine whether mutation rate varies significantly among groups in either stratification scheme.
Chromosome position: We indexed the genomic position of each chromosome using a bed file generated from the annotated sperm whale reference genome, extracted SNP data on a per-chromosome basis, and estimated a mutation rate (\(\mu\)) for each chromosome of each genome included in the study. We used a Shapiro-Wilk normality test to determine that this mutation rate dataset is not normally distributed (p = 10^{-7}). We therefore used a non-parametric Kruskal-Wallis analysis (REF) to determine whether mutation rate varies significantly among chromosomes across all 14 species included in the study, and a post hoc Dunn test (REF) and Bonferroni correction (REF) to identify chromosomes driving a signal of differentiation.
Coding vs non-coding regions: We indexed the genomic position of coding and non-coding regions using a bed file generated the annotated sperm whale reference genome, extracted SNP data in each region, and estimated mutation rates (\(\mu\)) in coding and non-coding regions of each genome included in the study. We used a Shapiro-Wilk normality test to determine that this mutation rate dataset is normally distributed (P VALUE HERE). We then used an ANOVA to compare mutation rate estimates between these two regions across all fourteen species included in the study.
Genotype read-depth: We first generated new genotype vcf files with calls for all loci, including both heterozygous and homozygous loci. Using these vcf files and considering read depths of 10-500 reads, we then used a custom R script to determine the cumulative proportion of heterozygous sites with increasing maximum read depth. Following this, we calculated mutation rate (\(\mu\)) for each species at maximum read depth cutoffs from 10-500. Finally, we use a generalized linear regression to test whether read depth is a significant driver of mutation rate estimates. Because phylogenetic inertia may be a significant factor affecting variability in mutation rate among species, we used a phylogenetic generalized least squares (PGLS) modeling approach implemented using the pgls package in R (REF), and chose the best model and significant parameters based on reported AIC and p values, respectively.
Divergence date estimates: We generated estimates of mutation rate (\(\mu\)) using fossil-dated divergence times (\(t\)) from McGowen et al. (2020) and Lloyd and Slater (2021), as well as an average across all published divergence times from TimeTree [timetree.org; Kumar et al. (2017); Supplemental Table S1]. We then re-estimated whole genome mutation rate using each of the species-specific divergence dates. We used a Shapiro-Wilk normality test to determine that this mutation rate dataset is normally distributed (p = 0.23) and used an ANOVA to compare mutation rate differentiation across the three strata.
Ancestral heterozygosity: Because the ancestral heterozygosity (\(\pi_{anc}\)) of a species is unknown and unknowable using currently available technology, we generated mutation rate estimates under five different states of \(\pi_{anc}\). First, we estimate mutation rate (\(\mu\)) without correcting for \(\pi_{anc}\), and by correcting for \(\pi_{anc}\) under the assumption that the contemporary genomic heterozygosity of each species is equivalent to its ancestral heterozygosity, i.e., using a whole-genome estimate of heterozygosity from each species to correct for \(\pi_{anc}\). We further test the sensitivity of our estimates of \(\mu\) to a range of values of \(\pi_{anc}\) by re-estimating \(\mu\) for every species using the maximum and minimum reported values of \(\pi_{anc}\) within cetaceans from Morin et al. (in prep).We used a Shapiro-Wilk normality test to determine that this mutation rate dataset is not normally distributed (p = 4.6^{-6}) and used a non-parametric Kruskal-Wallis test with an ad hoc Dunn test and Bonferroni correction to compare mutation rate differentiation across the five different \(\pi_{anc}\) estimates.
To examine the effect of phylogenetic inertia on mutation rate, we test for significant differentiation among taxonomic families in whole-genome mutation rate per year (mutations/site/year) as well as whole-genome mutation rate per generation (mutations/site/generation). We also test for rate differentiation among three families for which we have more than one species: Balaenopteridae, Delphinidae, and Ziphiidae.
Finally, we also consider potential effects of lifespan and body size on whole-genome mutation rate using data on species-specific maximum lifespan and maximum body length. Most of these data were acquired from Groot et al. (2023), and missing data were filled in from published literature (Table 1, Supplemental Table S1). We first tested for correlation between lifespan, maximum body length, and generation time and found them to be correlated in our dataset (VALUE HERE). Next, we fit generalized linear regressions considering all possible combinations of these three demographic parameters using the PGLS modeling approach described above (genotype read-depth section).
For our examination of both phylogenetic and demographic effects on mutation rate, we additionally compare the patterns observed using both annual and generational estimates of mutation rate, which are both commonly reported in marine mammal genomic literature (REFS here).
For each combination of species and reference genome, we generated 18 estimates of whole-genome mutation rate by varying estimates of divergence date, ancestral heterozygosity, and annual vs. generational mutation rate estimates. Using the sperm whale reference genome, divergence date from McGowen et al. (2020), and without correcting for ancestral heterozygosity, we also generated an estimate of mutation rate at each chromosome (n = 20 per species), in coding and non-coding regions (n = 2 per species), and at maximum read depths of 100-500 (n = XX per species).
Whole genome mutation rate varies among individual species included in the study, and is overall lower than previously published estimates for all odontocetes or XXXX (Figure 1).
The reference genome used had the greatest overall magnitude of effect among all factors tested in this study (Figure 2, top). Reference genomes for this analysis were chosen to reflect three periods of evolutionary divergence: deep divergence, intermediate divergence, and recent divergence. Mutation rate per year was highest in most species when compared to the deep divergence reference genome. Alignment to the intermediate divergence reference genome resulted in the lowest mutation rate estimates across all species in the study, and alignment to the recent divergence reference genome resulted in mutation rate estimates that were between the deep and intermediate reference genomes in most cases.
Yearly mutation rate estimates did not differ significantly across mysticetes and odontocetes when aligned to the deep reference genomes, but did vary significantly when aligned to the intermediate and recently diverged genomes, with higher annual mutation rates in odontocetes than in mysticetes (Figure 2, bottom).
Mutation rates varied significantly among chromosomes across all species (Kruskal-Wallis p = 9^{-28}). The mean range in mutation rates across chromosomes within each species (10^{-10}) was higher than the range in mutation rates across all species included in the study (5.2^{-11}. Among the 20 autosomal chromosomes and Chromosome X, mutation rate in Chromosomes 1, 4, 9, 16, and X was significantly lower than at least 1/3 of all chromosomes, and mutation rate in Chromosome 13 was significantly higer than at least 1/3 of all chromosomes (Figure 3). We also found that patterns in mutation rate variability among chromosomes was broadly consistent across species - i.e. in general, all of a species’ chromosome mutation rates were generally on the high or low end of the spectrum, with minor variability in this pattern among species (Figure 3).
Annual mutation rate estimates were significantly different in coding vs. noncoding regions (ANOVA p=1.75^{-6}, F = 37.61, Figure 4). Rates were higher in coding regions than non-coding regions across all species included in the study, increasing by an average of 3.6^{-11} mutations/site/year (min = 2.9^{-11}, max = 4.2^{-11}).
Across all species, mutation rate estimates increase rapidly with increasing maximum read depth threshold to an asymptotic maximum mutation rate estimate (Figure 5). Most species reached their asymptotic mutation rate at maximum read depth thresholds between 50 and 100 reads.
Varying divergence date estimates also had a significant effect on annual mutation rate estimates, although the magnitude of the effect is smaller than the effect of reference genome or coding vs. non-coding regions (ANOVA p=0.01, F = 4.82, Figure 6).
Varying ancestral heterozygosity also had a significant effect on annual mutation rate estimates (Kruskal-Wallis p=2.3^{-8}, Figure 7). This was driven by the group of mutation rates that were corrected for the maximum reported cetacean heterozygosity in Morin et al. (REF), which were significantly different from all other correction methods. We did not find any other significant differences across all other pairwise comparisons of correction methods.
We can compare effect sizes (\(\eta^2\)) among all pipeline tests (Table 3) to determine which parameters have the greatest effect on mutation rate estimates. Variability in reference species led to the greatest effect size, and variability in phylogenetic distance to reference made up the majority of that effect. This was followed by variability among genomic locations (chromosomes) and coding vs. non-coding regions. Varying divergence time and the method for handling ancestral heterozygosity each had a comparatively modest effect on mutation rates, except when using the maximum value for ancestral heterozygosity reported by Morin et al. (REF).
Pipeline effect | \(\eta^2\) |
---|---|
reference distance | 0.84 |
reference species | 0.94 |
chromosome position | 0.59 |
coding/noncoding | 0.59 |
read depth | unknown |
divergence date | 0.2 |
\(\pi_{anc}\) | 0.57 |
\(\pi_{anc}\) (ex max \(\pi_{anc}\)) | 0.19 |
We first examined the effect of phylogenetic inertia on mutation rate, and found that annual mutation rate was not significantly different between the two cetacean infraorders (mysticetes, odontocetes) when including the Amazon river dolphin, which has an anomalously low mutation rate (see Figure 1), within the odontocete infraorder (ANOVA p = 0.079, F = 3.67, Figure 8a). However, when excluding the Amazon river dolphin from the odontocete infraorder we find that annual mutation rate is significantly different between infraoders (ANOVA p = 3.7e-05, F = 43.89, Figure 8a inset) and accounts for approximately 80% of annual mutation rate variability, with mutation rates that were significantly lower in mysticetes than in odontocetes. We also found significant differences in annual mutation rate among the three families with 2 or more species included in this study: Balaenopteridae, Delphinidae, and Ziphiidae (ANOVA p = 1.7^{-5}, F = 77.63).
We then re-analyzed mutation rate between infraorders and among families using generational mutation rates, and found no significant differentiation in either stratification (infraorder ANOVA p = 0.098, subset infraorder ANOVA p = 0.15, family ANOVA p = 0.94, Figure 8c+d). Further, the observed pattern in mutation rate differentiation between infraorders is flipped, with a higher mean generational mutation rate in the mysticetes than in the odontocetes, both with and without the Amazon river dolphin (Figure 8c).
None of lifespan, asymptotic length, or generation time were found to significantly correlate with annual mutation rate estimates (p > 0.05 for all terms) when using a generalized linear model approach that accounts for phylogenetic inertia. However, when considering generational mutation rate for lifespan and asymptotic length only we found that the AIC-selected model included a significant positive correlation with lifespan (p = 10^{-4}, Figure 9), such that generational mutation rate increases with increasing lifespan. Models including asymptotic length had higher AIC values than the model with lifespan as the only parameter; it follows that length was not a significant predictor of mutation rate when considered as the only model parameter or in combination with lifespan.
The results of our analyses of phylogenetic, demographic, and pipeline effects on mutation rate estimates indicate both that species-specific mutation rate has varied over the evolutionary history of cetaceans, and that the way mutation rate is estimated will yield significantly different results, both in specific mutation rate estimates as well as in observed relational patterns among species. Perhaps most notably, we observed phylogenetic patterns in annual mutation rate supporting previous assertions that mysticetes have lower annual mutation rates than odontocetes (Jackson et al. 2009). While the same phylogenetic patterns were not observed using generational mutation rates, we instead found support for higher generational mutation rates in longer-lived species. Our results provide evidence that there may be evolutionary selection for lower annual mutation rates within the Mysticeti, which trend toward longer lifespans, supporting recent pedigree-based estimates of generational mutation rates in mysticetes (Suárez-Menéndez et al. 2023). Our data indicate that the phylogenetic trend toward lower annual mutation rates in mysticetes does not fully counteract the effect of longer lifespans in most species, i.e. generational mutation rates were higher in mysticetes and other longer-lived cetaceans.
Another notable pattern is the large difference in mutation rate estimates driven by the phylogenetic distance to the reference species. Our results indicate that the mutation rate has not varied linearly over evolutionary time; rather, we found that the lowest mutation rates were estimated in reference to intermediate divergence times, with higher rate in both recent and deep divergence times (Figure 3A). This could be caused by an excess of mapping errors in the hippopotamus reference genome. However, scaffold N50 indicates similar quality in all reference genomes used in the study; further, if this observed pattern were purely an effect of decreased mapping accuracy to references of increasing divergence, we would expect that estimated mutation rate increases linearly with increasing divergence. Based on this evidence we hypothesize a period of high mutation rates deep within the evolutionary history of cetaceans and hippos, followed by a period of low mutation rates. Although cetacean mutation rate seems to have increased in recent evolutionary history, it remains lower than rates based on deeper divergence within the cetacean phylogeny. Further, mutation rate estimates integrated over deep evolutionary time were not significantly different between the two cetacean infraorders, but a pattern of differentiation emerges when considering intermediate divergence time and increases in magnitude in recent divergence time (Figure 2B).
In addition to the evolutionary and demographic effects we observed on mutation rate, we found that many pipeline parameters can affect mutation rate estimates. In general, altering pipeline parameters tended to shift all estimates of mutation rate in the same direction, with minimal observed shifts in the relative position of mutation rates among species.
Among pipeline effects on mutation rate, the choice of reference genome had the greatest effect on species-specific estimates of whole genome mutation rate, and the distance of the reference genome made up a large portion of this variability. Variability caused by reference genome was followed in effect size by the difference in mutation rate caused by corrections for ancestral heterozygosity and in coding vs. non-coding regions. Variation in divergence date estimates had comparatively minimal effect on mutation rate variability, which indicates a high level of precision and convergence among our current estimates of divergence dates among cetacean species.
The reference-driven differences in mutation rate observed in this study may be caused by several factors. For example, it is possible that the sperm whale evolutionary lineage has an overall lower mutation rate than other evolutionary lineages used as reference in this study. It is also possible that mutation rate accelerates and slows over time, and that mutation rates were higher in deep evolutionary time, slowed in mid-evolutionary time, and are marginally higher in shallow or contemporary evolutionary time. By comparing this pattern in mysticetes and odontocetes, we observed that the annual mutation rates of these two evolutionary lineages was not significantly different over deep evolutionary time, but differed significantly over intermediate and recent evolutionary time (Figure 2), suggesting that the annual mutation rates of each of these families may have embarked on differing evolutionary trajectories once the two families split from each other, resulting in the significantly lower annual mutation rates observed in contemporary mysticetes than in contemporary odontocetes (Figure 8a).
The degree of variability driven by choice of reference genome - up to an order of magnitude in some cetacean species - brings with it the question of how to choose the correct reference genome for alignment of new genomic data. Because mutation rate estimates represent an average over the evolutionary time period since divergence from the reference species it may be best to choose a reference genome that reflects the time period of the question. For example, population genomics or other studies concerned with recent evolutionary time might choose a reference genome from a recently-diverged species, and estimate mutation rate based on the reference species for use in downstream analyses, e.g. PSMC or BEAST phylogenies.
Although not directly targeted by this study, we found that mutation rates for the same species, estimated using the same method, differ based on whether they are estimated annually vs. generationally, with potentially profound effects on downstream interpretation of evolutionary relationships or conservation applications. Recent disagreement about whether mutation rates in mysticetes are larger or smaller than odontocetes (e.g., Jackson et al. 2009; Suárez-Menéndez et al. 2023) may be at least partially explained by differences in the use of annual vs. generational mutation rate, and future studies will need to carefully consider the implications of each of these for the interpretation of their results.
Compared to whole-genome mutation rate estimates among species and reference genomes, estimates of mutation rates among chromosomes varied less, but still significantly. Chromosome X exhibited the highest variability, with a much lower mutation rate than other chromosomes. This is likely driven at least in part by its shorter residence time in males, which contribute more mutations to their offspring than females (Giannelli and Green 2000; Nachman and Crowell 2000). Similarly, mutation rate estimates varied significantly between coding and non-coding regions. These results, taken together, suggest that mutation rate varies considerably throughout the genome, although potentially less-so that the variability in mutation rate over evolutionary time.
Correcting for species-specific ancestral heterozygosity has a linearly predictable effect on mutation rate, because \(\pi_{anc}\) is simply subtracted from estimated genomic divergence before dividing by divergence time. We expect that subtracting any value from genomic divergence will cause a significant effect on the estimated mutation rate; the magnitude of the effect is determined by the size of \(\pi_{anc}\). Our dataset confirmed this expectation. Further, we found that correcting for species-specific genomic heterozygosity was comparable with mean and minimum contemporary cetacean genomic heterozygosity; correcting for heterozygosity in this way has a minimal, but significant, effect on mutation rate estimates. However, our estimates of contemporary cetacean genomic heterozygosity ranged by over an order of magnitude (min = 1.1^{-4}, max = 0.005037), and correcting for ancestral heterozygosity using larger values had a much larger effect on mutation rate estimates. Because species-specific ancestral heterozygosity is currently unknowable, it is important to carefully consider whether and how to account for ancestral heterozygosity when estimating species-specific mutation rates, and to clearly describe this correction in the methods of genomic studies.
Our study includes one species - the blue whale - for which pedigree-based mutation rate has also been estimated (Suárez-Menéndez et al. 2023). Interestingly, the estimates of mutation rate most similar to pedigree-based estimates were obtained when sequences were aligned to the deeply-diverged hippopotamus (\(\mu_{ped}\) = 1.24e-8, CI: 0.85-1.63e-8; \(\mu_{phy}\) range: 1.31-1.47e-8). All estimates of generational whole genome mutation rate in blue whales, regardless of divergence date or ancestral heterozygosity, were within the 95% confidence interval of the estimate from Suarez-Menendez et al. (2023). Mutation rate estimates were lower when sequences were aligned to the intermediate and recently diverged reference species.
More here, REFS from Suarez-Menendez et al and Bergeron et al, Sung et al.).
Species-specific mutation rates evolve over time; cetacean mutation rates are affected by phylogeny and lifespan. Taxa that trend toward longer-lived species have lower annual mutation rates but similar generational mutation rates, highlighting the importance of using the same type of mutation rates when drawing comparison across species. Because mutaton rate varies over time within an evolutionary lineage, the choice of reference species will have a large affect on estimates of heterozygosity and mutation rate; therefore, reference species should be carefully selected to reflect the timespan of the research question. While divergence date and ancestral heterozygosity estimates had smaller effects on mutation rate, we found evidence that the mutation rate varies along the genome; therefore the genomic regions included in a study - e.g. coding vs. non-coding regions or choice of chromosomes - will also affect our estimates of heterozygosity and mutation rate. In most population genetic studies, it is recommendable to include a large number of sites across the genome to better represent the genomic average. In targeted studies, it may be more effective to estimate heterozygosity or mutation rate across a carefully selected subset of the genome. Given the phylogenetic, demographic, genomic, and pipeline effects on variability in mutation rate demonstrated in this study, it is vital that genomic studies carefully consider how to process sequence data within the context of the targeted research question, and give consideration to the potential effects of pipeline parameters on the interpretation of their results.