Mutation rate evolution in cetaceans: understanding drivers of variability and observation bias in estimates of genomic mutation rate

School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA
Southwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, La Jolla, CA

^*Corresponding author email: phillip.morin@noaa.gov

Running page head: Genomic mutation rate in cetaceans

Abstract

Genomic mutation is the fundamental process driving evolutionary processes that result in contemporary species, subspecies, and locally adapted populations. Thus, genomic studies of divergence, speciation, and evolutionary demographics rely on accurate estimates of mutation rate. Historically, mutation rates have been estimated using a small portion of the genome and/or averaged among multiple species within a family or order, but growing evidence indicates potentially high variability in mutation rate both across the genome and among species. Here, we examine variability in mutation rate within cetaceans based on whole genome sequence data mapped to reference-quality genomes, and from a range of pipeline parameters in order to determine which factors have the greatest effect on variability in substitution rate or its estimation. Our results indicate that mutation rate has varied over evolutionary time and across the genome in cetaceans. Notably, we show that generational mutation rate increases with maximum lifespan and annual mutation rate increases with maximum lifespan and decreases with generation time. Mutation rate also decreases with chromosome size. Similarly, we found significant effects of pipeline parameter choice on mutation rate, which is most affected by the choice of reference genome used in the alignment pipeline and choice of genomic region (i.e. chromosome). These results indicate the importance of thoughtful selection of mutation rate used in cetacean genomic studies, in order to ensure accurate downstream analyses of evolutionary processes.

Introduction

As the fundamental process underlying evolution, mutation drives processes such as evolutionary adaptation and divergence among species. The rate at which mutations accrue will ultimately determine the rate at which these processes occur; conversely, reconstruction of evolutionary processes as well as genomic estimates of historical population demography have the potential to be biased by inaccurate estimates of mutation rate^{e.g. 1}. As reliance on genomics increases for the conservation and management of threatened populations², it is increasingly important that we comprehensively understand how mutation rate varies within taxa and with observation bias.

Broad scale trends in mutation rate variability across taxa have been correlated with phylogenetic inertia as well as various demographic factors^{e.g. 3,4–7}. Three hypotheses have been proposed as potential drivers of mutation rate variability among species. (1) The longevity hypothesis posits that mutation rate decreases with increased lifespan, allowing individuals to live for longer periods of time without adverse effects of the accumulation of harmful mutations over their lifetimes⁸. (2) The generation time hypothesis posits that mutation rate decreases with respect to generation time likely due to mutation accumulation in gamete DNA^9,10. (3) The metabolic rate hypothesis posits that mutation rate increases with increasing metabolic rate (often proxied by body length) due to the oxidative effects of metabolism on the cell¹¹.

Mutation rate has been shown to vary over evolutionary timescales⁴. At a finer scale, mutation rate varies across the genome (e.g. among chromosomes and coding vs. non-coding regions), and site-specific variability is only partially context-dependent^12,13. In addition to these factors, there are multiple decision points along the bioinformatic pipeline from sequence data to aligned genome that may affect the accurate tally of genomic mutations and downstream estimation of mutation rate. However, there is limited research on how the selection of evolutionary timescale, genomic region, or alignment pipeline parameters affect estimates of mutation rate.

Mutation rate estimates form the basis of many conservation genomic and evolutionary genomic studies; for example, as a parameter included in estimates of divergence among species, historical population size, or patterns of selective divergence as they relate to neutral variation. Biases in this estimate may be reflected in many downstream analyses that shape our understanding of evolutionary biology; they may also affect conservation priorities and management actions. In one example, a recent analysis of baleen whale mutation rates based on wild pedigrees resulted in higher estimates for these species than when estimated using phylogenetic divergence; using the new mitochondrial mutation rate reduces estimates of pre-exploitation North Atlantic humpback whale abundance by 86%¹⁴ and considerably shifts our understanding of the recovery of that population. Possibly owing in part to the diversity of factors affecting mutation rate, some studies suggest a lack of direct relationship between mtDNA genetic diversity and population size³.

The increasing accessibility of genomic sequencing technologies and high-quality reference genomes for marine mammals^15,16 has precipitated a rise in the use of genomic techniques to address previously unanswered questions about the ecology and conservation of these species^{e.g. 17,18–20}. Because these species are rare, remote, elusive, and protected, it is often difficult to obtain the sample size needed to accurately detect and describe ecological processes and patterns. In cases such as this, whole genome analyses increase sensitivity and power to describe genetic patterns using a smaller sample set, thus making it a potentially powerful tool for studies of evolutionary ecology in marine mammals. These genomic studies in marine mammals, e.g. descriptions of evolutionary phylogenies or historical demographics of species, currently rely on estimates of mutation rate that are averaged across a family or even suborder of species and often based on only a small portion of the genome^{e.g. 21,22}.

Here, we explore variability in phylogeny-based estimation of genomic mutation rate within cetaceans, and describe the major drivers affecting that variability with the aim of providing insight and context for the estimation of an accurate and appropriate mutation rate across cetacean genomics studies. Note that some harmful mutations will be filtered out of a population over time and therefore not accounted for in phylogeny-based estimates of mutation rate. We therefore estimate the substitution rate; i.e. the rate at which new mutations accumulate over time, which is considered equal to mutation rate in cases of neutral evolution and may be lower than mutation rate in cases of positive or stabilizing selection. In the context of phylogeny-based rates, we use mutation rate and substitution rate interchangeably.

To understand how evolutionary processes and observation bias affect contemporary estimates of substitution rate within cetaceans, we examine several factors related to each. First, we examine process driven variability by estimating taxon- and chromosome-specific substitution rates, and comparing species-specific substitution rates with key demographic traits considered to affect substitution rate (i.e., longevity, generation time, and body length). We also examine potential drivers of observation error by varying key parameters in the pipeline used to align genomes and estimate a substitution rate. These input parameters include the choice of reference genome used for alignment, maximum read depth cutoff for genotyping a locus, inclusion of coding or non-coding regions in the mutation rate estimate, divergence date estimate used in the mutation rate calculation, and whether or not ancestral heterozygosity is included in the mutation rate calculation.

Methods

Sample and reference genome preparation

We use publicly available whole genome sequence (WGS) data from 19 cetacean species representing the breadth of cetacean phylogenetic diversity. Shotgun WGS or Arima Hi-C (Arima Genomics, Inc) chromatin-linked data were obtained from NCBI Genbank using SRAToolkit (v. 3.0.7, https://github.com/ncbi/sra-tools), or from the European Nucleotide Archive (Table 1).

In order to test for the effect of reference genome on the resulting substitution rate, we selected four reference genomes to represent three divergence times: a deeply diverged taxon, a recently diverged taxon, and an intermediate taxon (Table 2). The deeply diverged taxon is represented by the hippopotamus (Hippopotamus amphibius, divergence from cetaceans in the current study approx. 53 MYA²³), and the intermediate taxon is represented by the pygmy sperm whale (Kogia breviceps, divergence from odontocetes and mysticetes in the current study ranging between 34 and 37 MYA²³). We selected two species to represent the recently diverged taxon: the killer whale (Orcinus orca, divergence from other odontocetes in this study ranging between 11 and 35 MYA²³) is used for all odontocetes, and the North Atlantic right whale (Eubalaena glacialis, divergence from other mysticetes in this study ranging between 23 and 27 MYA²³) is used for all mysticetes. Exact divergence dates for each species-reference combination are listed in Supplemental Table S1.

We downloaded all reference genomes as fastq files from NCBI Genbank, and created index files using bwa-mem2²⁴ and samtools²⁵. We masked repeats with RepeatMasker (version 4.1.4) and bedtools²⁶. All steps described here, making up the pre-alignment pipeline, are outlined at: https://github.com/UW-WADE-lab/Whole-Genome-Alignment

Table 1. Species abbreviation codes, source database, and accession number for each data file included in the present study, as well as estimates of generation time, asymptotic body length, and maximum lifespan for each species (references in Supplemental Table S1), and which reference species was used for recent divergence, where applicable. NCBI: National Center for Biotechnology Information. ENA: European Nucleotide Archive. Sequence type: WGS, whole genome shotgun sequence; HiC, chromatin-linked sequence.
Species	Abbreviation	Data.source	Accession	Data type	close.reference	Average Generation Time	Maximum Length (m)	Maximum Lifespan
minke whale	Bacu	ENA	ERR11040176	HiC	Egla	22.1	8.65	50
common dolphin	Ddel	ENA	ERR11040185	HiC	Oorc	14.8	1.97	30
Risso’s dolphin	Ggri	ENA	ERR13132929	HiC	Oorc	19.6	3.2	42.5
long-finned pilot whale	Gmel	ENA	ERR11837528	HiC	Oorc	24	3.65	41.08
northern bottlenose whale	Hamp	ENA	ERR10908646	HiC	Oorc	17.8	9.25	37
Pacific white-sided dolphin	Lalb	ENA	ERR11042964	HiC	Oorc	18.1	1.74	30
false killer whale	Pcra	NCBI	SRR33006006	WGS	Oorc	25	3.4	62.5
striped dolphin	Scoe	ENA	ERR11042965	HiC	Oorc	22.5	1.8	57.5
blue whale	Bmus	NCBI	SRR5665644	WGS	Egla	30.8	25.75	131.7
Rice’s whale	Bric	NCBI	SRR10331559	WGS	Egla	18.4	12	71.56
North Atlantic right whale	Egla	NCBI	SRR11097130	WGS	NA	35.7	15.5	73
humpback whale	Mnov	NCBI	SRR17854487	WGS	Egla	21.5	15.5	93
gray whale	Erob	NCBI	SRR12437599	WGS	Egla	22.9	12.5	77
killer whale	Oorc	NCBI	SRR574970, SRR574978, SRR574980, SRR574981	WGS	NA	25.7	5.55	90
Amazon River dolphin	Igeo	NCBI	SRR11430609	WGS	Oorc	10.2	1.998	37.93
pygmy sperm whale	Kbre	NCBI	SRR11430611	WGS	Oorc	12.1	3.125	32.43
Blainville’s beaked whale	Mden	NCBI	SRR13167975	WGS	Oorc	23.4	4.11	60
melon-headed whale	Pele	NCBI	SRR11097149	WGS	Oorc	19.6	2.75	47
vaquita	Psin	NCBI	SRR15435906	WGS	Oorc	11.4	1.37	21

Table 2. Species abbreviation codes, genome source database, and accession number for each species used as a reference genome for alignment in this study.
Species	Abbreviation	Database	Accession	Contig.N50	Scaffold.N50	Coverage
pygmy sperm whale	Kbre	NCBI	GCA_026419965.1	42.8 Mb	122.4 Mb	23.0x
common hippopotamus	Hamb	NCBI	GCA_030028045.1	50.3 Mb	149.2 Mb	33.0x
killer whale	Oorc	NCBI	GCA_937001465.1	45.6 Mb	114.2 Mb	34.0x
North Atlantic right whale	Egla	NCBI	GCA_028564815.2	37.5 Mb	129.6 Mb	32.0x

Sample alignment

Genomic datasets from 19 cetacean species were first aligned to the pygmy sperm whale, which we chose to represent an “intermediate” divergence from both mysticetes and other odontocetes. We then repeated sequence alignment for each species using the hippopotamus as a deeply diverged reference, and either the killer whale (odontocetes) or the North Atlantic right whale (mysticetes) as a more recently diverged reference (Table 1). This resulted in a total of three sequence alignments for each of the 19 species included in the study. The pygmy sperm whale, North Atlantic right whale, and killer whale were each excluded from self-alignment, so that each of these three species have two alignments rather than three.

The alignment pipeline, regardless of sequencing approach, comprised six steps: 1) align sample sequence data to reference, 2) sort aligned genome, 3) remove duplicate sequences, 4) mask repeat regions, 5) index genome, 6) call genotypes and filter variants by read depth. All alignment scripts can be found at https://github.com/UW-WADE-lab/Whole-Genome-Alignment.

All sequence data were aligned using the Burrows-Wheeler Aligner implemented by bwa-mem2²⁴ in Step 1. Forward and reverse Illumina short-read WGS sequence data were aligned simultaneously. Arima Hi-C sequence data, which may contain chimeric reads due to chromatin cross-linking prior to DNA extraction²⁷, require an additional filtering step post-alignment. Therefore, we aligned Arima Hi-C forward and reverse reads separately, then filtered forward and reverse read files using custom perl scripts generated by Arima to identify and remove chimeras, before combining forward and reverse read alignments (https://github.com/ArimaGenomics/mapping_pipeline).

Steps 2-6 were identical for both types of sample genome. We sorted genome alignments using samtools²⁵ and removed duplicates using Picard (http://broadinstitute.github.io/picard). We removed previously identified repeat regions with bedtools²⁶. We indexed the sample genome using samtools, called genotypes using bcftools²⁸, and filtered variants to remove those with a read depth less than 10 or greater than 500, also using bcftools. Note that this read depth filter is less conservative than is normally used in an alignment pipeline^29,30; this allowed us to directly examine the effect of read depth on mutation rate later in the study.

Following alignment, we calculated average read depth and % reads mapping to reference for each sample and reference combination using angsd³¹.

Data analysis: calculating substitution rate

Substitution rate ($\mu$) can be estimated as genomic divergence (i.e. genomic heterozygosity, $d_{xy}$) over divergence time ($t$). Following Robinson et al. (2022), we use the following equation:

\[\begin{equation} \mu=d_{xy}/2t \end{equation}\]

Genomic divergence ($d_{xy}$), in turn, is estimated as:

\[\begin{equation} d_{xy}=(h_1+2h_2)/2g \end{equation}\]

where $h_1$ is the number of sites in which the sample genome is heterozygous, $h_2$ is the number of sites in which the sample genome is homozygous for the alternate allele, and $g$ is the number of sites in which both the sample and reference genomes had called genotypes.

This calculation may result in an overestimate of substitution rate because ancestral (pre-divergence) heterozygosity is not taken into account, which we can account for in the following:

\[\begin{equation} \mu=(d_{xy}-\pi_{anc})/2t \end{equation}\]

The above equations produce a per-year substitution rate estimate. Substitution rates are often reported per generation in the literature, so to consider the effect of including generation time in substitution rate estimates we converted our per-year substitution rates to an estimate of substitution rate per generation. For most species, generation time was available from Taylor et al.³². All generation times are reported in Table 1; the published source of generation time for all species is listed in Supplemental Table S1.

Data analysis

Computation of substitution rate, and all downstream data analyses, were completed in R³³ using Rstudio. We subset our substitution rate estimates according to each test described below. Unless otherwise noted below, for consistency among analyses we ensured that all datasets included genomes from all 19 species considered in the study, using estimates only from reads aligned to the pygmy sperm whale reference genome, divergence date estimates from McGowen et al.²³ (6-partition AR model), with a maximum read depth of 500, and without correcting for ancestral heterozygosity.

Data analysis: phylogenetic or demographic effects on substitution rate

To examine the effect of phylogenetic inertia on substitution rate, we test for significant differentiation among taxonomic families in whole-genome substitution rate per year (substitution/site/year) as well as whole-genome substitution rate per generation substitution/site/generation) using an ANOVA^34,35. We compared observed patterns using both annual and generational estimates of substitution rate because they are both commonly reported in marine mammal genomic literature^{e.g. 14,21}.

We also consider potential effects of lifespan and body size on annual whole-genome substitution rate using data on species-specific maximum lifespan and maximum body length. Most of these data were acquired from Groot et al.³⁶, and missing data were filled in from published literature (Table 1, Supplemental Table S1). We first tested for correlation between maximum lifespan and maximum body length using Spearman’s rank correlation coefficient³⁷, and found them to be correlated in our dataset ($\rho$ = 0.83). Next, we fit a generalized linear regression using both maximum lifespan and maximum body length as predictor variables. Because phylogenetic inertia may be a significant factor affecting variability in mutation rate among species, we used a phylogenetic generalized least squares (PGLS) modeling approach implemented using the phytools package in R³⁸, and chose the best model and significant parameters based on reported AIC and p values, respectively.

Data analysis: pipeline effects on substitution rate

We first consider the effect of the following sources of potential variability in the genome alignment pipeline on downstream estimates of substitution rate:

Choice of reference species for genomic alignment
Chromosome
Inclusion of coding vs. non-coding regions in estimate
Maximum read depth cutoff for called genotypes

We also consider the effect of variability in two estimated parameters that are used in the calculation of substitution rate:

Choice of divergence date estimates
Choice of whether and how to correct for ancestral heterozygosity

Choice of reference species: Using all alignments from deep, intermediate, and recently diverged reference species, we estimated a whole genome substitution rate ($\mu$) following the approach outlined above. We used a Shapiro-Wilk normality test^39,40 to determine that this dataset is normally distributed (p = 0.02). We then grouped whole genome substitution rates in two stratification schemes: four strata (by reference species) and three strata (by divergence time: deep divergence, intermediate divergence, and recent divergence), and used an ANOVA to determine whether substitution rate varies significantly among groups in either stratification scheme.

Chromosome: We indexed the genomic position of each chromosome using a bed file generated from the annotated pygmy sperm whale reference genome, extracted SNP data on a per-chromosome basis, and estimated a substitution rate ($\mu$) for each chromosome of each genome included in the study as described above. We used a Shapiro-Wilk normality test to determine that this substitution rate dataset is not normally distributed (p = 4.71^{-5}). We therefore used a non-parametric Kruskal-Wallis analysis⁴¹ to determine whether substitution rate varies significantly among chromosomes across 18 species included in the study (excluding pygmy sperm whale), and a post hoc Dunn test⁴² to identify chromosomes driving any signal of differentiation. Finally, we tested for a relationship between chromosome-level substitution rates across species and chromosome length using a linear regression.

Inclusion of coding vs non-coding regions: We indexed the genomic position of coding and non-coding regions using a bed file generated from the annotated pygmy sperm whale reference genome, extracted SNP data in each region, and estimated substitution rates ($\mu$) in coding and non-coding regions of each genome included in the study. We used a Shapiro-Wilk normality test to determine that this substitution rate dataset is not normally distributed (p = 0.03). We then used a Kruskal-Wallis test to compare substitution rate estimates between these two regions across 18 species, excluding the pygmy sperm whale.

Maximum read depth threshold: We first grouped all substitutions into bins between read depths of 10-500 reads (i.e. 10-20, 21-30, … 491-500) and calculated the cumulative number of substitutions at each depth bin. Following this, we calculated substitution rate ($\mu$) for each species at maximum read depth cutoffs from 10-500. Finally, we use a generalized additive model to test whether read depth is a significant driver of variability in substitution rate estimates and a Kruskal-Wallis test to estimate the size of the effect of read depth on substitution rate estimates.

Divergence date estimates: We generated estimates of substitution rate ($\mu$) using fossil-dated divergence times ($t$) from McGowen et al.²³ (6-partition AR model) and Lloyd and Slater⁴³ (SAFE metatree analysis), as well as an average across all published divergence times from TimeTree⁴⁴ (timetree.org, Supplemental Table S1). We then re-estimated whole genome substitution rate using each of the species-specific divergence dates. We used a Shapiro-Wilk normality test to determine that this substitution rate dataset is normally distributed (p = 0.01) and used an ANOVA to compare substitution rate differentiation across the three strata.

Ancestral heterozygosity: Because the ancestral heterozygosity ($\pi_{anc}$) of a species is unknown and unknowable using currently available technology, we generated substitution rate estimates under five different states of $\pi_{anc}$. First, we estimated substitution rate ($\mu$) without correcting for $\pi_{anc}$, and by correcting for $\pi_{anc}$ under the assumption that the contemporary genomic heterozygosity of each species is equivalent to its ancestral heterozygosity, i.e., using a whole-genome estimate of heterozygosity from each species from Morin et al. (2025) to correct for $\pi_{anc}$. We further tested the sensitivity of our estimates of $\mu$ to a range of values of $\pi_{anc}$ by re-estimating $\mu$ for every species using the maximum and minimum reported values of $\pi$ within cetaceans from Morin et al. (2025). We used a Shapiro-Wilk normality test to determine that this substitution rate dataset is not normally distributed (p = 0.0298066) and used a non-parametric Kruskal-Wallis test with a post hoc Dunn test and Bonferroni correction to compare substitution rate differentiation across the five different $\pi_{anc}$ estimates.

Results

Aligning sequence data from each of the 19 species in this study with up to three reference genomes resulted in a total of 54 genome alignments; alignments between sequence data and reference genome were excluded if they were generated from the same sample, as is the case for the killer whale, pygmy sperm whale, and North Atlantic right whale. Average read depth and % reads mapping to the reference genome are reported in Supplemental Table S2. For each combination of species and reference genome, we generated 18 estimates of whole-genome substitution rate by varying estimates of divergence date, ancestral heterozygosity, and annual vs. generational substitution rate estimates. Using the pygmy sperm whale reference genome, divergence date from McGowen et al.²³, and without correcting for ancestral heterozygosity, we also generated an estimate of substitution rate at each chromosome (n = 20 autosomes plus X chromosome), in both coding and non-coding regions (n = 2 per species), and at maximum read depths of 20, 30, 40…500 (n = 50 per species).

Whole genome substitution rate varies among individual species included in the study; rates largely fall between previously published average estimates for mysticetes and odontocetes (Figure 1).

Figure 1. Left: Whole genome annual substitution rate for each cetacean species included in the study, colored by species. Warm colors = mysticetes. Cool colors = odontocetes. Reference distance shown by shape. Species abbreviation codes as in Table 1. Average annual mysticete and odontocete substitution rate estimates from Dornburg et al. 2012 (UCLN model) shown as dashed and dot-dash lines, respectively. Images from phylopic.org and rphylopic. Right: Difference in substitution rates between odontocetes (blue) and mysticetes (yellow) across three divergence time strata.

Phylogenetic or demographic effects on substitution rate

We first examined the effect of phylogenetic inertia on substitution rate and found that annual substitution rate was significantly different between the two major cetacean clades (ANOVA p = 1.5e-05, F = 37.39, Figure 2) and accounts for approximately 70% of annual substitution rate variability, with substitution rates that were significantly lower in mysticetes than in odontocetes.

We then re-analyzed substitution rate between suborders using generational substitution rates, and found no significant differentiation (ANOVA p = 0.29, Figure 2). Further, the observed pattern in substitution rate differentiation between suborders is flipped, with a higher mean generational substitution rate in the mysticetes than in the odontocetes.

All three variables - maximum lifespan, asymptotic length, and generation time - were found to be significant (p<0.05) drivers of annual mutation rate (controlling for phylogenetic inertia) in the AIC-selected model (AIC = -827, next model $\Delta$AIC = 4). Maximum lifespan had a relatively small positive effect on annual mutation rate, while asymptotic length and generation time both had relatively larger negative effects on annual mutation rate. Figure 2 illustrates the observed decrease in annual mutation rates with increasing generation time, which was the covariate with lowest p-value (p = 0.037). Supplemental Figure S1 shows predicted mutation rates by generation time and lifespan, holding asymptotic length at the mean across all species in the study (7.3 m).

Maximum lifespan was a significant positive driver of generational substitution rate in the AIC-selected model (AIC = -572.1332894), again controlling for phylogenetic inertia (p = 7^{-5}, Figure 2). The model including asymptotic length had higher a AIC value than the model with lifespan as the only parameter (next model $\Delta$AIC = 45); it follows that asymptotic length was not a significant predictor of generational substitution rate when considered as the only model parameter or in combination with maximum lifespan.

Figure 2. Top: Variability in annual (left) and generational (right) substitution rate estimated in two suborders. Bottom: Phylogeny-controlled linear trend in annual subsitution rate by species’ generation time (left) and generational substitution rate by species’ asymptotic lifespan (right). Points are colored by species (as in Figure 1) in the top two plots; warmer colors are mysticetes and cooler colors are odontocetes. Points are colored by cetacean clade in the bottom two plots.

Pipeline effects on substitution rate

Reference species

The reference genome used had the greatest overall magnitude of effect among all factors tested in this study; a species’ divergence time from the reference accounted for most of this effect (Figure 1, Table 3). Reference genomes for this analysis were chosen to reflect three periods of evolutionary divergence: deep divergence, intermediate divergence, and recent divergence. Substitution rate per year was highest in most species when compared to the deep divergence reference genome. Annual substitution rate estimates increase with increasing divergence from the reference genome, and aligning to more recently diverged species results in a higher mutation rate in mysticetes than odontocetes. Yearly substitution rate estimates did not differ significantly between mysticetes and odontocetes when aligned to the deep reference genomes, but did differ significantly when aligned to the intermediate (KW p = 0.0007) and recently diverged genomes (KW p = 0.002; Figure 1).

Chromosome

Annual substitution rates varied significantly among chromosomes (Kruskal-Wallis p = 7.8553467^{-18}; Figure 3). The range in substitution rates across chromosomes within each species (mean range = 1.5^{-10}) was similar to the range in whole genome substitution rates across all species included in the study (-2^{-10}). Chromosome-level substitution rates were significantly different in at least half of the pairwise comparisons in chromosomes 4, 5, and 14-20. The substitution rate was significantly lower in chromosome X than other chromosomes. Broadly, relative patterns in substitution rate were consistent across chromosomes, with odontocete rates higher than mysticete substitution rates in all chromosomes. Across all species, variability in substitution rate correlated significantly with chromosome size (linear regression p < 0.0001), with lower substitution rates in larger chromosomes.

Coding vs. non-coding regions

Annual substitution rate estimates were not significantly different in coding vs. non-coding regions (Kruskal-Wallis p=0.1638926, Figure 3). Rates were marginally higher in non-coding regions compared to coding regions across all species included in the study, increasing by an average of -1.5^{-11} substitutions/site/year (min = 8^{-12}, max = -2.7^{-11}). Similar to the relative patterns observed among species’ chromosome-level substitution rates, relative substitution rates were similar in coding and non-coding regions, with most odontocetes having higher substitution rates than mysticetes.

Figure 3. Top: Annual substitution rate across chromosomes aligned to the pygmy sperm whale reference genome. Red stars indicate chromosomes that were significantly different from at least 10 chromosomes. Bottom left: Annual substitution rate in coding vs. non-coding regions of the genome. Bottom right: Linear regression showing decreasing mutation rate with increasing chromosome size. Points are colored by species in all subplots; warmer colors are mysticetes and cooler colors are odontocetes.

Maximum read depth filters

At maximum read depth thresholds less than ~150-200, annual substitution rate estimates vary with increasing maximum read depth cutoff (Figure 4). Most species’ substitution rate estimates level out at a maximum read depth cutoff between 150 and 250 reads. GAM modeling indicated that read depth was a significant predictor of substitution rate estimate (p << 0.0001), and a Kruskal-Wallis test confirmed that substitution rates varied significantly (p = 0.005) among cumulative read depth bins (e.g. read depth of 10-20, 10-30, 10-40 … 10-500). This effect was driven primarily by variability below a maximum read depth of approximately 150, where mutation rate estimates stabilized.

Divergence date estimates

Varying divergence date estimates did not have a significant effect on annual substitution rate estimates (Kruskal-Wallis p=0.07, Figure 4). Divergence dates from Lloyd and Slater⁴³ resulted in the highest within-group variability in substitution rate estimates among species, driven primary by variability at the suborder level, and relative substitution rates among species differed slightly among all three strata included in this test.

Ancestral heterozygosity

Varying ancestral heterozygosity also had a significant effect on annual substitution rate estimates (Kruskal-Wallis p=0.001, Figure 4). This was driven by the group of substitution rates that were corrected using the maximum reported cetacean heterozygosity in Morin et al. (2025), which were significantly different from all other correction methods. We did not find any other significant differences across all other pairwise comparisons of correction methods.

$Figure 4. Left: Annual substitution rate (substitutions/site/year) estimated with varying upper read depth cutoff for genotyping variants along the genome. Top plot shows maximum read depth up to 500, and bottom plot zooms in on maximum read depths between 20 and 200. Right top: Annual substitution rate (substitutions/site/year) estimated using divergence times from Lloyd and Slater 2021, McGowen et al. 2020, and Timetree.org. Right bottom: Annual substitution rate (substitutions/site/year) estimated using five approaches to accounting for ancestral heterozygosity: 1) no $\pi_{anc}$, 2) $\pi_{anc}$ equal to contemporary heterozygosity, 3) $\pi_{anc}$ equal to maximum cetacean heterozygosity, 4) $\pi_{anc}$ equal to mean cetacean heterozygosity, and 5) $\pi_{anc}$ equal to minimum cetacean heterozygosity. Red stars indicate strata that were significantly different in post hoc pairwise comparisons.$

Figure 4. Left: Annual substitution rate (substitutions/site/year) estimated with varying upper read depth cutoff for genotyping variants along the genome. Top plot shows maximum read depth up to 500, and bottom plot zooms in on maximum read depths between 20 and 200. Right top: Annual substitution rate (substitutions/site/year) estimated using divergence times from Lloyd and Slater 2021, McGowen et al. 2020, and Timetree.org. Right bottom: Annual substitution rate (substitutions/site/year) estimated using five approaches to accounting for ancestral heterozygosity: 1) no $\pi_{anc}$, 2) $\pi_{anc}$ equal to contemporary heterozygosity, 3) $\pi_{anc}$ equal to maximum cetacean heterozygosity, 4) $\pi_{anc}$ equal to mean cetacean heterozygosity, and 5) $\pi_{anc}$ equal to minimum cetacean heterozygosity. Red stars indicate strata that were significantly different in post hoc pairwise comparisons.

Comparison among all pipeline effects

We compared effect sizes ($\eta^2$) among strata for all pipeline tests (Table 3) to determine relative effect of all parameters in this study on substitution rate estimates. Variability in reference species had the greatest effect on substitution rate estimates, and variability in phylogenetic distance to reference made up the majority of that effect. This was followed by variability driven by maximum read depth. Compared with these three effects, all others were comparatively small, with the choice of coding vs. non-coding regions having the least effect on estimated mutation rate.

Table 3. Species abbreviation codes, genome source database, and accession number for each species used as a reference genome for alignment in this study.
Pipeline effect	$\eta^2$
reference species	0.78
reference distance	0.74
chromosome	0.3
$\pi_{anc}$	0.16
divergence date	0.06
read depth	0.03
coding/noncoding	0.004

Discussion

Evolutionary variability in mutation rates has been described across broad taxonomic categories, e.g. eukaryotes⁹, vertebrates⁴, mammals⁸, and birds³, and in cetaceans using subsets of genomic data^21,22. By interrogating the full nuclear genome, we found strong evidence for phylogenetic and demographic variability in mutation rate among the cetacean species in our study, possibly driven by evolutionary shifts in lifespan and generation time. Cetacean mutation rate also varies throughout the genome, driven at least partially by chromosome size. In addition to ecological and evolutionary variability in mutation rate, we found that the way mutation rate is estimated will yield significantly different results. Relationships among species-specific mutation rates differed when examining annual vs. generational mutation rates and mutation rates by maximum read depth threshold, but otherwise remains relatively consistent.

Evolutionary patterns in mutation rate

Perhaps most notably, we observed phylogenetic patterns in annual substitution rate supporting previous assertions that mysticetes have lower annual mutation rates than odontocetes^21,22. The same phylogenetic patterns were not observed using generational substitution rates; instead we found support for higher generational substitution rates in longer-lived species, supporting recent pedigree-based estimates of high generational mutation rates in some mysticetes¹⁴. While these results may seem contradictory, they highlight the broad range of lifespans within cetaceans and suggest that annual mutation rate has evolved in response to the evolution of longer lifespans. It is possible that evolutionary selection has favored lower annual mutation rates within the long-lived Mysticeti. However, the much longer lifespans in many of these species still results in higher generational mutation rates in mysticetes even with lower annual mutation rates. A similar pattern has been observed when comparing annual and generational mutation rates in humans vs. mice⁴⁵; one study incorporating 69 wild and domestic vertebrate species estimated that annual mutation rate varies 120-fold among species while generational mutation rate varies 40-fold⁴, with generation time as a key driver of both.

Another notable pattern is the decline in species’ substitution rate estimates driven by decreasing divergence time to the reference genome (Figure 1). While this observed pattern may reflect a true evolutionary process, it may also be caused by a decrease in mapping quality with increasing distance to the reference genome; a recent study indicates a small but significant effect of reference distance on the percent of reads mapped and percent missing data and a downstream increase in heterozygosity within a clade of trees⁴⁶, although this study did not consider the potential conflation of observation bias (e.g. caused by differences in read mapping) with true evolutionary process driving changes in mutation rate. A similar study illustrated a decrease in heterozygosity with increasing distance to reference species within a clade of birds⁴⁷. In our study, an average of 90% of reads per species mapped to the distant reference, while 98.8% of reads per species mapped to the most recently diverged reference and 91.6% of reads per species mapped to the intermediate reference.

On the other hand, from an evolutionary perspective lower mutation rates are generally thought to be advantageous as they reduce the prevalence of deleterious mutations, and are facilitated by larger effective population size ($N_e$) and more homogeneous environments^48–50. Our earlier observation that longer-lived cetaceans tended to have lower annual mutation rates may provide an explanation as to why mutation rate seems to have decreased more rapidly in mysticetes than in odontocetes. It is possible that the rapid decrease in mutation rates in observed mysticetes reflects an evolutionary process in response to longer lifespans, and may also be facilitated by ecological strategies in these species. While many odontocetes form resident populations with relatively small $N_e$ (Morin et al. 2025) and the potential for significant environmental variability, it may be that the ecology of mysticete populations, which tend cover large geographic expanses, increases connectivity and decreases temporal variability within the species globally, thus allowing mutation rate to decrease more rapidly than in odontocetes.

Logistical considerations

In addition to the evolutionary and demographic effects we observed on substitution rate, we found that many pipeline parameters can affect substitution rate estimates. In general, altering pipeline parameters tended to shift all estimates of substitution rate in the same direction, with minimal observed shifts in the relative position of substitution rates among species.

Among pipeline effects on substitution rate, the choice of reference genome had the greatest effect on species-specific estimates of whole genome substitution rate, and the time since divergence from the reference genome made up a large portion of this variability. To our knowledge, this is the first report of a difference in substitution rate estimates based on the choice of reference sequence.

Because the choice of reference genome can vary mutation rate estimates by up to an order of magnitude in some cetacean species, it is crucial to consider how best to select an appropriate reference genome for aligning new genomic data. Substitution rate estimates represent an average over the evolutionary time period since divergence from the reference species; it may therefore be important to choose a reference genome that reflects the time period of the study question. For example, population genomics or other studies concerned with processes occurring in recent evolutionary time might choose a reference genome from a recently diverged species, while comparative genomic studies might choose a reference genome from a deeply diverged species. Similarly, studies investigating adaptive divergence or local selection within a species may benefit from using a high-quality conspecific reference, while studies focused on phylogenetic inference, genome architecture evolution, or ancient divergence events may intentionally use a mutation rate estimated using a more distant reference in order to capture variability over the time frame of the study.

A question frequently encountered in developing a pipeline for sequencing alignment and processing is where to set the threshold of maximum read depth. While it is well known that read depths less than 10 can bias estimates of heterozygosity, the effect of very high read depths is not well documented. Regions with very high read depths may be an indication of misalignment or repeat copies that can lead to incorrect genotype calls (e.g., suppression of heterozygous genotypes in collapsed repeat regions). A question frequently encountered in developing a pipeline for sequencing alignment and processing is where to set the threshold of maximum read depth. While it is well known that read depths less than 10 can bias estimates of heterozygosty, the effect of very high read depths is not well documented. Regions with very high read depths may be an indication of misalignment or repeat copies that can lead to incorrect genotype calls (e.g., suppression of heterozygous genotypes in collapsed repeat regions). We found that mutation rate estimates were generally lower at very high read depths, and generally stabilized a maximum read depths greater than 150. At maximum read depth thresholds lower than 150, mutation rate varied on a per-species basis, although the effect was minimal compared to other pipeline parameters tested in this study. While this variability may be attributed to mean depth of coverage across species, we did not find that depth of coverage correlates with annual mutation rate estimates across species which likely precludes this as a contributor to the observed variability. We did find that most callable bases had read depths between 10 and ~150 (Supplemental Figure S2), and that relatively fewer callable bases were gained at read depths > ~200. To further understand the cause of this variability, we recommend examining high coverage genomic regions on a per-species basis to understand differing patterns of heterozygosity.

Substitution rates within the genome varied among chromosomes, but not between coding and non-coding regions. The variability in substitution rate among chromosomes was driven at least in part by chromosome size, with a pattern of lower substitution rates in larger chromosomes that has been observed in other taxa⁵¹. The significantly lower substitution rate observed in Chromosome X is likely driven at least in part by its shorter residence time in males, which contribute more mutations to their offspring than females^52,53. Interestingly, the opposite pattern is observed in birds, where the Z chromosome that has a shorter residence time in females nonetheless has a higher mutation rate than autosomal chromosomes⁵⁴. These results, suggest that genomic studies may refine their mutation rate estimates by targeting the specific chromosomes included in their studies.

Varying divergence date and ancestral heterozygosity had the smallest effect on mutation rate estimates. The minimal effect of varying divergence date estimates indicates a high level of precision and convergence among our current estimates of divergence dates among cetacean species^{23,e.g., 43,44}. Correcting for species-specific ancestral heterozygosity has a linearly predictable effect on substitution, because $\pi_{anc}$ is simply subtracted from estimated genomic divergence before dividing by divergence time. We expect that subtracting any value from genomic divergence will cause a significant effect on the estimated substitution rate; the magnitude of the effect is determined by the size of $\pi_{anc}$. Our dataset confirmed this expectation. No significant differences were found in substitution rates estimates made using species-specific genomic heterozygosity, mean heterozygosity, minimum heterozygosity, or no heterozygosity. The substitution rate estimated using maximum heterozygosity was significantly different from all other estimates. All heterozygsity estimates come from Morin et al. (2025); which reports an anomalously high heterozygosity for K. breviceps compared to all other cetaceans and drives the result reported here. Because species-specific ancestral heterozygosity is currently unknowable, it is important to carefully consider whether and how to account for ancestral heterozygosity when estimating species-specific substitution rates, and to clearly describe this correction in the methods of genomic studies.

Although not directly targeted by this study, we found that substitution rates for the same species, estimated using the same method, differ based on whether they are estimated annually vs. generationally, with potentially profound effects on downstream interpretation of evolutionary relationships or conservation applications. Recent disagreement about whether mutation rates in mysticetes are larger or smaller than odontocetes^{14,e.g., 21} may be at least partially explained by differences in the use of annual vs. generational mutation rate, and future studies will need to carefully consider the implications of each of these for the interpretation of their results.

Subsitution rate vs. pedigree-based mutation rate estimates

Recently, the development of a method to estimate mutation rate by quantifying the number of de novo germline mutations passed down through multi-generational family pedigrees has produced estimates that are considered more accurate and robust than estimates based on phylogenetic relationships^{e.g. 4,14}. While this method remains subject to some of the same sources of error (e.g. sampling bias and postmortem DNA damage, genotyping error), pedigree-based mutation rate estimates allow for direct quantification of inherited mutations without the need for the estimation of parameters such as divergence date or generation time. Although the pedigree approach is a more direct way of estimating mutation rate, for most taxa - particularly elusive marine species - parent and offspring genomes are not available for direct comparison. Further, while using this method with a large sample size can produce an accurate estimate of generational mutation rate, translating that to an annual mutation rate estimate would require knowledge of the age of parents, which is often difficult or impossible to obtain. Pedigree-based estimates will reflect contemporary population demographics and the age of the parents, and may differ significantly from historical mutation rates; however, recent studies comparing the two methods indicate general agreement when comparisons are drawn between generational mutation rates only⁵⁵.

Our study includes two species - the blue and humpback whales - for which pedigree-based mutation rate has also been estimated¹⁴. While sample size was limited in the pedigree study (N = 1 blue whale pedigree and N = 5 humpback whale pedigrees), the study provides good context for comparison among pedigree-based mutation rate estimates and phylogeny-based substitution rate estimates.

The phylogeny-based estimate of generational substitution rate most similar to pedigree-based estimates was obtained when blue whale sequence data were aligned to the North Atlantic right whale ($\mu_{ped}$ = 1.24 x 10^-8, 95% CI: 0.85 x 10^-8 to 1.63 x 10^-8; $\mu_{phy}$ = 1.1852888^{-8}; Supplemental Table S3). All phylogeny-based estimates of generational whole genome substitution rate in blue whales aligned to the North Atlantic right whale, regardless of divergence date or ancestral heterozygosity, were within the 95% confidence interval of the pedigree-based estimate from Suarez-Menendez et al.¹⁴. When aligned to both the pygmy sperm whale and the hippo, $\mu_{phy}$ estimates were above the Suarez-Menendez et al. (2023) 95% CIs ($\mu_{phy}$ = 1.7417312^{-8} and 2.2682233^{-8}, respectively; Supplemental Table S3), reflecting the larger mutation rates in deeper evolutionary time.

Similarly, across five humpback whale pedigrees, Suarez-Menendez et al.¹⁴ estimated a pedigree-based generational mutation rate of 1.12 × 10⁻⁸ (CI: 0.94 × 10⁻⁸ to 1.30 × 10⁻⁸). In this case, $\mu_{phy}$ was slightly below the 95% CI when using the North Atlantic right whale as a reference (0.87 x 10⁻⁸) and slightly above the 95% CI when using the pygmy sperm whale or hippo as a reference (1.2 x 10⁻⁸ and 1.7 x 10⁻⁸, respectively; Supplemental Table S3). In both cases, mutation rate estimates have been made using small samples sizes and may be biased; increasing sample size within species is an important next step toward an accurate estimate of species-specific mutation rates.

Conclusions and Recommendations

Species-specific mutation rates evolve over time; cetacean mutation rates are affected by phylogeny and maximum lifespan, which are highly correlated within the clade. Mysticetes, which trend toward longer-lived species, have lower annual mutation rates but similar generational mutation rates to odontocetes, highlighting the importance of using the same type of mutation rates when drawing comparison across species. Because mutation rate varies over time within an evolutionary lineage, the choice of reference species will have a large affect on estimates of heterozygosity and mutation rate; therefore, reference species should be carefully selected to reflect the timespan of the research question. While divergence date and ancestral heterozygosity estimates had smaller effects on mutation rate, we found evidence that the mutation rate varies along the genome; therefore the chromosomes included in a study will also affect our estimates of heterozygosity and mutation rate. In most population genetic studies, it is recommendable to include a large number of sites across the genome to better represent the genomic average. In targeted studies, it may be more effective to estimate heterozygosity or mutation rate across a carefully selected subset of the genome. Given the phylogenetic, demographic, genomic, and pipeline effects on variability in mutation rate demonstrated in this study, it is vital that genomic studies carefully consider how to process sequence data within the context of the targeted research question, and give consideration to the potential effects of pipeline parameters on the interpretation of their results.

Because substitution rate variability among species within each suborder was minimal, we report mean substitution rates estimated using the pygmy sperm whale reference genome, divergence from McGowen et al. (2020), and no correction for ancestral heterozygosity for both odontocetes (annual = 6.5 x 10^-10, generational = 1.2 x 10^-8) and mysticetes (annual = 5.7 x 10^-10, generational = 1.4 x 10^-8). These estimates may be appropriate for use in studies of population genomics or “recent” evolutionary genomics. However, we recommend that any studies involving species not included in this study estimate a mutation rate using the method employed here; scripts for counting substitutions and estimating substitution rate are available at https://github.com/UW-WADE-lab/Mutation-rate-variability-in-cetaceans. Specific rates for each species and reference genome included in this study are reported in Supplemental Table S3.

ACKNOWLEDGEMENTS

We appreciate the expert input and review of early drafts of this manuscript by multiple colleagues, including Dr. Annabel Beichman and Dr. Michael McGowen. We also thank Dr. Jacqueline Robinson for providing publicly available and reproducible methods for mutation rate estimation from genomic data. High-quality genomic sequence data used in this study were made publicly available by the Cetacean Genomes Project in collaboration with the Vertebrate Genomes Project.

DATA AVAILABILITY

All genomic sequence data and reference genomes were accessed via NCBI or ENA using the accession numbers listed in this manuscript. All genome alignment scripts are available at https://github.com/UW-WADE-lab/Whole-Genome-Alignment, and all scripts used in mutation rate estimation, data analysis, and manuscript generation are available at https://github.com/UW-WADE-lab/Mutation-rate-variability-in-cetaceans.

REFERENCES

Zeng, K., Jackson, B. C. & Barton, H. J. Methods for estimating demography and detecting between-locus differences in the effective population size and mutation rate. Molecular Biology and Evolution 36, 423–433 (2019).

Hohenlohe, P. A., Funk, W. C. & Rajora, O. P. Population genomics for wildlife conservation and management. Molecular Ecology 30, 62–82 (2021).

Nabholz, B., Glémin, S. & Galtier, N. The erratic mitochondrial clock: Variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evolutionary Biology 9, 1–13 (2009).

Bergeron, L. A. et al. Evolution of the germline mutation rate across vertebrates. Nature 615, 285–291 (2023).

Pereira, S. L. & Baker, A. J. A mitogenomic timescale for birds detects variable phylogenetic rates of molecular evolution and refutes the standard molecular clock. Molecular Biology and Evolution 23, 1731–1740 (2006).

Smith, S. A. & Donoghue, M. J. Rates of molecular evolution are linked to life history in flowering plants. Science 322, 86–89 (2008).

Latta, L. C. et al. Genomic background and generation time influence deleterious mutation rates in Daphnia. Genetics 193, 539–544 (2013).

Nabholz, B., Glémin, S. & Galtier, N. Strong variations of mitochondrial mutation rate across mammals—the longevity hypothesis. Molecular Biology and Evolution 25, 120–130 (2008).

Lewin, L. & Eyre-Walker, A. Estimates of the mutation rate per year can explain why the molecular clock depends on generation time. Molecular Biology and Evolution 42, msaf069 (2025).

10.

Thomas, J. A., Welch, J. J., Lanfear, R. & Bromham, L. A generation time effect on the rate of molecular evolution in invertebrates. Molecular Biology and Evolution 27, 1173–1180 (2010).

11.

Martin, A. P. & Palumbi, S. R. Body size, metabolic rate, generation time, and the molecular clock. Proceedings of the National Academy of Sciences of the United States of America 90, 4087–4091 (1993).

12.

Ellegren, H., Smith, N. G. & Webster, M. T. Mutation rate variation in the mammalian genome. Current Opinion in Genetics & Development 13, 562–568 (2003).

13.

Hodgkinson, A. & Eyre-Walker, A. Variation in the mutation rate across mammalian genomes. Nature Reviews Genetics 12, 756–766 (2011).

14.

Suárez-Menéndez, M. et al. Wild pedigrees inform mutation rates and historic abundance in baleen whales. Science 381, 990–995 (2023).

15.

Morin, P. A. et al. Building genomic infrastructure: Sequencing platinum-standard reference-quality genomes of all cetacean species. Marine Mammal Science 36, 1356–1366 (2020).

16.

Morin, P. A. et al. Genomic infrastructure for cetacean research and conservation: Reference genomes for eight families spanning the cetacean tree of life. Frontiers in Marine Science 12, (2025).

17.

Árnason, Ú., Lammers, F., Kumar, V., Nilsson, M. A. & Janke, A. Whole-genome sequencing of the blue whale and other rorquals finds signatures for introgressive gene flow. Science Advances 4, eaap9873 (2018).

18.

Cammen, K. M., Wilcox, L. A., Rosel, P. E., Wells, R. S. & Read, A. J. From genome-wide to candidate gene: An investigation of variation at the major histocompatibility complex in common bottlenose dolphins exposed to harmful algal blooms. Immunogenetics 67, 125–133 (2015).

19.

Kardos, M. et al. Inbreeding depression explains killer whale population dynamics. Nature Ecology and Evolution 7, 675–686 (2023).

20.

Skovrind, M. et al. Elucidating the sustainability of 700 y of Inuvialuit beluga whale hunting in the Mackenzie River Delta, Northwest Territories, Canada. Proceedings of the National Academy of Sciences 121, e2405993121 (2024).

21.

Jackson, J. A. et al. Big and slow: Phylogenetic estimates of molecular evolution in baleen whales (suborder Mysticeti). Molecular Biology and Evolution 26, 2427–2440 (2009).

22.

Dornburg, A., Brandley, M. C., McGowen, M. R. & Near, T. J. Relaxed clocks and inferences of heterogeneous patterns of nucleotide substitution and divergence time estimates across whales and dolphins (Mammalia: Cetacea). Molecular Biology and Evolution 29, 721–736 (2012).

23.

McGowen, M. R. et al. Phylogenomic resolution of the cetacean tree of life using target sequence capture. Systematic Biology 69, 479–501 (2020).

24.

Li, H. Aligning Sequence Reads, Clone Sequences and Assembly Contigs with BWA-MEM. https://arxiv.org/abs/1303.3997v2 (2013).

25.

Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, (2009).

26.

Quinlan, A. R. & Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

27.

Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science (New York, N.Y.) 326, 289–293 (2009).

28.

Narasimhan, V. et al. BCFtools/RoH: A hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics 32, 1749–1751 (2016).

29.

Alves, J. M. & Posada, D. Sensitivity to sequencing depth in single-cell cancer genomics. Genome Medicine 10, 29 (2018).

30.

Desai, A. et al. Identification of optimum sequencing depth especially for de novo genome assembly of small genomes using next generation sequencing data. PLOS ONE 8, e60204 (2013).

31.

Korneliussen, T. S., Albrechtsen, A. & Nielsen, R. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics 15, 356 (2014).

32.

Taylor, B. L., Chivers, S. J., Larese, J. & Perrin, W. F. Generation length and percent mature estimates for IUCN assessments of cetaceans. Administrative Report LJ-07-01 National Marine Fisheries 24 (2007) doi:doi:10.1.1.530.4789.

33.

R Core Team. R: A language and environment for statistical computing. (2016).

34.

Hollander, M., Wolfe, D. A. & Chicken, E. Nonparametric Statistical Methods. (John Wiley & Sons, Inc., 2015).

35.

Cochran, W. & Snedecor, G. Statistical Methods. (Iowa State University Press, Ames, Iowa, 1967).

36.

Groot, N. E., Constantine, R., Garland, E. C. & Carroll, E. L. Phylogenetically controlled life history trait meta-analysis in cetaceans reveals unexpected negative brain size and longevity correlation. Evolution 77, 534–549 (2023).

37.

Spearman Rank Correlation Coefficient. in The Concise Encyclopedia of Statistics 502–505 (Springer, New York, NY, 2008). doi:10.1007/978-0-387-32833-1_379.

38.

Revell, L. J. Phytools 2.0: An updated R ecosystem for phylogenetic comparative methods (and other things). PeerJ 12, e16505 (2024).

39.

Royston, J. P. Algorithm AS 181: The W test for normality. Journal of the Royal Statistical Society. Series C (Applied Statistics) 31, 176–180 (1982).

40.

Royston, J. P. An extension of Shapiro and Wilk’s Wtest for normality to large samples. Journal of the Royal Statistical Society. Series C (Applied Statistics) 31, 115–124 (1982).

41.

Kruskal, W. & Wallis, W. A. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47, 583–621 (1952).

42.

Dunn, O. J. Multiple Comparisons Using Rank Sums. Technometrics 6, 241–252 (1964).

43.

Lloyd, G. T. & Slater, G. J. A total-group phylogenetic metatree for Cetacea and the importance of fossil data in diversification analyses. Systematic Biology 70, 922–939 (2021).

44.

Kumar, S., Stecher, G., Suleski, M. & Blair Hedges, S. Timetree: A resource for timelines, timetrees, and divergence times. Molecular Biology and Evolution 34, 1812–1819 (2017).

45.

Lindsay, S. J., Rahbari, R., Kaplanis, J., Keane, T. & Hurles, M. E. Similarities and differences in patterns of germline mutation between mice and humans. Nature Communications 10, 4053 (2019).

46.

Mohn, R. A., Garner, M., Manos, P. S. & Hipp, A. L. Variant Calling in the Goldilocks Zone: How Reference Genome Choice and Read Mapping Stringency Impact Heterozygosity Estimates and Phylogenetic Analyses. Molecular Ecology Resources 26, e70079 (2026).

47.

Galla, S. J. et al. Reference Genomes from Distantly Related Species Can Be Used for Discovery of Single Nucleotide Polymorphisms to Inform Conservation Management. Genes 10, 9 (2019).

48.

André, J.-B. & Godelle, B. The evolution of mutation rate in finite asexual populations. Genetics 172, 611–626 (2006).

49.

Ohta, T. The nearly neutral theory of molecular evolution. Annual Review of Ecology, Evolution, and Systematics 23, 263–286 (1992).

50.

Wei, W. et al. Rapid evolution of mutation rate and spectrum in response to environmental and population-genetic challenges. Nature Communications 13, 4752 (2022).

51.

Künstner, A. et al. Comparative genomics based on massive parallel transcriptome sequencing reveals patterns of substitution and selection across 10 bird species. Molecular Ecology 19, 266–276 (2010).

52.

Giannelli, F. & Green, P. M. The X chromosome and the rate of deleterious mutations in humans. American Journal of Human Genetics 67, 515 (2000).

53.

Nachman, M. W. & Crowell, S. L. Estimate of the mutation rate per nucleotide in humans. Genetics 156, 297–304 (2000).

54.

Zhang, G. et al. Comparative genomics reveals insights into avian genome evolution and adaptation. Science (New York, N.Y.) 346, 1311–20 (2014).

55.

Campbell, C. R. et al. Pedigree-based and phylogenetic methods support surprising patterns of mutation rate and spectrum in the gray mouse lemur. Heredity 127, 233–244 (2021).

Supplemental Table S1. Full sample metadata for each species and reference genome combination used in this study.

Supplemental Table S2. Average read depth and % reads mapped for each species and reference genome combination used in this study.

Supplemental Table S3. Species-specific whole genome annual and generational substitution rate estimates for each species-reference combination included in this study, using divergence dates from McGowen et al. (2020) and no correction for ancestral heterozygosity.

Supplemental Figure S1. Top: Predicted annual substitution rate values by generation time and lifespan from linear model shown in Figure 2. Bottom: Predicted annual substitution rate values by generation time and asymptotic length from linear model shown in Figure 2.