Research Article |
Corresponding author: Seth D. Musker ( sethmusker@gmail.com ) Academic editor: Félix Forest
© 2025 Seth D. Musker, Nicolai M. Nürk, Michael D. Pirie.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Musker SD, Nürk NM, Pirie MD (2025) Maximising informativeness for target capture-based phylogenomics in Erica (Ericaceae). PhytoKeys 251: 87-118. https://doi.org/10.3897/phytokeys.251.136373
|
Plant phylogenetics has been revolutionised in the genomic era, with target capture acting as the primary workhorse of most recent research in the new field of phylogenomics. Target capture (aka Hyb-Seq) allows researchers to sequence hundreds of genomic regions (loci) of their choosing, at relatively low cost per sample, from which to derive phylogenetically informative data. Although this highly flexible and widely applicable method has rightly earned its place as the field’s de facto standard, it does not come without its challenges. In particular, users have to specify which loci to sequence—a surprisingly difficult task, especially when working with non-model groups, as it requires pre-existing genomic resources in the form of assembled genomes and/or transcriptomes. In the absence of taxon-specific genomic resources, target sets exist that are designed to work across broad taxonomic scales. However, the highly conserved loci that they target may lack informativeness for difficult phylogenetic problems, such as that presented by the rapid radiation of Erica in southern Africa. We designed a target set for Erica phylogenomics intended to maximise informativeness and minimise paralogy while maintaining universality by including genes from the widely used Angiosperms353 set. Comprising just over 300 genes, the targets had excellent recovery rates in roughly 90 Erica species as well as outgroups from Calluna, Daboecia, and Rhododendron, and had high information content as measured by parsimony informative sites and Quartet Internode Resolution Probability (QIRP) at shallow nodes. Notably, QIRP was positively correlated with intron content, while including introns in targets—rather than recovering them via exon-flanking “bycatch”—substantially improved intron recovery. Overall, our results show the value of building a custom target set, and we provide a suite of open-source tools that can be used to replicate our approach in other groups (https://github.com/SethMusker/TargetVet).
Bioinformatics, Ericaceae, Phylogenomics, Target capture
The field of angiosperm phylogenetics has seen considerable advances in the last decade, much of which is owed to the democratisation of phylogenomics via target capture (
When starting a target capture-based phylogenomic project, researchers need to decide whether to use a universal target set or one that is taxon-specific. While universal sets allow comparability of sequence data among different studies in flowering plants (
Designing a custom target set affords the researcher the opportunity to choose targets optimised for their study system, which might include attempting to maximise informativeness, minimise the chance of downstream errors, or allow for comparability to other data sets. Further, there may be the possibility of using newly available resources, such as draft genomes or new software, to improve on pre-existing target sets. Several factors need consideration when designing a custom target set:
The genus Erica is notably large, comprising over 851 species distributed in Europe and Africa (
We set out to design a novel target set that would enable accurate phylogenomic analysis of closely related Erica species, but which could also be used to study relationships at both higher levels (e.g., between African and European Erica species, or between genera within Ericaceae) and lower levels (e.g., between closely related taxa in species complexes, or between populations within species). We implemented a mixed approach incorporating both universal and taxon-specific loci, as well as a mixture of intron-containing and exon-only loci. This approach aimed to balance concerns about paralogy, informativeness, comparability, and cost.
Implementing the approach involved (1) refining a pre-existing target set by iteration; (2) adding more targets derived from several recently published high-quality Rhododendron genomes and by reference to the angiosperm-wide Angiosperms353 target set; and (3) producing and using new whole-genome shotgun (WGS) sequencing data from three Erica species to quality check the new targets and produce Erica-specific versions of most targets, including the full gene sequences (exons and introns). We present a new pipeline to distinguish and identify paralogs during both customized target design and sequence data curation.
Furthermore, we investigate the impacts of alternative target set design choices on downstream analyses. Firstly, we ask whether draft genomes and WGS reads can be used to predict the presence and paralogy of potential targets. Secondly, we investigate the effect of different target identification methods on the usefulness and quality of the targets. Lastly, we examine the costs and benefits of explicitly targeting intronic regions with emphasis on capture efficiency and phylogenetic informativeness.
Our primary goal was to improve on the work of
The marker identification steps produced a very large number of candidate loci which we filtered based on a variety of criteria. Notably, we were able to evaluate not only the presence of each gene in Erica but also its status as single copy. This was enabled by newly generated high-depth shotgun WGS data from three Erica species. We made further use of this WGS data by building a draft genome of Erica cinerea L. and, where possible, we used its scaffolds to produce Erica-derived “full gene” versions of the targets, i.e., including both exon and intron sequences.
Finally, we conducted a target capture experiment with 295 samples (mostly of Cape Erica, but also including many European Erica species and additional genera serving as outgroups) using the newly developed target set, which comprised a total of 303 genes. We used the data from this experiment to evaluate the quality of each target in terms of capture efficiency, rate of paralogy, and phylogenetic informativeness, and tested whether these differed between targets produced using different methods. Specifically, we tested for differences between targets (1) identified by refining the
We developed a user-friendly suite of open-source command-line tools, TargetVet, which can be used to aid in developing and assessing a target set. The source code and a detailed account of the tool’s functionality and usage, including example code, are available at https://github.com/SethMusker/TargetVet. A diagram illustrating TargetVet’s functionality is presented in Suppl. material
Genomic DNA was extracted from fresh leaf material of three Erica species growing in the University of Bergen (UiB; Norway) arboretum following a custom protocol (
Raw reads were trimmed using fastp (
Refinement method
We used MarkerMiner v.1.2 (
Because MarkerMiner identified many more genes than could be added to the target set given the total footprint available to the project (Suppl. material
To incorporate the widely used Angiosperms353 target set, we adapted “NewTargets” developed by
Because WGS sequencing represents a largely unbiased method of deriving sequences from a genome, we reasoned that read mapping depth information could be used to infer (1) presence/absence and (2) paralogy of the candidate targets in Erica. In theory, missing targets should have a depth of zero while duplicated regions should have a depth roughly twice that of the mean across all targets (assuming most targets are single-copy). Erica cinerea has a considerably smaller genome than most Erica species with genome size data, including E. trimera (based on the assembly size) and E. cerinthoides (
Additionally, for the Refinement set we applied the above process separately to the E. grandiflora-derived supercontigs and the original transcript-derived targets and added the latter to the final set if they passed the filters but the former failed. We added to TargetVet a pair of command-line scripts (map_WGS_to_targets.sh and VetTargets_WGS.R) which can be applied to any data when provided with one or more WGS read files and a set of target sequences (Suppl. material
We next aimed to produce Erica-derived versions of the new MarkerMiner and NewTargets sets, with the aim being to improve capture efficiency by increasing sequence similarity and including introns. We chose to use only the E. cinerea assembly as it was by far the most contiguous and complete of the three. We removed any scaffolds in the assembly < 500 bp long. The targets were translated to protein sequences using EMBOSS (
The WGS read depth-based filtering procedure described above was repeated for the genomic sequences to help ensure that they were present and single-copy across their full length in other Erica species. Genomic sequences that failed read depth filtering were reverted to their Rhododendron transcript version (which had already passed the filters), while those that passed were substituted in for their corresponding Rhododendron transcripts.
The final target set was used in a target capture experiment including 295 samples, mostly of Cape Erica species. DNA was extracted using a custom protocol (
To investigate the effects of target source (i.e., Rhododendron CDS versus Erica genome) and marker identification method (i.e., Refinement, MarkerMiner and NewTargets) on aspects of target recovery and assembly, we assembled the targets from all 295 samples using HybPiper v.2.0.1. We ran HybPiper’s assemble module using BWA-MEM v.0.7.17 for read mapping, SPAdes v.3.15.3 for assembly (with kmer values of 33 and 77), exonerate v.2.4.0, and BBTools v.38.92.
Prior to assembly with HybPiper, in order to ease computational burden we used reformat.sh from BBTools to randomly subsample each sample’s reads to one million read pairs. Given a total target footprint of 1,161,538 bp and assuming a mean read pair length of ca. 290 bp (to account for trimming and pair overlaps), this gives an expected mean coverage of ca. 250X.
To investigate paralogy we first used HybPiper’s length-based criterion which, on a per-sample basis, flags a target as a potential paralog if its second-longest contig’s length is above a certain proportion (which we set to 0.75, the default) of the longest contig’s length (
Graphical illustration of how TargetVet’s VetHybPiper.sh script estimates paralogy from HybPiper results. First, the assemblies for each target are collated into a single multifasta. These scaffolds are then matched to the reference targets using BLAST. Using the BLAST result, VetTargets_genome.R calculates paralogy % for each target. This process is repeated for each sample in order to populate the paralogy matrix, which DetectParalogs.R analyses to produce summary statistics and visualisations.
.
Additionally, using the above definitions, missingness (M) can be estimated as the fraction of the target’s length with c = 0, and copy number (C) can be estimated as the mean coverage across sites ignoring sites with c = 0.
Estimates of P, M and C were derived from two separate BLASTn mapping results: one in which the actual target sequences were used as the reference, and one in which the transcript versions of the targets were used as the reference. To remove putative paralogs, we discarded targets with mean P (across 295 samples) > 40% according to either of the two BLAST results (n = 13). To remove targets that were poorly recovered, we discarded those with mean M > 40% according to the BLAST result based on the target sequences (n = 5). This reduced the total number of genes from 303 to 285, and herein we refer to these target sets as “Erica303” and “Erica285”, respectively. Unless otherwise stated, all further analyses used the Erica285 target set.
To test whether Erica genome-derived targets had greater capture efficiency than Rhododendron CDS-derived targets, we used separate fixed effect models for each marker identification method to model supercontig length as a function of target source, including sample as a fixed effect to account for random variance, while also allowing the sample effect to vary by transcript length to account for the tendency for longer transcripts to have longer supercontigs. We used HybPiper’s stats module to collect transcript and supercontig lengths for all samples.
Exon-derived baits are only able to capture intronic sequences flanking the exons, meaning that sequence coverage drops off considerably with increasing distance from the nearest exon (
To assess the usefulness of the targets for phylogenomics, we selected a subset of 32 samples including three outgroup samples (Calluna, Daboecia, and Rhododendron) and eight European, one Madagascan, one East African, and 19 Cape Erica (details in Suppl. material
Supercontig MSAs were generated using the L-INS-i algorithm of MAFFT (
Species trees were estimated using a concatenation method and a summary coalescent method. For the concatenation method, IQ-TREE v.2.2.0 (
For the summary coalescent method we used a modification of ASTRAL (
As a means of assessing the impact of paralogs and poorly recovered loci on phylogenetic inference, we ran both IQ-TREE and ASTRAL analyses separately on the Erica303 and Erica285 target sets.
We compared trees inferred using different marker sets and different methods using cophylo from phytools (
Lastly, we aimed to investigate the effects of marker identification method and target source on phylogenetic informativeness. AMAS (
We estimated QIRP for the crown of the clade consisting of the E. abietina/E. viscaria clade, the E. massonii clade, and the E. corifolia clade. All of these clades were recovered with good support by
The quality of the draft genome assemblies of Erica cinerea, E. trimera, and E. cerinthoides varied considerably (Table
E. cinerea | E. trimera | E. cerinthoides | |
---|---|---|---|
Read statistics | |||
Number of read pairs | 340,904,000 | 282,465,000 | 284,039,000 |
% reads merged | 50.69% | 43.84% | 43.97% |
Mean insert size | 306.8 bp | 299.1 bp | 303.4 bp |
Assembly statistics | |||
Scaffold sequence total | 353.050 Mb | 708.005 Mb | 679.014 Mb |
Number of scaffolds | 286,992 | 1,852,782 | 1,463,182 |
Number of scaffolds > 50 kb | 670 | 51 | 1 |
% genome in scaffolds > 50 kb | 13.11% | 0.43% | 0.01% |
Scaffold N50 | 5,597 | 124,874 | 73,631 |
Scaffold L50 | 15,727 bp | 616 bp | 1,028 bp |
Max. scaffold length | 192,106 bp | 121,715 bp | 54,438 bp |
Mean (SD) GC content | 39.5% (0.92%) | 44.9% (1.08%) | 40.3% (0.89%) |
Of the 134
A total of 1,572 mostly single-copy genes were identified by MarkerMiner as being present in at least one of the three Rhododendron transcriptomes (Suppl. material
Of the 353 genes in the Mega353 reference set, 348 were found in at least one of the three Rhododendron transcriptomes and 101 of these were longer than 1,000 bp. Of these, 87 passed WGS depth-based filtering, 59 of which had good matches in the E. cinerea genome. Seven of these failed depth-based filtering and were reverted to their transcript form, leaving 52 genomic sequences and 35 transcript sequences in the final NewTargets set.
After all of the above steps the final combined target “superset” consisted of 303 targets with a combined length of 1,161,538 bp, which we refer to as the “Erica303” set.
Overall paralogy was low across the target superset according to both length- and coverage-based analyses (Suppl. material
Heatmap showing paralogy (P), the estimated proportion of a target’s length covered by more than one assembled contig, for all samples and all loci in the Erica303 superset. Values of P were calculated from BLAST results using the actual target sequences. Targets and samples are arranged by mean P. This plot is a direct product of the TargetVet script VetHybPiper.sh.
Most samples showed similar paralogy patterns (Fig.
Patterns of paralogy (P) per sample. Targets (x-axis) are arranged in ascending order by mean P across all samples. Curves show the predicted P for each sample obtained from n-parameter logistic regressions. The single sample that deviated from the mean P by more than 20% on average across all targets is highlighted (yellow line) and labelled. This plot is a direct product of the TargetVet script VetHybPiper.sh.
Genome-derived targets produced significantly longer supercontigs than CDS-derived targets for the MarkerMiner (1,162 bp longer) and NewTargets (1,647 bp longer) sets, but significantly shorter supercontigs for the Refinement set (1,075 bp shorter; Suppl. material
The analysis of intron length in relation to gene length suggested that Erica-derived targets captured relatively more intronic sequence (Table
The relationship between gene length and intron length depends on the source of the target and the method of target set design. For MarkerMiner and NewTargets targets, the slope is steeper for genome-derived targets (solid lines) than for CDS-derived targets (dashed lines). For Refinement targets, the slope is steeper for CDS-derived targets, though these also have relatively less intronic sequence on average. The dotted lines indicate the 1:1 line. Results of the statistical tests to compare the slopes are given in Table
Results of the fixed effects models of intron length as a function of target source and gene length, showing that longer introns were recovered by Erica genome-derived targets identified using NewTargets and MarkerMiner, whereas longer introns were recovered by Rhododendron CDS-derived targets identified using the Refinement method. The relationship was unaffected by sample identity (R2 ≈ Within R2). Numbers in brackets are standard errors.
MarkerMiner | NewTargets | Refinement | |
---|---|---|---|
Gene length × Source = Erica genome: slope | 0.721*** | 0.781*** | 0.648*** |
(0.003) | (0.005) | (0.002) | |
Gene length × Source = Rhododendron CDS: slope | 0.650*** | 0.598*** | 0.826*** |
(0.005) | (0.003) | (0.005) | |
Source = Rhododendron CDS: intercept | 18.0 | 607.2*** | -1,428.7*** |
(21.1) | (27.2) | (18.4) | |
Observations | 33,599 | 24,691 | 30,957 |
R2 | 0.900 | 0.904 | 0.795 |
Within R2 | 0.900 | 0.904 | 0.795 |
The presence of paralogs and poorly recovered genes had no effect on species tree topology and little effect on branch support (Suppl. material
Tanglegram comparing the phylogenies inferred by concatenation (IQ-TREE; Left) and by ASTRAL (Right) using the Erica285 target superset, which excludes putative paralogs and genes with excessive missing data. For the concatenation tree, branch lengths are in substitutions per site and node labels are SH-alrt/UFBoot percentages. For the ASTRAL tree, branch lengths represent coalescent units (except for terminal branches which are are arbitrarily set to 1 as they are not estimated by ASTRAL) and node labels show posterior probability support. Nodes with full support are unlabelled. The trees are fully bifurcating and are rooted along the branch between the Erica and non-Erica samples arbitrarily for display purposes.
There were also some discrepancies between the “traditional” marker-based phylogeny of
Tanglegram comparing the phylogenies inferred by Pirie et al. using traditional markers (Left) and by concatenation using the Erica285 superset (Right). For the Pirie tree, branch lengths are in substitutions per site and node labels show bootstrap percentage. For the concatenation tree, node values indicate SH-alrt/UFBoot when either value was less than 100%.
Tanglegram comparing the phylogenies inferred by Pirie et al. using traditional markers (Left) and by ASTRAL using the Erica285 superset (Right). For the Pirie tree, branch lengths are in substitutions per site and node labels show bootstrap percentage. For the ASTRAL tree, node values indicate local posterior probability values below 1.
In summary, there were some topological conflicts between the Pirie tree and the newly inferred trees, as well as between the trees inferred by different methods using the new targets, but only one of the conflicting relationships (the placement of E. australis) was strongly supported. Overall, the relationships inferred using the new targets were mostly concordant with prior expectations based on previous work and also produced much more strongly supported topologies, with limited conflict within the “VMC clade” localised at a single node surrounded by very short branches.
Table
Results of the fixed effects models of parsimony-informative (PI) sites (number and proportion) as a function of target source for supercontig alignments, using the Erica285 set. More PI sites were recovered by Erica genome-derived targets identified using NewTargets and MarkerMiner, whereas fewer were recovered using the Refinement method. In contrast, the proportion of PI sites was slightly greater in Rhododendron CDS-derived targets for all methods, though the mean difference never exceeded 1%. Numbers in brackets are standard errors.
MarkerMiner | NewTargets | Refinement | ||||
---|---|---|---|---|---|---|
Number | Prop. (%) | Number | Prop. (%) | Number | Prop. (%) | |
(Intercept) | 717.3*** | 9.44*** | 720.8*** | 8.80*** | 374.2*** | 9.11 |
(44.3) | (0.217) | (41.2) | (0.172) | (25.6) | (0.143) | |
Source = Rhododendron CDS | -130.2* | 0.491 | -223.5*** | 0.598** | 180.5** | 0.802 |
(72.3) | (0.354) | (71.4) | (0.299) | (70.2) | (0.393) | |
Observations | 109 | 109 | 78 | 78 | 98 | 98 |
R2 | 0.029 | 0.018 | 0.114 | 0.050 | 0.064 | 0.042 |
Adjusted R2 | 0.020 | 0.008 | 0.103 | 0.038 | 0.055 | 0.032 |
Overall, informativeness as measured by QIRP was relatively high (mean = 0.80 ± 0.15 SD), indicating that the target set was informative for young, short internodes. The proportion of PI sites showed no relationship with QIRP, whereas the total number of PI sites showed a strong positive correlation with QIRP (Fig.
For a given number of PI sites, QIRP values of genome-derived alignments were much higher than those of CDS-derived alignments for the Refinement set (linear model: F(1,96) = 27.0, R2 = 0.21, p < 0.001), but not for the other sets (NewTargets: F(1,76) = 2.82, R2 = 0.023, p = 0.097; MarkerMiner: F(1,107) = 2.93, R2 = 0.018, p = 0.090; Fig.
Regardless of target source, the proportion of intron sequence had a strong and significant positive relationship to QIRP (Fig.
We developed and tested a new target set for Erica phylogenomics using a variety of methods. Overall, we were able to implement effective measures that kept the rate of paralogy and missingness in the resulting target capture data to very low levels. Post-assembly refinement of the target set only slightly reduced the number of targets from 303 to 285, suggesting that the target design approaches effectively identified most undesirable loci. Furthermore, good target recovery in the three non-Erica samples tested (Rhododendron rex, Calluna vulgaris, and Daboecia cantabrica) suggests that the targets could also be applied to these genera, and perhaps even to more distant relatives (i.e., in Ericaceae beyond the Ericoideae). In the supplementary data we also provide a version of the target set including only the Rhododendron-derived, intron-free targets, which users may prefer as a more conservative option when working with Rhododendron or other Ericoideae.
Our results demonstrate that the new target set has excellent phylogenetic informativeness. Notably, one of the major reasons for this was the inclusion of intronic sequences in ca. 70% of the targets used for bait design. Although this approach has rarely been attempted (
Comparing target design methods (MarkerMiner, NewTargets, and Refinement) revealed certain differences in phylogenetic informativeness that are not easy to explain. While the Refinement method (i.e., designing targets from supercontigs assembled from a previous target capture experiment) produced loci with the highest QIRP per number of PI sites on average (Fig.
Looking beyond our specific target set, we expect that the target design methods presented here are generally applicable to any plant group. These include (i) using the NewTargets method of
The authors have declared that no competing interests exist.
No ethical statement was reported.
Primary funding for the project was provided by the Deutsche Forschungsgemeinschaft (PI 1169/1-2). Computing facilities were provided by the University of Cape Town’s ICTS High Performance Computing team (https://hpc.uct.ac.za) and the Centre for High Performance Computing at the University of Bayreuth (https://www.bzhpc.uni-bayreuth.de). Collections were made under permits from CapeNature (CN35-31-8281) and SANParks (CRC/2019-2020/004–2019/V1). Voucher specimens were deposited at the Compton Herbarium (NBG).
MDP conceived the project and acquired funding. NMN provided supervision and resources. SDM generated and analysed the data, and wrote the manuscript. All authors reviewed the manuscript.
Seth D. Musker https://orcid.org/0000-0002-1456-1373
Nicolai M. Nürk https://orcid.org/0000-0002-0471-644X
Michael D. Pirie https://orcid.org/0000-0003-0403-4470
Raw reads are deposited in SRA under BioProject PRJNA1164706. Other associated data are deposited on FigShare at https://doi.org/10.25375/uct.27134208.
Supporting figures and tables
Data type: pdf
Explanation note: figure S1: Graphical illustration of the functionality of TargetVet. figure S2: Venn diagram showing the number of genes initially identified by MarkerMiner for each of the three Rhododendron transcriptomes. figure S3: Heatmap showing the number of paralogs (i.e., number of gene copies) identified by HybPiper’s length-based method, in which a targets is flagged for a given sample if its secondlongest assembled contig is more than 70% the length of its longest assembled contig. Targets and samples are arranged by mean number of copies. figure S4: Paralogy (P) estimated using the actual target sequences versus using their CDS versions. The solid line shows the linear regression line while the dashed line shows the 1:1 line. Points colours indicate missingness (M). figure S5: Tanglegram comparing the phylogenies inferred by concatenation (IQ-TREE; Left) and by ASTRAL (Right) using the full Erica303 target superset. For the concatenation tree, branch lengths are in substitutions per site and node labels are SH-alrt/UFBoot percentages. For the ASTRAL tree, branch lengths represent coalescent units (except for terminal branches which are are arbitrarily set to 1 as they are not estimated by ASTRAL) and node labels show posterior probability support. Nodes with full support are unlabelled. The trees are fully bifurcating and are rooted along the branch between the Erica and non-Erica samples arbitrarily for display purposes. table S1: Results of the fixed effects models of supercontig length as a function of target source showing that longer supercontigs were recovered by Erica genome-derived targets identified using NewTargets and MarkerMiner, whereas longer supercontigs were recovered by Rhododendron CDS-derived targets identified using the Refinement method. R2 indicates the fit of the full model, while Within R2 indicates the fit when fixed effects are ignored. Numbers in brackets are standard errors. table S2: Voucher information. Unless otherwise noted, collections were made by the author. Specimens have been deposited at NBG.