Plastome of Quercus xanthoclada and comparison of genomic diversity amongst selected Quercus species using genome skimming

Abstract The genus Quercus L. contains several of the most economically important species for timber production in the Northern Hemisphere. It was one of the first genera described, but genetic diversity at a global scale within and amongst oak species remains unclear, despite numerous regional or species-specific assessments. To evaluate global plastid diversity in oaks, we sequenced the complete chloroplast of Quercus xanthoclada and compared its sequence with those available from other main taxonomic groups in Quercus. We quantify genomic divergence amongst oaks and performed a sliding window analysis to detect the most variable regions amongst members of the various clades, as well as divergent regions occurring in specific pairs of species. We identified private and shared SNPs amongst oaks species and sections and stress the need for a large global assessment of genetic diversity in this economically and ecologically important genus.


Introduction
The genus Quercus in Fagaceae, a large and locally dominant family of trees in temperate forests of the Northern Hemisphere and in major regions of the tropics and subtropics, is one of the first plant genera described in history (Linnaeus 1753). Fagaceae comprise 8 well-recognised genera, holding ±900-1100 economically and ecologically important species (Manos et al. 2001). In the family, Quercus is the largest genus, with ±500-550 accepted species of broad-leaved, deciduous or evergreen trees and shrubs (Denk et al. 2017;Nixon 2006). Although best known from Europe and North America, it extends well into Central and South America, as well as the Asian tropical regions (Indo-China, Malesia) with large numbers of species (Soepadmo and Steenis 1972;Pulido et al. 2006;Phengklai 2008). South America appears to one of the most recent areas to be colonized by the genus, with Q. humboldtii occurring as far south as Colombia (Pulido et al. 2006). Until recently, the genus was divided in five subgenera and sections, based on molecular (Oh and Manos 2008), pollen morphology (Denk and Grimm 2009), morphological (Luo and Zhou 2002) and historical treatments (Camus 1936(Camus -1954. This was changed with a major update of the infrageneric classification (Denk et al. 2017). The genus is now divided into two subgenera holding a total of 8 sections (subg. Quercus: Protobalanus, Ponticae, Virentes, Quercus, Lobatae; and subg. Cerris: Ilex, Cyclobalanopsis, Cerris) (Denk et al. 2017). China, where oaks are among the main components of southern broad-leaved evergreen forests, holds the highest number of Asian species (Huang et al. 2009;Strijk 2019). Quercus and other Fagaceae are often the dominant element in tropical and subtropical evergreen forests (Huang et al. 1998) and, like Dipterocarpaceae in Asia, represent an exceptional ecological and economical resource (Cvetković et al. 2017;2019).
Chloroplast DNA loci have been widely used in plant studies, both for evolutionary studies and for identification purposes, due to their natural abundance in plant cells (≈3-5% of the cell DNA content), when compared to nuclear DNA. In angiosperms, the chloroplast genome is a circular molecule (76-217 kilobases), with a conserved structure of two inverted repeats (IR) separated by small (SSC) and large (LSC) single-copy regions (Jansen and Ruhlman 2012). However, until recently, the number of regions in the chloroplast genome, used to address evolutionary, taxonomic and biodiversity questions, remained limited. With the rapid development of Next Generation Sequencing (NGS) approaches, analysing the entire sequence of the chloroplast using a genome skimming approach (Straub et al. 2012) has become common practice and efficient, allowing for high resolution phylogenies (Ripma et al. 2014), resolved problematic taxonomic placements (Bock et al. 2014;Zedane et al. 2016), and estimates of genomic biodiversity (Särkinen et al. 2012;Staats et al. 2013;Bakker et al. 2016). Although geography can be a major factor in shaping plastome diversity at small scales (Pham et al. 2017), dominant patterns of genomic diversity are shaped by evolutionary processes. In oaks, plastome sequences of about 20 species have been studied and are currently available online (sometimes with numerous conspecific accessions, of which many lack essential voucher information), including Q. rubra (Alexander and Woeste 2014), Q. spinosa (Du et al. 2015), Q. aquifolioides and Q. aliena (Lu et al. 2016). These sequences can thus be used in reference-guided assemblies (Ripma et al. 2014) and applied to assess the chloroplast diversity in the genus. These species -in addition to the Q. xanthoclada chloroplast sequence we generated -belong to the four most diverse lineages (representing more than 97% of species-level diversity). The main aim of our study was to assess levels of genetic diversity contained within the chloroplast, using a subset of oaks from key clades and using a genome skimming approach to reconstruct the complete sequence of the chloroplast for Q. xanthoclada. Using xanthoclada, we performed phylogenetic and sliding window analyses, singlenucleotide polymorphisms (SNPs) detection and genomic comparisons of diversity.

Chloroplast reconstruction
Genomic DNA was extracted from 0.1 g of silica gel-dehydrated leaves using a protocol modified from Healey et al. (2014). Modifications were as follows: genomic DNA was extracted in 15 ml tubes, using 6 ml of extraction buffer, incubated at 65 °C for 60 min and two volumes of temperate absolute ethanol were added for precipitation, without incubation. Library construction and sequencing were performed by Novogene (Beijing, China), using NEBNext Ultra II DNA Library Prep Kit (Ipswich, Massachusetts, USA) and an Illumina HiSeq2500 platform (San Diego, California, USA), respectively, following specifications from the manufacturer. One Gb of raw data was generated and was imported in Geneious R11 (Biomatters Ltd, Auckland, New Zealand). Raw reads were trimmed according to their 5' and 3'-end quality and then mapped against the available chloroplast of Q. rubra (NC020152). Annotations were made using cpGAVAS (Liu et al. 2012) and boundary validation was performed, using ORF Finder (NCBI), with manual adjustments. The complete plastid genome was submitted to GenBank under accession number KU382355. The GenBank flatfile was used to generate a circular plastid genome map using the OrganellarGenomeDraw (OGDRAW) (Lohse et al. 2007).

Phylogenetic reconstruction
Following the recommended best practices for complete organellar sequencing (Botero-Castro et al. 2016), we performed a phylogenetic analysis to confirm the accuracy of our reconstructed plastid and our sample identification. We retrieved all the complete chloroplasts available in GenBank for the Fagales order (accessed 25 January 2019) and built both maximum likelihood and Bayesian trees using the PHYML (Guindon et al. 2009) implementation in Geneious R11 and BEAST 2.3.1 (Bouckaert et al. 2014), respectively. The ML tree was built using all positions, aligned using MAFFT (Katoh and Standley 2013) with default parameters and bootstrap values were calculated using 1000 replicates. We conducted BEAST analyses using separately aligned coding regions with MAFFT (Katoh and Standley 2013), --auto setting and then concatenated sections. Using only genes with annotations in all species, we excluded all those with ambiguous annotations. Analyses were performed using a GTR+I+G substitution model (4 gamma categories), a strict molecular clock model, 200 millions generations (burnin 10%) and Trigonobalanus doichangensis as outgroup. We checked chain convergence and Effective Sample Size (ESS) values > 200 using Tracer 1.6 (Rambaut et al. 2014).

Comparison amongst oak species
We used the retrieved oak chloroplasts to estimate the genetic divergence amongst chloroplasts using MEGA6 (Tamura et al. 2013). We used an in-house script (available upon request) to evaluate the variability of aligned sequences amongst these five species, with a window length of 500 bp and a step size of 250 bp. Pair-to-pair comparisons were visualised using mVISTA (Frazer et al. 2004). We also assessed the number of private and shared SNPs amongst all the combination of species using a specifically developed script (available upon request). We applied Tajima's relative rate test (Tajima 1993) on each species pair, using Lithocarpus balansae as outgroup to test the relative rate of evolution of Q. xanthoclada relative to other oak species.

Chloroplast reconstruction
62,060 trimmed reads were mapped on the Q. rubra chloroplast sequence, for a total linear length of 162,328 bp. The main sequencing depth was 47.7 (min: 2; max: 90; S.D.: 12.5), covering 100% of the reference sequence (161,304 bp). The mean mapping confidence was 35.3, with 94.5% of the bases with mapping quality > Q20, and 89.0% with quality > Q30. The frequencies of each nucleotide were 31.1% (A), 18.7% (C), 18.0% (G) and 32.2% (T), with 439 positions unresolved (N). The properties of the Quercus xanthoclada plastid are shown in Table 1 and in Fig. 1. The chloroplast molecule was 160,988 bp long and was 163 bp, 67 bp and 573 bp longer than those of Q. spinosa, Q. aliena and Q. aquifolioides, respectively. In contrast, it was 316 bp shorter than the chloroplast sequence of Q. rubra. The overall GC content was 36.7%. The chloroplast structure conforms to the standard found in most angiosperms, with two IRs (25,840 bp) separated by a LSC (90,353 bp) and a SSC (18,955 bp). The plastid genome contains 129 genes, including 90 coding proteins, 31 tRNA and 8 rRNA (Fig. 1). Amongst them, six (rpl2, ycf15 and ndhB duplicated in the IRs) and two included one or two introns, respectively.

Phylogenetic analysis and comparison amongst oak species and sections
The maximum likelihood (ML) tree of ten Quercus chloroplasts available in GenBank shows that Q. xanthoclada is closely related to Q. spinosa and to the clade formed by Lithocarpus, Castanea and Castanopsis (Fig. 2, left). In the Bayesian analysis, Quercus splits into two clades, one grouping Q. xanthoclada and Q. spinosa, the other containing the remaining species (Q. aliena, Q. rubra, Q. aquifolioides) (Fig. 2, right). The latter is the sister group of the clade comprising Lithocarpus, Castanea and Castanopsis species. Both trees are in general agreement, except for the placement of the clade containing Q. xanthoclada, that diverged after the other Quercus in the ML tree and before Q. xanthoclada this split in the Bayesian tree. Nodal support was high in both trees for all groups, except the Q. xanthoclada -Q. spinosa clade (ML, 74% bootstrap support) and the Lithocarpus divergence (BEAST, 0.85 PP). In both trees, the internal branches are very short, indicating more mutational events occurred in each species than amongst genera or clades (with the notable exception of Castanea). Furthermore, considering only the Quercus species, Q. xanthoclada appears to have evolved significantly faster than the other oak species (Table 2). All five oak species exhibited high overall similarities (99.4-99.6%) ( Table 3), but the 3766 SNPs detected were distributed unevenly across and within the plastid alignment (Fig. 3). Unsurprisingly, most of the variability amongst species was concentrated in the intergenic spacers and the gene introns. Four regions were especially SNPs-rich: the 5'end of the trnS-GCT-trnR-TCT spacer (49 SNPs, 750 bp), the psbM-trnD-GTC spacer (50 SNPs, 750 bp), the petA-psbJ spacer (55 SNPs, 750 bp) and the 3'end of ndhA (41 SNPs, 750 bp). Two regions exhibited a dramatic increase in the number of indel positions, namely the psbZ-rps14 interval (comprising trnI-TAT, trnfM-CAT and trnG-GCC, 348 indel positions per kb) and the 3' end of ndhA (188 indel position per kb). However, the high number of indel positions in the first of these intervals actually represents regions where more than 70% of the positions were deleted in at least one species (results not shown). The less variable regions were both the IRs, with almost no SNPs or indels.
However, this overall pattern varied when considering each species pair (Fig. 4). Despite most of the overall divergent regions being equally divergent in all species,  several variable regions appeared to be more species specific: atpF/atpH, the 5'end of trnT-GGT and the 5'end of the ndhA intron in Q. spinosa, the 3'end of trnS-GCT/trnR-TCT, the intron of rpoC1 and the trnD-GTC/trnY-GTA spacer in Q. rubra. In addition, both the psbZ/trnI-TAT (comprising trnG-GCC) and the rps15/ycf1 spacers were more divergent from Q. xanthoclada in Q. rubra and Q. aliena than in Q. spinosa and Q. aquifolioides. Interestingly, the divergent portions of the rpl32/trnL spacer were different for each species. No notable divergent regions were found in the two IRs in all species pairs.

Discussion
The genome-skimming approach is now widely used to reconstruct chloroplasts in angiosperms (Straub et al. 2012;Bock et al. 2014;Malé et al. 2014;Weitemier et al. 2014;Mandel et al. 2015) and an increasing number of Fagales are available (Alexander and Woeste 2014;El Mujtar et al. 2014;Hinsinger and Strijk 2015). Since the first Fagaceae (C. concinna; Hinsinger and Strijk 2015), the genome skimming approach has been proven to be useful in many non-model organisms (Kane et al. 2012;Nikiforova et al. 2013;Curci et al. 2016;Yu et al. 2018), despite a few disadvantages (Straub et al. 2012;Du et al. 2015), as genome skimming is a PCR-free method and a higher amount of starting material is required. Moreover, best results require high quality DNA (i.e high molecular weight DNA, usually > 10 kb, with little degradation), which is incompatible with large-scale (i.e. phylogenetic or population genomic) sampling. Nonetheless, the protocols can easily be adapted (e.g. using specific library construction kits (Zedane et al. 2016) or skipping the shearing step in degraded samples like herbarium or archaeological specimens (Särkinen et al. 2012;Staats et al. 2013). Several examples showcasing the strength of these approaches have been published. For example, Zedane et al. (2016) explored the evolutionary relationships of Hesperelaea, an extinct Oleaceae, by reconstructing the plastome of a century-old herbarium specimen and Renner et al. (2019) identified 3,500 years old archaeological remains as watermelon, based on their plastome sequence. With the constant progress in sequencing technology and library construction and the development of genomic resources for oaks (Kremer et al. 2012;Hipp et al. 2014), we can expect that these limitations will soon be overcome. In addition to the complete chloroplast sequence, the genome skimming can also be use to retrieve nuclear regions found in high copy number in the genome, such as the nuclear ribosomal cistrons (NRC) (Straub et al. 2012;Ripma et al. 2014;Weitemier et al. 2014). To evaluate the relative variability of the complete chloroplast sequence versus the nuclear ribosomal cistrons, we attempted to retrieve the available raw data for the considered oak species from the Short Reads Archive (SRA). Despite the stringent chloroplast purification protocol described in Du et al. (2015), Q. spinosa was the only species where we were able to retrieve the NRC, following the same protocol as described above, mapping the trimmed reads against the nuclear internal transcribed spacer of Q. robur (FM244246) (results not shown).
In our study, the plastome of Q. rubra was the largest in size in oaks (161,304 bp), but in other studies, it was described as the second smallest in oaks (Alexander and Woeste 2014). Given the species diversity in oaks and with an increasing number of plastomes available, this view is likely to change in the near future.
Quercus is widespread throughout the whole of the northern hemisphere, but in our study, three of the four available species came from Asia. To fully capture the chloroplast sequence diversity on a global scale, future inclusion of and genomic comparison with American and European Quercus is needed. Although species in this study are members of different recognised sections and subgenera, it is likely that they represent only a fraction of the total diversity within each of these groups and inclusion of additional members will reveal more about the extent and distribution of plastome diversity on various taxonomic and global scales.
Interestingly, one of the regions showing a relatively high level of variation amongst oak species is the rpl32-trnL spacer. As the observed variations are located in different portions of the region, it is likely that these SNPs and indels represent section specific diagnostic regions. Indeed, rpl32-trnL has been hypothesised as a DNA-barcode locus in several groups (Shaw et al. 2007;Arca et al. 2012). Other proposed DNA-barcode loci showed different levels of variation, rbcL being the least variable, with no indels and very few SNPs, as previously demonstrated in several woody plants (Chase et al. 2005;Roy et al. 2010;Arca et al. 2012;Clement and Donoghue 2012). In contrast, trnH-psbA has been shown to be highly variable (Piredda et al. 2010;Arca et al. 2012;Clement and Donoghue 2012;Hinsinger et al. 2013), mostly for SNPs, while rpl32-trnL was more variable for indels. Briefly, the proposed standard DNA-barcode loci (namely rbcL and matK (Hollingsworth et al. 2009)) would likely fail in oaks, as demonstrated for certain subclades or specific geographic areas (Piredda et al. 2010;Simeone et al. 2013). However, not only the divergence of putative barcode loci amongst distantly related species has to be assessed, but also between closely related species, as well as amongst different populations. We highlight that rpl32-trnL, as well as trnH-psbA, should be assessed for DNA-barcoding purpose in oaks, in addition to a thorough re-assessment and estimation of the usefulness of the proposed standard loci rbcL and matK.
Most of the SNPs were either specific to one species or shared by four species and only a few shared by only two species (Fig. 5). Q. xanthoclada exhibited the highest number of species specific SNPs (429), whereas Q. aliena and Q. rubra segregated the lowest number of SNPs (277). Accordingly, Q. xanthoclada was excluded from the four species group that shared the highest number of SNPs (418), followed by Q. aquifolioides (348).
Although these results, in combination with those obtained in Tajima's relative rate test, seem to suggest a relative distinction of the Cyclobalanopsis species and their previous separation as a separate subgenus in the genus Quercus, this is not corroborated by any other data in our study. Neither the complete chloroplast phylogeny, nor previous studies based on nuclear and chloroplast loci (Manos et al. 2001;Oh and Manos 2008), justify this separation of Q. xanthoclada (and other taxa under the generic or subgeneric header of "Cyclobalanopsis") from other oaks. Our work will allow for the development of new loci to be used in comparative phylogeography of the different section and subgenera (i.e. primers that are easily transferable amongst the different sections that can occur in sympatry), as well as open up new perspectives for conservation, management and the use of DNA fingerprinting to aid tracking of wood products from subtropical and tropical Asian oak species.