Genetic diversity and population structure of Cinnamomum balansae Lecomte inferred by microsatellites

Abstract Cinnamomum balansae Lecomte (Lauraceae), an economically important forest tree, is distributed in the tropical forests of central and northern Vietnam, which has been threatened in recent decades due to the destruction of its habitat and over-exploitation. The genetic diversity and population structure of the species have not been fully evaluated. We used a set of 15 microsatellites to analyze 161 adult trees from 9 different populations, representing the geographical distribution of C. balansae. Ninety-two different alleles were identified. Here our results showed a low genetic diversity level with an average H o = 0.246 and H e = 0.262, and a high level of genetic differentiation (F ST = 0.601). The bottleneck tests indicated evidence of a reduction in the population size of the two populations (TC and CP). Additionally, all three clustering methods (Bayesian analysis, principal coordinate analysis, and Neighbor-joining tree) were identified in the two genetic groups. The Mantel test showed a significant positive correlation between genetic distance and geographic distance (R 2 = 0.7331). This study will provide a platform for the conservation of C. balansae both in ex-situ and in-situ plans.


Introduction
Cinnamomum balansae Lecomte, a large and evergreen tree of the Lauraceae family, is distributed in the tropical forests of central and northern Vietnam [1][2][3][4]. It is found at an elevation above 200 m in secondary forests on the ancient alluvial rocks and granite that have low relief and gentle slopes. The species prefers precipitation of 800-2,500 mm and an annual mean temperature of 20-22°C [2,4]. C. balansae can reach a height of 30 m with a diameter of 85-90 cm at breast height. Flowers are small, greenish-yellow, and appear between March and May. Fruiting appears from April to July, with fruit and maturation occurring in October [2]. Its bark has oil and camphor. The fruit oil and fat are used to make soap. Wood is used for furniture. In the past few decades, this species has been over-exploited by local people and forestry enterprises [1]. Its habitats have been destroyed and degraded [5,6]. For that reason, C. balansae is listed as an endangered species in the IUCN Red List [7] and the Vietnam Red Data Book [1].
The adaptive and evolutionary potential of a species depends on its genetic diversity [8]. Conservation and management of a species require knowledge of its ecological characteristics and genetic variation within and between populations [9][10][11]. To obtain such information, particularly a better understanding of genetics, powerful and biological techniques are required [12,13]. Molecular markers such as allozyme [14], randomly amplified polymorphic DNA (RAPD) [15,16], and chloroplastic DNA [17] have been used to investigate the genetic diversity of species in the Lauraceae family. Microsatellites (SSRs) are widely distributed throughout the nuclear genome, displaying high polymorphism and reproducibility [18][19][20]. These markers have been widely used to study the Lauraceae species [21][22][23][24][25][26][27][28][29]. In the current study, microsatellites were used to assess the genetic diversity and population structure of C. balansae and provide a platform for effective conservation strategies of this species in Vietnam.

Plant materials
With Illumina sequencing for the development of microsatellites, we collected three plant tissues, including roots, leaves, and stems of C. balansae (one individual) from a wild population (Ba Vi National Park, Hanoi city). All sample tissues were immediately frozen in liquid nitrogen and stored at -80°C until RNA extraction.
To assess the genetic variation in and population structure of C. balansae, the study area for this species was carried out in nine sites, one each in Hanoi, Phu Tho, Tuyen Quang, Yen Bai, Son La, Hoa Binh, Ninh Binh, Thanh Hoa, and Nghe An, representing its natural range in Vietnam (Table 1, Figure 1). Sampling locations were recorded using a global positioning system (GPS). The inner bark from 161 adult trees was randomly sampled for 9 known populations in central and northern Vietnam. Samples were placed into plastic bags containing silica gel in the field and then transferred to the Laboratory of Molecular Biology of Vietnam-Russia Tropical Centre and stored at −30°C until DNA extraction.

DNA/RNA extraction, cDNA sequencing, and microsatellite loci identification
For Illumina sequencing, total RNA from each sample was isolated using the OmniPlant RNA Kit (DNase I) (www.cwbiotech.com). Both quantity and quality of total RNA were checked through Nanodrop ND-2000 spectrophotometer (Thermo Electron Corporation, USA) and Agilent 2100 Bioanalyzer. An equal amount of total RNA from each sample was pooled together and sent to Breeding Biotechnologies Co. Ltd for cDNA library construction and transcriptome sequencing using Illumina HiSeq™4000. Cleaned mRNA was used for cDNA library construction extracted from 200 µg total RNA using Oligo (dT). The cDNA first strand was prepared from random showing adaptor contamination, length < 36 bp, and low-quality value (quality <20) higher than 15% were eliminated. Trinity [31] with default parameters were used for de novo assembly of the cleaned reads. Then, the TIGR Gene Indices clustering tool (TGICL) v2.1 [32] was used to cluster and eradicate redundant transcripts and identified unigenes for further analysis. SSRs were detected in the Simple Sequence Repeat Identification Tool program (SSRIT, http://www.gramene. org/db/markers/ssrtool). Only unigenes longer than 1 kb were included in the EST-SSR detection. The parameters were set to detect perfect, di-, tri-, tetra-, penta-, and hexanucleotide motifs with a minimum of 6, 5, 5, 5, and 5 repeats, respectively. Identification of microsatellite loci showed simple sequence repeats and was deposited in GenBank.
Total genomic DNA was extracted from the samples using the Plant Genome DNA Extraction Kit (BioTeke, Beijing, China). The quality of total DNA was verified by 1% agarose gel electrophoresis and NanoDrop2000 (BioTeke Instruments, Winooski, VT, USA). The concentration was then diluted to 20 ng/µL and stored at −20°C.

Microsatellite amplification
Polymerase chain reaction (PCR) was performed in a 25 μL of reaction mixture containing 2.5 µL of template DNA, 12.5 µL of 2X Taq Master mix, 1 µL of each primer, and 8 µL of deionized water. All SSR primers were designed using Primer 5.0 software [33]. The main stated selection criteria for the screening of primers were as follows GC content of 45-55 with 55% as the optimum; primer length of 18-24 bp with 20 bp as the optimum length; annealing temperature from 55 to 65°C, with 60°C as the optimum temperature; and PCR product size ranging from 100 to 300 bp. Fifteen SSR polymorphic markers were used in the present study ( Table 2). The PCR amplification was carried out on a Mastercycler X50s Eppendorf, under the following thermal cycler: initial denaturation at 94°C for 2 min, followed by 40 cycles consisting of 1 min at 94°C for denaturation, 30 s alignment at the annealing temperature for each primer pair at 55°C and 1 min at 72°C for extension, and 10 min at 72°C for final cycle to complete the extension of any remaining products before holding the samples at 4°C until they were analyzed. The amplification products were separated on 8% polyacrylamide gels in 1 × TAE buffer using the Cleaver Scientific electrophoresis system (UK) and visualized with GelRed TM 10000X. The sizes of the PCR products were detected and analyzed using GelAnalyzer software with a 20 bp DNA ladder (Invitrogen).

Data analysis
Genetic diversity was estimated using GenAlEx [34] based on SSR allele frequencies, including the number of alleles per locus (N a ), effective alleles (N e ), private alleles (N p ), the proportion of polymorphic loci (P%), the coefficient of total inbreeding (F IT ), the genetic differentiation (F ST ), Nei's genetic distances among the populations, the observed heterozygosities (H o ), and the expected heterozygosities (H e ). Gene flow between populations (N m ) was calculated using the F ST value: N m = (1/F ST − 1)/4. The inbreeding coefficients (F IS ) value was also calculated for each population by Fstat [35]. Tests for deviation from the Hardy-Weinberg equilibrium at each locus and  the linkage disequilibrium for each locus pairwise combination in each population were also performed by Genpop v.4.6 [36]. Arlequin v.3.0 [37] was used to perform an analysis of molecular variance (AMOVA) for the estimation of genetic divergence as PhiPT, and its associated significance based on 10,000 permutations. Testing for recent bottleneck events for each population using Bottleneck v.1.2 [38] was applied with the stepwise mutation model (SMM) and two-phase model (TPM), and their significance was tested. The proportion of the SMM was set to 70% under default settings. The significance of these tests was evaluated by the one-tailed Wilcoxon signed-rank test. Principal coordinate analysis (PCoA) was implemented based on G′ ST values [39] using GenAlEx. Neighbor-joining (NJ) tree was constructed genetic relationships among populations based on a matrix of F ST values using Pop-tree2 [40]. Bayesian analysis of population structure was performed with Structure v.2.3.4 [41] to determine the optimal value of the genetic cluster (K). Setting the admixture model with correlated allele frequency and the ancestry models, 10 separate runs of the number of clusters (K) in the dataset were performed from 1 to 10 for each K value at 500,000 Markov Chain Monte Carlo (MCMC) repetitions and a 100,000 burn-in period. The optimal value of K was detected using Structure Harvester [42] based on the ΔK value by Evanno et al. [43].
A Mantel test [34] was applied to determine the relationships between genetic and geographic distances among the C. balansae populations based on 999 permutations.

Genetic diversity
A total of 92 different alleles were detected from 15 SSR markers in C. balansae in the tropical forests of Vietnam. Allele numbers ranged from 4 (VD51 and VD81) to 9 (VD08), with an average of 6.1 ( Table 2). Among these, 35 alleles were private. Loci VD08 and VD49 had the greatest number of private alleles (4). One private allele at 2 loci (VD31 and VD51), 3 private alleles at 3 loci (VD07, VD75, and VD108), and 2 private alleles at 8 remaining loci were recorded.
Genetic diversities for C. balansae populations are presented in Table 3. The average percentage of polymorphic loci was 68.15%, ranging from 60% in the 4 populations of BV, SL, TQ, and NA to 86.67% in the TH population. The average value of alleles (N a ) per population was 1.9, ranging from 1.6 (SL) to 2.2 (YB and TH).
The highest number of private alleles (N p = 8) was detected in the CP population, followed by three populations of YB, BE, and QH with N p = 6, two populations of MC and DH with N p = 4, the BV population (N p = 2), and two populations (TC and SD) with N p = 1. The N e value varied from 1.

Population structure and Mantel test
The average F IT value was 0.624, ranging from 0.06 (VD108) to 0.992 (VD25), indicating a deficiency of heterozygosity in the populations ( Table 2). The genetic differentiation among populations (F ST ) varied from 0.285 (VD108) to 0.992 (VD25) with an average of 0.601, indicating high differentiation among all the populations and reflecting low gene flow (N m = 0.211). Our results of the AMOVA showed that the proportion of genetic variation among the populations was higher (74%) than within populations ( Table 4). The PhiPT value and gene flow among populations were 0.744 and 0.086, respectively.
The PhiPT value for the variation among populations was significant (p < 0.001). The Mantel test revealed a significantly high correlation between genetic distance and geographical distance (R 2 = 0.7331) (Figure 2), indicating high genetic differentiation among populations.
Genetic clusters were distinguished through the NJ analysis by using Poptree2 and the PCoA based on G′ ST values (Figure 3). Two different clusters were generated. The first cluster included three populations (CP, BE, and QH) in the central region, while the second included the remaining six populations (DH, SD, BV, MC, YB, and TC) in the northern region. An admixture model was performed to evaluate the population structure of the 161 individuals of C. balansae based on the Bayesian analysis using the Structure program. The results showed that the optimum number of genetic groups (K) was two, with the highest value of delta K (1,145.4) obtained from Structure Harvester might an optimum number of genetic clusters ( Figure 4). Each individual in each population was assigned to one of the two genetic clusters. The ancestry coefficients of each individual and each population are shown in Figure 4. The red cluster represented all individuals from the populations of BV, MC, DH, TC, SD, and YB, while the green cluster represented CP and QH populations. All the individuals in the BE population were separated and closer to the red cluster than the green cluster.

Discussion
The genetic diversity of a species is often correlated with its history and ecology [44][45][46]. The genetic diversity is positively correlated with geographic distribution range and population size [47]. Species with widespread distribution, large population size, longevity, and outcrosses often maintain high genetic diversity compared to species with a narrow distribution [48]. Here our results indicated that C. balansae has low genetic diversity    [50]. Although this species has a widespread distribution and long life, the decline in genetic diversity may be related to anthropogenic disturbances, such as deforestation, habitat degradation, and overexploitation of C. balansae [51]. Furthermore, in the current study, positive F IS values were observed in the five populations of BV, TC, SD, CP, and QH, while the negative F IS values were found in the remaining four populations of MC, DH, YB, and BE. A heterozygosity deficit was determined by the Bottleneck analysis, which detected a reduction in the population size of C. balansae. Small populations might be the result of inbreeding, which can decrease genetic variation. However, the bottlenecks also show a rapid reduction in the population sizes of TC and CP in recent decades. A high level of genetic differentiation (F ST = 0.601) was determined, and reflected low gene flow (N m = 0.211) among C. balansae populations. This is in contrast to previous reports on widespread species, such as Dalbergia cochinchinensis (F ST = 0.23) [52] and Erythrophleum fordii Oliv (F ST = 0.18) [53]. The dispersal of pollen and seeds plays an important role in maintaining high genetic diversity within and low genetic differentiation among populations [54,55]. The large geographic distance between C. balansae populations may be responsible for high genetic differentiation. Moreover, habitat fragmentation can increase the isolation between populations. Thus, these suggested the geographic distance and habitat fragmentation as a barrier to restrict gene flow between populations. Population structure and genetic relationships are important for establishing the appropriate scale and subunits for species conservation and management [56]. The study also identified two genetic groups using NJ, Structure, and PCoA: one in northern Vietnam (DH, SD, BV, MC, YB, and TC populations), and the other in central Vietnam (BE, CP, and QH populations). The Mantel test shows that there are almost obvious geographical boundaries between the studied populations, and it indicated the effects of the geographic distance on the number of migrants per generation inferred from the studied loci. This suggests that past human activities can have a significant impact on the present population structure and genetic diversity of C. balansae in Vietnam. Thus, the best conservation strategy should be to establish an ex-situ conservation site for this species through collecting seeds, germinating, and cultivating seedlings. Collecting seeds from individuals of all the populations should be conducted so that random loss of genetic variability due to loss of individual ecotypes should be avoided and move towards new self-sustaining populations. The establishment of a new large population of C. balansae should provide new alleles, which might improve its fitness under different environmental stresses to prevent the gradual reduction in the population of C. balansae.

Conclusion
These results confirmed that C. balansae had low genetic diversity and high genetic differentiation among populations in the evergreen tropical forests. The nine populations of C. balansae were divided into two genetic groups related to the geographic distance. Our study will contribute to understanding the population history of this species and the genetic relationships among the C. balansae populations.