Distinguish Dianthus species or varieties based on chloroplast genomes

Abstract
Most plants belonging to the widely distributed genus Dianthus are used for gardening. Interspecific hybridization among Dianthus species has blurred their genetic backgrounds. To obtain more genomic resources and clarify the phylogenetic relationships among Dianthus species, the chloroplast genomes of 12 Dianthus species, including nine Dianthus gratianopolitanus varieties, were analyzed. The chloroplast genomes of these 12 species exhibited similar sizes (149,474–149,735 bp), with Dianthus caryophyllus having a chloroplast genome size of 149,604 bp marked by a significant contraction in the inverted repeats. In the chloroplast genomes of Dianthus, we identified 124–126 annotated genes, including 83–84 protein-coding genes. Notably, D. caryophyllus had 83 protein-coding genes and lacked rpl2. The repeat sequences of the chloroplast genomes were consistent among species, and sequence variation was limited. However, notable gene replacements were observed in the boundary regions. Phylogenetic analysis of Dianthus indicated that D. caryophyllus and D. gratianopolitanus were most closely related, suggesting that the degree of variation among the nine Dianthus varieties was no less than the variation observed between species. These differences provide a theoretical foundation for a more comprehensive understanding of the diversity within Dianthus species.


Introduction
Dianthus is a genus comprising approximately 300 species in the Caryophyllaceae family. Species within this genus are widely distributed across Europe and Asia, particularly in the Mediterranean region, as well as in America and Africa [1]. Several species, notably Dianthus caryophyllus L., D. barbatus Linn., D. chinensis L., D. plumarius L., and D. superbus L., are used extensively in horticulture because of their beautiful flowers [1]. Moreover, many Dianthus species have been used in traditional Chinese medicine [2]. Two novel triterpenoid saponins with antimicrobial activity have been isolated from D. erinaceus Boiss. [3]. Furthermore, recent years have seen progress in research on Dianthus species, including investigations into their morphological structure [4], molecular markers [5,6], environmental adaptation [7], and breeding techniques [1,8]. Because Dianthus plants are so widely used, the most commercially important cultivars are vegetatively propagated hybrids, and transgenic varieties have also been developed [8,9]. The use of advanced breeding technologies has expanded the diversity of Dianthus species, making these variants difficult to distinguish. A previous study indicated that morphological traits are the least effective means of identification, emphasizing the need for genetic markers [5].
Chloroplasts are vital organelles within green plant cells, closely involved in processes such as photosynthesis, carbon fixation, amino acid synthesis, and various other cellular functions [10]. In recent years, chloroplast DNA (CpDNA) has garnered increasing attention. According to the endosymbiotic theory, CpDNA is assumed to have originated from primordial photosynthetic prokaryotic cells related to cyanobacteria [11]. CpDNA typically exhibits four regions: one large single-copy (LSC) region, one small single-copy (SSC) region, and a pair of inverted repeats (IRa and IRb) [12]. The absence of IRs in CpDNA has also been reported in many species, such as Vicia sepium [13]. CpDNA is uniparentally inherited. Previous studies have shown that most gymnosperms inherit their chloroplasts paternally [14], whereas in angiosperms, chloroplast inheritance is primarily maternal [15]. Because the substitution rate of CpDNA is lower than that of nuclear genomes, CpDNA is frequently employed in the construction of evolutionary trees [16,17]. Extensive gene exchange occurs among the chloroplast, mitochondrial, and nuclear genomes, further diversifying the functions of chloroplasts [18–20]. The expression of genes encoded by CpDNA largely depends on a comprehensive set of factors of nuclear origin [21]. In recent years, CpDNA has been used to distinguish closely related species via comparative analysis [22–26].
D. caryophyllus, commonly known as carnation, is one of the best-known species within the Dianthus genus [27]. The flowers of D. caryophyllus are renowned for their strong aromas and vibrant colors, including red, white, yellow, and green. The original natural flower color is a bright pinkish-purple. They not only have ornamental value as cut flowers but also have medicinal properties. Our research team reported the characteristic CpDNA of D. caryophyllus, which is 147,604 bp long [28]. D. gratianopolitanus Vill., an endangered plant species, has a highly fragmented distribution range comprising numerous isolated populations [29]. To date, nine CpDNA sequences of D. gratianopolitanus varieties have been listed in the National Center for Biotechnology Information Database (https://www.ncbi.nlm.nih.gov/). However, few studies have focused on the genetic relationships and phylogeny of Dianthus species. To explore the genetic characteristics of Dianthus species, a comprehensive comparison of CpDNA was conducted among different species and varieties of Dianthus. This study aims to test the hypothesis that differences in CpDNA characteristics among different species are greater than those among different varieties. The results of this study will not only enhance our understanding of the chloroplast genome characteristics of Dianthus but also provide valuable insights into their genetic diversity.

Data collection and structure comparison
The CpDNA of D. caryophyllus was previously sequenced and published by our research team under the accession number MG989277 [28]. A circular diagram was generated using OGDRAW 1.31 [30] (Figure 1). The GC content and length of the CpDNA as well as the LSC, SSC, and IR regions of these 12 species were calculated using BioEdit V7.0 [31].
The number of protein-coding genes was counted. Structural variations were detected using mVista in the shuffle-LAGAN mode [32]. The sequences were visually checked for their identity using BLAST Ring Image Generator by aligning the 12 Dianthus species genomes with D. caryophyllus as a reference [33]. Simultaneous visual comparisons of the IR/LSC and IR/SSC border regions of the CpDNA from the 12 species were conducted using the drawing software Visio 2013, following the method described by Zhang et al. [34].
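The length and GC-content summary described above can be illustrated with a short Python sketch. It is not the BioEdit procedure used in this study; the function names, the toy sequence, and the region coordinates are hypothetical and serve only to show the arithmetic.

```python
def gc_content(seq: str) -> float:
    """Return the GC percentage of a sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * gc / total


def summarize_regions(genome, regions):
    """Map region name -> (length in bp, GC %).

    `regions` uses 0-based, half-open (start, end) coordinates.
    """
    summary = {"whole": (len(genome), gc_content(genome))}
    for name, (start, end) in regions.items():
        part = genome[start:end]
        summary[name] = (len(part), gc_content(part))
    return summary


# Toy sequence and made-up coordinates, for illustration only.
toy = "ATGCGCAT" + "GGCC" + "ATAT" + "GGCC"
toy_regions = {"LSC": (0, 8), "IRb": (8, 12), "SSC": (12, 16), "IRa": (16, 20)}
for name, (length, gc) in summarize_regions(toy, toy_regions).items():
    print(f"{name}: {length} bp, GC = {gc:.2f}%")
```

With real data, the genome string would come from the GenBank or FASTA record and the coordinates from its annotation.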

Analysis of codon usage bias
Codon usage bias reflects the balance between mutation bias and natural selection. In this study, we described the codon usage bias of CpDNA in all 12 selected plants using relative synonymous codon usage (RSCU) values, which were determined using DAMBE software [37]. An RSCU value of 1.00 indicated that the codon had no usage bias, whereas a value greater than 1.00 indicated a higher than expected usage frequency of the codon. To illustrate the distribution of codon usage preferences among these plants more clearly and intuitively, a heatmap based on the RSCU values was constructed using HemI software (version 1.0) [38].
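As a rough illustration of how RSCU is defined (observed codon count divided by the count expected if all synonymous codons for that amino acid were used equally), the following Python sketch computes RSCU from a list of coding sequences. It is not the DAMBE implementation; the function name and the toy input are hypothetical, and the codon table is the standard genetic code, whose amino acid assignments are the same as those of the plant plastid code.

```python
from collections import Counter
from itertools import product

# Standard genetic code in NCBI "TCAG" order; '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} pooled over in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
        for i in range(0, usable, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases
                counts[codon] += 1

    # Group codons into synonymous families (stop codons form one family).
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid never observed; RSCU is undefined
        expected = total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values


# Toy CDS: Met followed by four Ala codons (GCT x3, GCA x1).
toy_values = rscu(["ATGGCTGCTGCTGCA"])
print(toy_values["GCT"], toy_values["GCA"], toy_values["ATG"])
```

Single-codon amino acids such as methionine and tryptophan always yield an RSCU of 1 under this definition, which is why they carry no information about usage bias.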

Analysis of molecular evolution
Ka/Ks represents the ratio between the number of nonsynonymous substitutions per nonsynonymous site (Ka) and the number of synonymous substitutions per synonymous site (Ks) between two protein-coding genes. It is used to determine whether there is selective pressure on a protein-coding gene [17]. If Ka/Ks > 1, the gene is considered to have undergone positive selection. If Ka/Ks = 1, the gene is considered to have evolved neutrally. If Ka/Ks < 1, the gene is considered to have undergone purifying selection. Ka/Ks was calculated with DnaSP v5, using D. caryophyllus as the reference [39].
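The interpretation rule above can be written as a small helper. This is only a sketch of the decision logic, not part of DnaSP; the function name, the tolerance used to call a ratio "equal to 1", and the handling of Ks = 0 are assumptions made for illustration.

```python
import math


def classify_selection(ka: float, ks: float, tol: float = 0.05) -> str:
    """Classify a gene by its Ka/Ks ratio.

    `tol` is a hypothetical tolerance around 1 for calling neutrality;
    the ratio is undefined when Ks is zero.
    """
    if ks == 0:
        return "undefined"
    omega = ka / ks
    if math.isclose(omega, 1.0, rel_tol=0.0, abs_tol=tol):
        return "neutral"
    return "positive" if omega > 1.0 else "purifying"


# Made-up Ka and Ks values, for illustration only.
for ka, ks in [(0.02, 0.10), (0.30, 0.10), (0.10, 0.10), (0.01, 0.0)]:
    print(ka, ks, classify_selection(ka, ks))
```

In practice, a ratio close to 1 would normally be backed by a statistical test rather than a fixed tolerance.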

Phylogenetic analysis
The MEGA 7 software was employed to construct the phylogenetic tree using the neighbor-joining method, which is a simplified version of the minimal evolution method, for the aforementioned 12 CpDNAs [40]. The Kimura two-parameter model was selected using the corrected Akaike information criterion, and the bootstrap number was set to 100,000. The minimum value of the sum of all branches (S) was used as an estimate of the correct phylogenetic tree.
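For readers unfamiliar with the Kimura two-parameter model, the pairwise distance it yields is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites that differ by a transition and a transversion, respectively. The sketch below computes this distance for two aligned sequences; it is not MEGA's implementation, it does not cover tree building or bootstrapping, and the function name and toy sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID = PURINES | PYRIMIDINES


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in VALID or b not in VALID:
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0.0 or term2 <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))


# Toy alignment with a single transition (A -> G) in ten sites.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```

A matrix of such pairwise distances is the input that a neighbor-joining algorithm would cluster into a tree.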
Results and discussion

CpDNA characteristics of Dianthus species
A comparison of the nine different varieties of D. gratianopolitanus revealed that although there were similarities

Comparative genomic analysis of the Dianthus complete chloroplast genomes
Via multiple sequence alignment analysis of the nine D. gratianopolitanus varieties (Figure 2), we observed significant deviations primarily in the conserved non-coding sequences (CNS), tRNA, and rRNA. Notably, variant sites in the protein-coding sequences were primarily concentrated in the atpF gene in D. gratianopolitanus (dg1255) and D. gratianopolitanus (dg1370), the ycf3 and petD genes in D. gratianopolitanus (dg3051), and the clpP gene in D. gratianopolitanus (dg1602) and D. gratianopolitanus (dg2868).
Compared with the protein-coding sequences of D. caryophyllus, the other species showed notable differences in clpP, atpF, and ycf3, with the most significant divergence observed in rpl16. The analyzed accessions of the other 11 Dianthus species displayed conspicuous deletions in these genes. Overall, the dissimilarities between D. caryophyllus and D. gratianopolitanus (dg1869) were relatively minor, and their gene sequences showed a high degree of similarity. A comparative circular graph visually highlights the presence of genes, deletions, and sequence variations. The results indicate a high degree of sequence conservation among the genomes (Figure 3).
A comparison of the genes at the boundaries of the LSC, SSC, and IR regions (Figure 4) shows that the most significant difference among the nine D. gratianopolitanus varieties lies in psbA (1,062 bp) within the LSC region of D. gratianopolitanus (dg1869). In the eight other varieties, trnH occupied this position. Although there were minor variations in the CpDNA of the nine D. gratianopolitanus varieties, overall, the CpDNA exhibited a high degree of conservation.
Figure 4 also highlights substantial differences between D. caryophyllus and the other 11 species in terms of the genes at the boundaries of different regions. Notably, in D. caryophyllus the LSC/IRa junction features ycf2, whereas rps19 occupied this position in the other 11 species and had a similar segmentation length. At the IRb/LSC junction of D. caryophyllus, a duplicated ycf2 gene replaces the rpl2 gene, whereas rpl2 appears at this junction in the other species. The ycf1 gene in the IRb region is slightly longer in D. caryophyllus, whereas its length differs by only 3 bp among the chloroplast genomes of the other 11 analyzed accessions (Figure 4).

Repeat sequence features
Repeat sequence analysis was conducted on the CpDNA of the nine D. gratianopolitanus varieties, resulting in the identification of 49, 49, 49, 49, 49, 49, 42, 44, and 42 repeat sequences, respectively. These numbers were similar to those found in D. longicalyx (49) and D. moravicus (47) but were generally higher than that of the reference species, D. caryophyllus, which had only 42 repeat sequences (Figure 5a). Not all species contain all four types of repeat sequences. For instance, D. caryophyllus had only 10 forward, 20 palindromic, and 12 reverse repeat sequences.
Only five species contained complementary repeat sequences, and in very limited numbers: D. gratianopolitanus (dg2769) had the most, with only four. In contrast, the average numbers of forward, reverse, and palindromic repeats detected in the 12 CpDNAs were as high as 14, 12, and 20, respectively. As depicted in Figure 5b-d, palindromic repeat sequences were the most numerous, followed by forward and reverse repeats. Additionally, the lengths of palindromic, forward, and reverse repeats were primarily concentrated in the 20-29 bp range, followed by the 30-39 bp range. Repeat sequences longer than 59 bp, all of which were forward repeats, were detected in D. gratianopolitanus (dg1602) and D. gratianopolitanus (dg3051).
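The four repeat types discussed above differ in how the second copy relates to the first: identical (forward), reversed (reverse), complemented (complement), or reverse-complemented (palindromic). The sketch below encodes these conventional definitions; the text does not name the repeat-detection tool, so this is only an illustration of the terminology, and the function name is hypothetical.

```python
from typing import Optional

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def repeat_type(first: str, second: str) -> Optional[str]:
    """Name the relationship between two equal-length repeat copies.

    Checks run in a fixed order, so a pair that satisfies several
    definitions (possible for short or symmetric motifs) is reported
    under the first matching label.
    """
    first, second = first.upper(), second.upper()
    if second == first:
        return "forward"
    if second == first[::-1]:
        return "reverse"
    complemented = first.translate(COMPLEMENT)
    if second == complemented:
        return "complement"
    if second == complemented[::-1]:
        return "palindromic"
    return None


# Toy motif and its four derived copies, for illustration only.
for copy in ["AACG", "GCAA", "TTGC", "CGTT", "GGGG"]:
    print(copy, repeat_type("AACG", copy))
```

Real repeat finders also allow mismatches and enforce a minimum length, which this exact-match sketch omits.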
SSR analysis identified a total of 134 SSRs with 13 combined types in D. caryophyllus CpDNA (Figure 6a). Among these 134 SSRs, 55 mononucleotide, 56 dinucleotide, 8 trinucleotide, 12 tetranucleotide, and 3 pentanucleotide repeats were detected, but no hexanucleotide repeats. Furthermore, the frequencies of short polyadenine or polythymine repeats were significantly higher than those of tandem cytosine (C) or guanine (G) repeats. The most frequent dinucleotide repeat was AT/TA. In addition, the CpDNA of the nine D. gratianopolitanus varieties contained 142, 150, 147). Among the six types of SSR sequences in the 12 species, mononucleotide and dinucleotide repeats were the most common, with the largest numbers ranging from 55 to 67 (Figure 6b). The numbers of trinucleotide and tetranucleotide repeats were similar, whereas pentanucleotide repeats were less frequent, with each species averaging only approximately four. Hexanucleotide repeats were detected in the CpDNA of only four species, each of which had only one; D. caryophyllus did not contain any hexanucleotide repeats.
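A minimal sketch of how SSRs of unit length 1 to 6 can be located with regular expressions is given below. The text does not state which SSR tool or repeat thresholds were used, so the minimum repeat numbers here (10, 5, 4, 3, 3, and 3) are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import re

# Assumed minimum repeat counts per unit length (not taken from the study).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq: str, min_repeats=None):
    """Return sorted (start, motif, repeat_count) tuples; start is 0-based."""
    thresholds = MIN_REPEATS if min_repeats is None else min_repeats
    seq = seq.upper()
    hits = []
    for unit, minimum in sorted(thresholds.items()):
        # A lookahead lets every start position be tested, so a skipped
        # non-primitive match cannot hide a genuine repeat that overlaps it.
        pattern = re.compile(r"(?=(([ACGT]{%d})\2{%d,}))" % (unit, minimum - 1))
        last_end = 0
        for match in pattern.finditer(seq):
            start, tract, motif = match.start(), match.group(1), match.group(2)
            if start < last_end or not is_primitive(motif):
                continue
            last_end = start + len(tract)
            hits.append((start, motif, len(tract) // unit))
    return sorted(hits)


# Toy sequence containing an A(12) and an (AT)6 tract.
toy_seq = "GC" + "A" * 12 + "GC" + "AT" * 6 + "GG"
print(find_ssrs(toy_seq))
```

Counting the hits by motif length gives the mono- to hexanucleotide tallies of the kind reported above; changing the thresholds changes those counts, which is why they should always be reported alongside the results.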

Codon usage
The codon usage bias of the 20 amino acids and stop codons in D. caryophyllus is depicted by the RSCU value (Figure 7). The RSCU value of each synonymous codon exhibited significant variation. Nearly all of the amino acids in protein-coding regions contained synonymous codons, except methionine and tryptophan. Among these, arginine and leucine displayed the highest codon usage preferences, with identical RSCU values, followed by serine. The heat map of all 12 species reveals variations in codon usage preferences between species and varieties (Figure 8). Notably, more than half of the codons exhibited RSCU values greater than 1. Additionally, we observed a preference for codons ending with A or U over those ending with C or G.

Evolutionary rate analyses
In the evolutionary comparison of protein-coding genes with these 11 species (Figure 9), we observed that the Ka/Ks ratios among the nine D. gratianopolitanus varieties were similar. A significant proportion of protein-coding genes had Ka/Ks ratios less than 1, indicating that nonsynonymous substitutions occurred at a lower rate than synonymous substitutions and that these protein-coding genes underwent purifying selection. Among all these species, only four genes (atpB, ndhF, psbB, and ycf1) had a value greater than 1, indicating positive selection. By comparison, the ccsA of D. moravicus had a ratio greater than 1, whereas the ratios of the other species were slightly less than 1. A similar pattern was observed for rpoC2 of D. gratianopolitanus (dg1869) and D. gratianopolitanus (dg2134). Additionally, none of the protein-coding genes exhibited evidence of neutral evolution. It is worth noting that the Ka/Ks ratios of rpoA, which is associated with RNA polymerase, were notably high in the D. gratianopolitanus (dg1602), D. gratianopolitanus (dg3051), and D. gratianopolitanus (dg2769) varieties, indicating strong positive selection.

Phylogenetic analysis
Phylogenetic analysis is a common method for studying species evolution and phylogenetic classification, forming the core of our understanding of biodiversity, evolution, and genomics. The phylogenetic tree divided these species into three major evolutionary branches (Figure 10). Specifically, D. gratianopolitanus (dg2134), D. gratianopolitanus (dg1370), D. gratianopolitanus (dg2868), and D. longicalyx formed one evolutionary branch, with a 100% bootstrap value. Meanwhile, D. caryophyllus and D. gratianopolitanus (dg1869) were divided into two main branches that diverged from another evolutionary branch. Similar to the trend of the RSCU clustering, although most plants within a species exhibited similarity, some plants showed closer evolutionary relationships between species (Figure 10).

Conclusion
In this study, we analyzed the structure, size, number, and type of genes and the codon usage and repeat sequences of CpDNA for 12 Dianthus species, including nine different varieties of D. gratianopolitanus. The size and GC content of CpDNA were similar among Dianthus species. Notably, D. caryophyllus had the smallest CpDNA, with significant contraction, but possessed the largest LSC region. D. caryophyllus was an exception, having 83 protein-coding genes because of the lack of the rpl2 gene, whereas the other 11 species had all 84 protein-coding genes. Comparative analysis showed that the 12 CpDNAs of Dianthus species were highly conserved and that only a few sites of subtle variation existed. Gene analysis of the boundary regions showed that there were evident differences in the genes. The hypothesis is not supported by the data, as the differences within species were not less than the inter-species differences. More CpDNA data from Dianthus plants may be needed to explore molecular markers for distinguishing between different species.

Figure 1: Circular map of the chloroplast genome of D. caryophyllus. Genes drawn within the circle are transcribed clockwise, while genes drawn outside are transcribed counterclockwise. Genes belonging to different functional groups are color coded. Dark bold lines show the inverted repeats (IRa and IRb). The dashed area in the inner circle indicates the GC content of the chloroplast genome. The map was drawn with OGDRAW.

Figure 2: Visual alignments of chloroplast genome sequences using D. caryophyllus as the reference genome. Arrows indicate the annotated genes and their transcriptional direction, and genome regions are color coded as exon, tRNA or rRNA, CNS (conserved non-coding sequences), and mRNA.

Figure 3: Comparative circle graph of the similarity of the chloroplast genome sequences. The central ring represents D. caryophyllus as the reference genome. The rings from the inside to the outside represent the 11 closely related species. The outermost circle shows the protein-coding genes of D. caryophyllus.

Figure 5: Long repeat sequence analysis in the chloroplast genome: (a) number of repeats of the four types, (b) number of palindromic repeats by length, (c) number of forward repeats by length, and (d) number of reverse repeats by length.

Figure 7: RSCU values of the amino acids and stop codons in the chloroplast genome of D. caryophyllus. The color of the histogram corresponds to the color of the codon.

Figure 8: Heat map of the codon usage of the 12 species. Different colors represent different RSCU values: red indicates a higher RSCU value and blue indicates a lower RSCU value.

Figure 9: Ka/Ks comparison of protein-coding genes under selective pressure. The Ka/Ks values of the other, unlisted genes are zero.

Figure 10: Phylogenetic analysis of D. caryophyllus and 11 closely related species by the neighbor-joining method. The numbers indicate the NJ bootstrap values. The scale bar indicates the genetic distance.

Table 1: Summary of chloroplast genome features for D. caryophyllus and the other 11 Dianthus plants

in genome length, GC content, and the number of CpDNA genes, some differences persisted (Table 1). The CpDNA lengths of these nine varieties ranged from 149,474 bp (D. gratianopolitanus dg2868) to 149,735 bp (D. gratianopolitanus dg1602), with minimal variation. Moreover, the GC content of the nine CpDNAs ranged from 36.29 to 36.33%, with the IR regions showing relatively high GC content. D. gratianopolitanus (dg1869) had 124 genes, whereas the other eight varieties had 125 genes. All D. gratianopolitanus varieties contained 84 protein-coding genes. It is worth mentioning that although the IR region of D. caryophyllus was the shortest among all 12 CpDNAs, it exhibited the highest GC content. D. caryophyllus had the lowest number of protein-coding genes, totaling 83, and lacked the rpl2 gene. Conversely, the largest number of genes was observed for D. longicalyx, with 126 genes.

Table 2: List of genes encoded by the D. caryophyllus chloroplast genome