Functional divergence and adaptive selection of KNOX gene family in plants

Abstract KNOTTED-like homeodomain (KNOX) genes are transcriptional regulators that play an important role in morphogenesis. In the present study, a comparative analysis was performed to investigate the molecular evolution of the characteristics of the KNOX gene family in 10 different plant species. We identified 129 KNOX gene family members, which were categorized into two subfamilies based on multiple sequence alignment and phylogenetic tree reconstruction. Several segmental duplication pairs were found, indicating that different species share a common expansion model. Functional divergence analysis identified the 15 and 52 amino acid sites with significant changes in evolutionary rates and amino acid physicochemical properties as functional divergence sites. Additional selection analysis showed that 14 amino acid sites underwent positive selection during evolution, and two groups of co-evolutionary amino acid sites were identified by Coevolution Analysis using Protein Sequences software. These sites could play critical roles in the molecular evolution of the KNOX gene family in these species. In addition, the expression profiles of KNOX duplicated genes demonstrated functional divergence. Taken together, these results provide novel insights into the structural and functional evolution of the KNOX gene family.

KNOX genes have been isolated from many plants, such as Nicotiana tabacum [10], Arabidopsis thaliana [11,12], Solanum lycopersicum [13], Medicago truncatula [14], and Physcomitrella patens [15]. In most monocots, the KNOX1 gene is expressed only in shoot apical meristems (SAM) and not in the primordium. In compound-leaf species, KNOX1 are expressed in both SAM and the leaf primordium [16], showing that they may play a significant role in maintaining diversity in leaf morphology [3]. The KNOX2 gene regulates the morphological transformation of haploid to diploid cells in terrestrial plants [17].
In A. thaliana, the class I subfamily includes STM (SHOOT MERISTEMLESS), KNAT1, KNAT2, and KNAT6 [18], and the class II subfamily includes KNAT3, KNAT4, KNAT5, and KNAT7 [3,19], which are widely distributed. STM and KNAT1 are used to establish and maintain SAM. Similarly, KNAT6 has previously been shown to function in the maintenance of borders during SAM and embryogenesis [20]. Furthermore, KNAT1 promotes inflorescence development, while KNAT2 regulates flower type [8,18,21]. The class I gene STM regulates the development of the plant meristem in Arabidopsis [8], and regulation of gene expression leads to the petal spurs rapidly evolving in Antirrhinum [22]. In summary, the class I KNOX gene is involved in the morphogenesis of lateral organs and maintains the function of SAM and the diversity of leaf morphology [3,22].
Meanwhile, the KNOX class homeobox genes Oskn2 and Oskn3 in rice are both expressed in the tissues of the SAM and participate in the regulation of SAM formation. For instance, class II KNOX genes, such as KNAT3, KNAT4, and KNAT5, contribute to the differentiation of tissues in organs in Arabidopsis [9,23,24]. The regulatory network within which KNAT7 functions contributes to the negative regulation of Arabidopsis and Populus secondary cell wall biosynthesis [24,25]. The class II subfamily lacks phenotypic due to mutations; however, there have been relatively few previous studies. In brief, KNOX genes are involved in the growth and development of different tissues and organs in different species [26][27][28]. Plants must constantly adjust their physiological processes to adapt to changes in the external environment [29]. TFs are considered to be key targets for studying the molecular mechanisms of abiotic stress response because they, either alone or collectively, regulate the expression of many downstream target genes [30].
In the present study, we identified KNOX genes in different species and classified them by reconstructing phylogenetic trees. Then, we identified the critical amino acid sites responsible for functional divergence, positive selection, and co-evolution. Together with expression profiles, we present some insights into the molecular evolution of the KNOX gene family, which can be useful for future research on the functions of these genes.

Identification of plant KNOX gene family
Genes from the plant KNOX gene family were identified from 10 species that represented monocotyledonous, dicotyledonous, and bryophyte plants. The KNOX gene family members from the Arabidopsis genome were obtained from the TAIR database (http://www. arabidopsis.org/) and then BLAST searched as seed sequences in the Phytozome database (http://www. phytozome.org) to obtain homologous sequences from nine other species (Glycine max, Populus trichocarpa, Gossypium raimondii, Solanum lycopersicum, Oryza sativa, Brachypodium distachyon, Sorghum bicolor, Zea mays, and Physcomitrella patens). If the E value of the sequence was ≤1 × 10 −5 , then it was listed as a candidate sequence. The Pfam (http://pfam.xfam.org) and SMART (http://smart.embl-heidelberg.de/) online tools were used to determine whether the candidate sequence contained the KNOX1, KNOX2, ELK, and HD to ensure that the sequence domain was intact and used for the next analysis. In addition, coding sequences, protein sequences, and genomic sequences of KNOX family members were downloaded from the Phytozome database. The physicochemical properties of the KNOX gene family were obtained from the ExPASy database (https://www.expasy.org/), including the amino acid number, isoelectric point (PI), and molecular weight (MW) of the protein [31].

Phylogenetic tree construction
Multiple sequence alignment of the sequences from KNOX family members was performed with the MUSCLE program [32,33]. Three methods were used to construct the phylogenetic tree: Bayesian phylogenetic trees in MrBayes 3.2.5 [34] and neighbor-joining (NJ) and maximum-likelihood (ML) trees in MEGA 7.0 [35]. The reliability of interior branches was assessed with 1,000 bootstrap samples [35].

Exon-intron structure and motif analysis
The exon-intron structure was analyzed using the online tool GSDS (http://gsds.cbi.pku.edu.cn/) with the coding sequences (CDS) and genomic sequences of KNOX family members [36]. Conserved motifs of KNOX family members were identified using the online tool MEME (http://meme-suite.org/tools/meme) [37]. The maximum number of motifs = 10, and the remaining parameters were set to the default settings.

Functional divergence analysis
DIVERGE 3.0 was used to detect the functional divergence between clusters of the KNOX gene family [46]. The extent of divergence can be measured using the type I (site-specific altered selective constraints) and type II (radical shift in amino acid physiochemical properties) functional divergence coefficients (θI and θII) between subfamilies [47][48][49]. Moreover, Bayesian posterior probability (Q k ) can detect specific amino acid sites where functional divergence has occurred. In our study, the threshold of Q k was set to 0.9.

Positive selection analysis
Positive selection was investigated using the maximum likelihood approach in the CODEML procedure in PAML [50,51].
Site models, including null models (M0 and M3) and alternative hypothesis models (M7 and M8), were implemented in this program. A detailed description of the positive selection site test method can be found in Wang et al. [52].

Coevolution analysis
Coevolution analysis using protein sequences (CAPS) was performed with PERL-based software [53]. A detailed description of the coevolution sites test method can be found in Song et al. [54].

Protein structure prediction
The 3D structure of the KNOX protein was predicted using online software PHYRE2 (http://www.sbg.bio.ic.ac.uk/ phyre2/html/page.cgi?id=index) [55]. We used protein sequences to construct the 3D structure of KNOX family member AT4G08150 and then to screen important amino acids sites that were labeled on the 3D structure.

Expression analysis of KNOX genes
RNA-Seq data were introduced to further analyze the expression of plant KNOX genes. The Arabidopsis eFP Browser (http://bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi) tool and the rice eFP Browser (http://www.bar.utoronto. ca/efprice/cgi-bin/efpWeb.cgi) tool were used to search data from Arabidopsis and rice, respectively. A heat map was generated using the TBtools program [56].

Identification of KNOX gene family
Nine KNOX genes of Arabidopsis were obtained from the TAIR database. The gene AT1G14760 was not analyzed because it contains only two KNOX domains, which belong to the KNATM class, that were not included in this analysis. Furthermore, 121 candidate KNOX gene sequences from nine other species were obtained through BLAST searches in the Phytozome database ( Table A1). The Pfam [57] and SMART [58] online tools were used to ensure the wholeness of the domains (KNOX1, KNOX2, ELK, and HD). The results showed that the domains were complete, and a total of 129 typical KNOX family members were identified. The relevant information groups, gene IDs, protein lengths, isoelectric point of the deduced polypeptides, and molecular weight of the KNOX genes are listed in Table A2. The average number of amino acid residues ranged from 215 to 636 (average 345), and the isoelectric point of the KNOX gene family ranged from 5.12 to 9.53 (average 6.62). Except for GRMZM2G433591, all other members were weakly acidic. The average molecular mass of the KNOX family ranged from 28

Phylogenetic analysis of the KNOX gene family
To investigate the phylogenetic relationships of KNOX genes in 10 plant species (Arabidopsis thaliana, Glycine max, Populus trichocarpa, Gossypium raimondii, Solanum lycopersicum, Oryza sativa, Brachypodium distachyon, Sorghum bicolor, Zea mays and Physcomitrella patens), we used the MUSCLE software [32,33] to perform multiple sequence alignments of 129 protein sequences and then used three methods to construct phylogenetic trees: neighbor-joining ( Figure A1), maximum-likelihood (data not displayed), and Bayesian inference [34,35] (Figure 1a). According to the results, the topology of the three methods was consistent; the subsequent study uses the Bayesian phylogenetic tree. The 129 homeobox genes from 10 species, including monocots and dicots and P. patens, were divided into two subfamilies: class I and class II [59]. The sequence analysis and expression pattern analysis supported this result, which was also consistent with other previous research [9]. The class I subfamily includes 80 members, while the class II subfamily includes 49 members. This difference may be due to the method of gene amplification, which leads to the difference in the number of subfamily members. Figure 1b shows the exon-intron structure of KNOX gene family members analyzed by the GSDS online system [60]. Most members of the subfamily contained five exons. The number of exons was conservative within the subfamily, and there was no significant difference in the number of exons in the same subfamily. Of the class I members, 78.75% contained five exons, whereas 13 members contained four exons (Figure 1b), and 91.84% of the members of the class II subfamily contained five exons. The number of exons in P. patens was significantly different. For example, Ppls154_83V6 contained 10 exons, indicating that exons may have been lost during evolution to adapt to the environment.
The conserved domains of the KNOX gene family were analyzed using the MEME online tool [37], and 10 conserved motifs were obtained that were named motif1-motif10. The structure and order of these motifs in the two subfamilies are shown in Figure 1c. Motifs 8, 4, 3, 2, and 1 were widely distributed among species in that order, except for P. patens, which did not contain motif 8. However, there were also slight differences between different members of the same plant. Most members of the class I subfamily contained motifs 8, 4, 3, 6, 2, and 1 arranged in that order, and most members of the class II subfamily contained motifs 8, 4, 10, 3, 5, 9, 7, 2, and 1 arranged in that order. Differences in motifs may be an important cause of functional divergence in the two subfamilies.

Expansion analysis of plant KNOX gene family
Gene duplication is a major driving force of adaptive evolution in species [61]. In this study, we investigated the gene duplication mode of the KNOX gene family and mainly studied tandem duplication and segmental duplication. Tandem duplication gene pairs were detected in only three species; all of which were members of the monocotyledonous of the class I subfamily. Furthermore, segmental duplication genes were clearly detected in eight species (Table A3). Of the segmental duplication gene pairs, 93% were detected in dicotyledons, whereas only four KNOX segmental duplication gene pairs were detected in monocotyledons. We found that the class I subfamily contained both segmental and tandem duplication genes, which may explain the higher number of genes in class I than class II. In dicotyledonous plants, genes are mainly amplified through segmental duplication; in monocotyledonous plants, tandem duplication and segmental duplication coexist. To estimate the approximate time of segmental duplication events, the base synonymous mutation rate (K s -values) was used [38] ( Table A3). The results showed that the segmental duplication events of most species were consistent with the large-scale duplication events, and segmental duplications were preserved after genome duplication.

Functional divergence analysis of KNOX gene family
To determine the difference in the evolutionary rates and physicochemical properties of amino acid sites, the type I and II functional divergence of the two subfamilies was estimated using DIVERGE [48,62]. Key amino acid sites for functional divergence were determined based on posterior probability (Q k ). The results in Table 1 show that the divergence coefficients of type I of the two subfamilies were significant (θI = 0.442 ± 0.052; LRT = 71.696; P < 0.01), indicating that the amino acid sites between the two subfamilies have different evolutionary rates. Meanwhile, the type II coefficients of the two subfamilies were also significant (θII = 0.106 ± 0.186; P < 0.01), indicating the possible presence of type II divergence sites during evolution between the two subfamilies. Furthermore, the amino acid sites were analyzed between groups under stringent conditions (Q k > 0.9) to confirm the amino acid sites where functional divergence had occurred [48].
The results identified 15 sites with a high probability of being associated with type I functional divergence. There were 52 type II functional divergence sites ( Table 2), more than twice the sites identified for type I, of which eight points (140A, 155Q, 170A, 223I, 224R, 283H, 286K, and 345Q) occurred in both type I and type II functional divergence, indicating that they underwent changes in evolutionary rates and physicochemical properties simultaneously. Therefore, these sites are expected to play an important role in functional differences during evolution. Apart from this, the number of type I and type II functional divergence sites was different, and more critical amino acid sites were identified as type II functional divergence within each subfamily. Hence, the functional divergence between genes of the two subfamilies was attributed primarily to rapid changes in amino acid physiochemical properties, followed by a shift in evolutionary rates.

Positive selection and co-evolution in
KNOX gene family The site model was selected to determine the selection pressure on different amino acid codon sites [51]. The results are shown in Table 3. The selection pressure was significantly different between M0 (one-ratio) and M3 (discrete; P < 0.01). The M3 model was better than the M0 model, indicating that different sites experience different selection pressures. The 2Δ ln L of M7 (beta) vs. M8 (beta & ω > 1) was 5795.66, the likelihood ratio test result was extremely significant (df = 2, P < 0.01), and the M8 model had an ω value of 2.63459, much >1, indicating that 14 amino acid positions were strongly affected by positive selection. Table 3 shows the positive selection sites with a posterior probability >95%. Among them, 143H, 171R, and 228S were significant positive selection sites, and 130D, 133A, 134M, 140A, 149Q, 165D, 172Q, 232M, 315K, 318T, and 322L were extremely significant positive selection sites. We used CAPS, which is significantly more sensitive than other methods, to analyze coevolved amino acid residues in the KNOX gene family [53]. We found two groups of coevolved sites: 248S and 249D, and 382L and 383Y. All sites were labeled according to their 3D structure to further investigate their interdependence (Figure 3). Note: θI and θII, the coefficients of type-I and type-II functional divergence between class I and class II. LRT: likelihood ratio test. **P < 0.01, highly significant. Q k : posterior probability. 3.6 Three-dimensional structure prediction and critical amino acid site identification of plant KNOX proteins We used PHYRE2 to predict the 3D structure of the KNOX family member AT4G08150 [55,63]. The critical amino acid sites were displayed by the multiple sequence alignment and 3D structure (Figures 2 and 3). These 14 sites were mainly dispersed on the KNOX1 domain, two positive selection sites were distributed on the KNOX2 domain, and three positive selection sites were distributed on the HD first alpha helix. The results indicated that the KNOX1 domain was more susceptible to positive selection pressure during the evolution of the KNOX gene family. Amino acid position 140A has undergone both functional divergence and positive selection and   The amino acid sites of type I and II functional divergences, positive selection, and co-evolution are labeled, respectively, with blue circles, red circles, green triangles, and black boxes. Blue and green frames indicate the first α-helix and PYP loop, respectively.
was located in the KNOX1 domain and at the C-terminus of motif 8 ( Figure 2). The two pairs of co-evolutionary sites we detected were marked on the 3D structure ( Figure 3). We found that two sets of positive selection sites were located on the C-terminal non-functional domain and were close to each other, showing that they may play a certain role in maintaining the spatial structural stability of KNOX proteins.

Expression analysis of KNOX gene family
To investigate the expression patterns of homologous KNOX genes in subgroups involved in plant growth and development, a heat map was constructed using TBtools (Figures 4 and 5). Members of the same subfamily exhibited similar transcription abundance profiles; however, there were also members that had similar expression profiles but unique phylogenies, such as AT1G62990. It should be noted that AT4G08150 and AT1G70510 belong to the same subfamily (class I). They are expressed at high levels in the pedicels, hypocotyls, and stem but at lower levels in cotyledons and leaves. From the overall expression level, the higher expression levels of AT1G23380 and AT1G62360 in shoots may be related to their indispensability for the formation and maintenance of SAM [29]. These results suggested that members in the same subfamily may play similar roles in the same organization. AT5G11060, AT5G25220, and AT4G32040 are from the class II subfamily. Their overall transcription was richer than that of the class I subfamily. The expression level was higher in senescing leaves. AT4G32040 was highly expressed in dry seeds. AT5G11060 was more highly expressed in leaves and different stages of flowers (Figure 4). LOC_Os02g08544 and LOC_Os06g43860 are also from the class II subfamily, and both of them were highly expressed in all tissues and organs ( Figure 5). Three members, LOC_Os02g08544, LOC_Os06g43860, and LOC_Os08g19650, were highly expressed in mature and young leaves, whereas the other members exhibited relatively low expression levels. In addition, the expression of LOC_Os03g51690, LOC_Os07g03770, and LOC_Os05g03884 was higher in the seeds but lower in the leaves, which indicates that sub-functionalization had occurred.

Genomic analysis of the KNOX gene family
In the present study, we isolated 129 candidate KNOX gene sequences after removing incomplete and redundant sequences from 10 different species. There were nine from Arabidopsis [64] and eight from Solanum lycopersicum [65], which are consistent with the results of previous studies. Genome-wide analysis showed that the KNOX gene family was divided into two subfamilies: class I and class II (Figure 1). Both subfamilies contain monocots, dicots, and P. patens. Class I subfamily KNOX genes are similar to zmkn1 and are mainly expressed on the SAM of monocots and dicots [3,66]. According to previous studies, only one KNOX gene had evolved before the emergence of terrestrial plants, indicating that KNOX genes originated during the divergence of the last common ancestor of moss and vascular plants.
KNOX genes are divided into four domains: KNOX1, KNOX2, ELK, and HD [64,67]. The KNOX1 domain has negative regulatory effects on the transcription of target genes. The KNOX2 domain mediates the interaction between KNOX and members of the BELL gene family. The HD consists of three helices and is conserved in eukaryotes and is involved in DNA binding [9,68]. Members in the same subfamily contain similar numbers of exons and introns, except for the number of exons in P. patens, which may also be due to the absence of exons for functional adaptation during evolution. Intriguingly, most members of the KNOX gene family contained five identical conserved motifs, except that P. patens does not contain motif 8. A high degree of sequence identity and similar exon-intron structures of KNOX genes across families suggests that the KNOX family has undergone gene duplication events throughout evolution.
Although repeated genes may have evolved few novel functions, they play an important role in the origin of species and the evolution of biological functions [42,69]. Gene duplication plays a significant role not only in the process of genome rearrangement and expansion but also in the diversification of gene functions and the large number of gene families [70]. Segmental duplication, tandem duplication, and transposition events, such as retro and replicative transposition, are the three main forces that drive the expansion of gene families [61,70]. Transposition events are difficult to identify based on sequence analysis alone; therefore, we focused on segmental and tandem duplication events. The results of the present study indicated that tandem duplication was detected in three species from the class I subfamily, indicating that the genes produced by tandem duplication did not undergo functional divergence (Table A3). Segmental duplication was detected in eight species from both subfamilies. Monocotyledonous plants had both tandem and segmental duplication, and dicotyledonous plants had only segmental duplication, especially, most of the KNOX genes in soybean and cotton. Therefore, the main amplification method of soybean and cotton is segmental duplication. Segmental duplication is likely to have played a pivotal role in KNOX gene expansion in dicots. In addition, the number of segmental duplication events in the two subfamilies was similar, and only segmental duplication occurred in the class II subfamily, which may be the reason for the larger number of class I subfamily members. Class I subfamily members are isolated from different angiosperms. They are expressed in meristems and not in differentiated tissues or organs that are related to maintaining the properties of meristems [9]. Large-scale duplication may have also been involved in the expansion of the KNOX gene family.

Functional divergence, positive selection, and co-evolution analysis
We selected the 3D structure of the Arabidopsis KNOX protein AT4G08150 for observation and marked the detected sites on the predicted 3D structure. The functional diversity between different subfamilies was mainly determined by specific amino acids in the subfamily, and the major reason for the functional divergence in repeated genes was possibly owing to the accumulation of amino acid mutation sites [42,71,72]. Type I and type II functional divergences between gene clusters of KNOX subfamilies were estimated by posterior probability analysis. By analyzing the functional divergence of the KNOX gene family, we identified a total of 15 type I functional divergence sites and 52 type II functional divergence sites from two subfamilies ( Table 2). This result indicates diversification in the evolutionary rates of specific amino acid sites or significant changes in the physicochemical properties of amino acids. There are far more type II than type I functional divergence sites, which indicated that type II functional divergence is dominant. Besides, eight amino acid sites were identified as both type I and type II functional divergence sites, indicating that they had undergone simultaneous shifts in evolutionary rates and physicochemical properties. However, the lack of significant differences in the degree of functional divergence between most subfamily pairs suggests that genes belonging to these subfamilies might perform similar functions. Positive selection has been associated with gene duplication and functional divergence. Here, we used computer simulations to evaluate the performance of Bayesian predictions for amino acids under positive selection [50]. However, the functions of most amino acid residues were conserved, and only a few amino acid sites can function in molecular adaptation [73]. In the present study, we detected 14 positive selection sites, three significant positive selection sites, and 11 extremely significant selection sites ( Table 3). The positive selection sites were mainly distributed in the KNOX1 domain, indicating that the KNOX gene family had been subjected to different selection pressures during evolution. Interestingly, we found that the site 140A experienced both type I and type II functional divergences and positive selection and may have an important role. The complexity of protein evolution is directly proportional to the potential function and structure of interactions between co-evolving sites within the molecule, and co-evolving amino acid sites interact between complex functional domains of a protein [74]. Detection of coevolving sites will provide important evidence for the study of the mechanisms underlying molecular evolution. A coevolution analysis of the KNOX gene family detected two sets of adjacent co-evolving sites: 248S and 249D, and 382L and 383Y, both of which were in the c-terminal domain. Based on the analysis of the 3D structure (Figure 3), the coevolving sites are closer in the 3D structure. Thus, the interaction of these sites may have a stabilizing effect on the spatial structure of KNOX proteins.

Expression analysis of KNOX gene family
Most KNOX gene families exhibited variable expression levels in different tissues and organs (Figures 4 and 5). Class I KNOX genes are mainly expressed in the SAM and the class II genes are more widely expressed [9], as can be seen in the heatmap, which shows members of the class II subfamily expressed in most tissues and organs. The expression of KNOX genes mainly in the shoot may be related to important role within the SAM [29]. For example, the expression of KNOX genes in the shoot was upregulated in Arabidopsis. In addition, during the evolution of angiosperms, the KNOX1 gene was involved in the control of leaf shape. The expression pattern of KNOX1 in the primordium of a leaf is highly related to the shape of the leaf. We suspect that the key amino acid sites in the KNOX1 domain may be related to the expression of KNOX1 [16]. The expression of genes in different tissues reflects the diversity of functions. In summary, expression profiles of KNOX family members are largely organ specific, indicating that KNOX genes are differentially expressed in different groups and that regulatory regions of KNOX genes may have diverged. Importantly, the results also demonstrate divergence in the expression of KNOX duplicated genes during evolution.

Conclusions
In the present study, a total of 129 KNOX family members were identified from 10 species through extensive analysis of gene families, which were divided into two subfamilies by phylogenetic analysis. Monocots and dicots were amplified differently. Both tandem and segmental duplication are found in monocotyledonous plants, whereas dicotyledonous plants only have segmental duplication. Gene replication provides the main driving force for adaptive evolution of species. The large proportion of type II functional divergence that occurred indicated that the mode of functional divergence for KNOX proteins mainly relates to changes in the physicochemical properties of amino acids. The sitespecific model analysis revealed that the KNOX gene family contains 14 positive selection sites, mainly located in the KNOX1 domain, which suffers from strong positive selection pressure. Two pairs of amino acid sites close to each other in 3D structure were identified by coevolutionary analysis, indicating that they may play a key role in the stability of KNOX protein structure and function. Furthermore, KNOX genes exhibited different expression profiles in different organs as well as different functions. Our study provides a deeper understanding of the structural and functional evolution of the KNOX gene family and provides a basis for further research on KNOX proteins. Figure A1: Neighbor-joining phylogenetic tree of the KNOX gene family. Two branches of different colors represent different subfamilies: blue represents class I KNOX genes, and red represents class II KNOX genes.