A cross-sectional overview of SARS-CoV-2 genome variations in Turkey

Introduction: Nearly a year after the emergence of COVID-19 in Turkey, we analysed SARS-CoV-2 sequences to identify virus genome variations and their probable impact on epidemiology, immune response and clinical disease. Materials and Methods: Complete genomes and partial Spike (S) region sequences originating from Turkey were accessed from the Global Initiative on Sharing Avian Influenza Data (GISAID) database. The genomes were aligned and analysed for variations and recombinations using appropriate software. Results: A total of 410 complete genomes and 206 S region sequences were included. Overall, 1200 distinct nucleotide variations were noted. The mean variation count was 14.2 per genome and increased significantly during the course of the pandemic. The most frequent variations were identified as A23403G (D614G, 92.9%), C14408T (P323L, 92.2%), C3037T (89.8%), C241T (83.4%) and GGG28881AAC (RG203KR, 62.6%). The A23403G mutation was the most frequent variation in the S region sequences (99%). The majority of the genomes (98.3%) belonged to the SARS-CoV-2 haplogroup A. No evidence for recombination was identified in genomes representing sub-haplogroup branches. The variants of concern B.1.1.7, B.1.351 and P.1 were detected, with a statistically significant time-associated increase in the prevalence of variant B.1.1.7. Discussion: We described prominent SARS-CoV-2 variations as well as comparisons with global virus diversity. Continued molecular surveillance in agreement with local disease epidemiology appears to be crucial, as vaccination and mitigation efforts are ongoing.


Introduction
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative agent of Coronavirus Disease 2019 (COVID-19), the pandemic of the twenty-first century [1,2]. Initially emerging as cases of atypical viral pneumonia in the city of Wuhan, Hubei province, China, SARS-CoV-2 has readily spread around the globe, currently affecting millions of people in more than 220 countries.
More than a year after its official announcement, controlling the pandemic and mitigating its health, economic and social impact have become the foremost priority for all countries and international organizations.
SARS-CoV-2, classified in the Betacoronavirus genus of the family Coronaviridae, is an enveloped particle with a positive-stranded RNA genome [3]. The viral genome is approximately 29.9 kb in size and flanked by untranslated regions (UTRs) at the 5' and 3' ends. Virus proteins comprising the structural components nucleocapsid (N), envelope (E), membrane (M) and spike (S), as well as several accessory and non-structural proteins (NSPs), are encoded by several open reading frames (ORFs) on the genome [4,5]. Variations occur spontaneously during SARS-CoV-2 replication, but the mutation frequency is relatively low compared to other RNA viruses, owing to the activity of the nsp14 exoribonuclease [6,7]. The exploration of SARS-CoV-2 genetic diversity has been pivotal for control efforts since the beginning of the pandemic. Sequencing of the virus genome has provided critical information for investigating the origin and spread of the infection as well as for developing appropriate diagnostics. It further enables surveillance of virus molecular epidemiology and identification of variants that can undermine mitigation and vaccination efforts [8,9]. Aided by the widespread availability of next-generation sequencing technologies and online sharing of genome information, a significant number of virus genomes originating from several countries affected by the pandemic have now become available. As vaccination efforts expand in many countries, monitoring of virus genetic diversity will continue to be crucial to identify variants with enhanced transmissibility or clinical disease as well as potential escape mutants [9]. In this study, we aimed to evaluate SARS-CoV-2 genomic diversity in Turkey since the emergence of the initial cases, to gain insights into origin, patterns of variation and their probable impact on epidemiology.
To incorporate recent data in the analysis, we further screened spike region sequences uploaded to the database at a later date and accessed a total of 206 non-redundant sequences, collected between 30.04.2020 and 03.02.2021 but mostly during January 2021 (170/206) (Supplement 1). Descriptive statistics, data distribution and correlations were analysed using Analyse-it v4.20.1 (Analyse-it Software Ltd., Leeds, United Kingdom).

Results
Among the 410 SARS-CoV-2 genomes included in the study, 261 (63.66%) originated from specimens collected during March-June 2020, 27 (6.58%) in July-October 2020 and 122 (29.75%) in December 2020-January 2021. Age, gender, residence and other demographic or clinical features of infected cases were not evaluated due to the lack of sufficient information in the database.
Complete genome sequencing revealed a total of 1200 different nucleotide variations (Table 1). The majority of the variations were missense (661/1200, 55.1%) or silent mutations (441/1200, 36.7%). The variations were frequently located in the ORF1a/1b region (684/1200, 57%), which encodes the proteins and cofactors participating in viral replication, as well as in the S region (153/1200, 12.5%), which is involved in virus attachment to the respiratory epithelium and is the main target for neutralizing antibodies (Table 1).
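The variation labels used in this study combine a reference base, a genome position and an alternative base (for example, A23403G). As a minimal sketch of how such a label can be related to a codon in the S region, the following assumes that positions follow the Wuhan-Hu-1 reference genome (NC_045512.2), in which the spike coding region spans nucleotides 21563-25384. The function names `parse_substitution` and `spike_codon` are hypothetical and are not taken from the software used in the study.

```python
import re

# Assumption: coordinates follow the Wuhan-Hu-1 reference (NC_045512.2),
# where the spike (S) coding region spans nucleotides 21563-25384.
S_START, S_END = 21563, 25384


def parse_substitution(label):
    """Split a label such as 'A23403G' into (ref_base, position, alt_base)."""
    match = re.fullmatch(r"([ACGT])(\d+)([ACGT])", label)
    if match is None:
        raise ValueError(f"not a single-nucleotide substitution: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def spike_codon(position):
    """Return (codon_number, base_in_codon), both 1-based, for a position in S."""
    if not S_START <= position <= S_END:
        raise ValueError(f"position {position} lies outside the S region")
    offset = position - S_START
    return offset // 3 + 1, offset % 3 + 1


ref, pos, alt = parse_substitution("A23403G")
codon, base_in_codon = spike_codon(pos)
# A23403G falls on the second base of spike codon 614, matching D614G.
print(ref, pos, alt, codon, base_in_codon)
```

Multi-nucleotide labels such as GGG28881AAC are deliberately rejected by this parser; handling them would require a wider pattern and per-gene coordinates for regions other than S.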
The median variation count per genome was 12 (mean ± standard deviation: 14.2 ± 6.5, range: 4-36). The temporal distribution of the total variation count per genome according to sampling date revealed a positive correlation and a statistically significant increase during the course of the pandemic (Figure 1).
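The correlation between sampling date and variation count was assessed with the Spearman test (Figure 1). As an illustration only, the following sketch computes a Spearman rank correlation coefficient as the Pearson correlation of tie-averaged ranks. The study itself used Analyse-it; the function names and the toy data below are hypothetical and do not reproduce the study dataset.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: days since the first sample and variations per genome.
days_since_first_sample = [0, 30, 60, 120, 200, 280, 300]
variation_counts = [5, 7, 6, 11, 14, 22, 25]
rho = spearman(days_since_first_sample, variation_counts)
print(round(rho, 3))
```

A rank-based coefficient is a reasonable choice here because variation counts need not increase linearly with time; the significance test (p value) is not covered by this sketch.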

Discussion
In this study, we provide a cross-sectional overview of the SARS-CoV-2 genome variations from Turkey and their potential impact. The genomes were accessed from the GISAID database and originated from specimens collected during a 10-month period in 2020 and 2021. As an update, we further included additional S region sequences submitted after our initial database searches, representing viruses circulating in early 2021. The findings are therefore based on these datasets of 410 complete and 206 partial (spike) sequences, constituting the most comprehensive analysis performed in Turkey so far [23-25].
In the complete SARS-CoV-2 genomes, we identified 1200 individual nucleotide variations, with a median count of 12 (range: 4-36) per genome (Table 1). Moreover, the temporal distribution of the variations indicated a statistically significant accumulation of variations during the 10-month period examined (Figure 1). Comparable findings were reported in globally distributed virus isolates, with more than 3000 specific point mutations being detected and an increased frequency of variation during the course of the pandemic [8]. However, SARS-CoV-2 isolates from Turkey were proposed to exhibit an elevated variation rate in a study focusing on 166 virus genomes accessed during July 2020 [25], where the frequently detected variations C14408T and C18877T, affecting the viral polymerase (nsp12) and exoribonuclease (nsp14), respectively, were suggested as possible precipitating factors [27,28]. These variations were also noted in our study at varying rates (Table 2, Figure 2). In addition, the co-detection of the C14408T and A23403G variations was suggested to be associated with increased diversity [26,27]. In parallel with global isolates, the SARS-CoV-2 genome variations in Turkey are mostly missense or silent mutations, frequently located in ORF1a/1b, which encodes the enzymes and co-factors participating in replication, or in the S region of the virus genome [8,17,26].
The most frequently detected variations in the study, namely the A23403G, C14408T and GGG28881AAC mutations resulting in amino acid substitutions in the corresponding virus proteins, were reported in previous analyses from Turkey [23][24][25]. However, they seem to be positively selected in the local virus population pool, as their abundance seems to be elevated. For example, the A23403G variation was reported at rates as low as 56.2% in previous reports, whereas it was detected in 92.9% of the complete genomes and 99% of the S region sequences in this study (Figure 2, Figure 4). This observation is also evident in global genome data, where viruses with the A23403G and C14408T variations steadily increased in frequency during the course of the pandemic and became the majority in late 2020 [8]. The amino acid substitutions occurring as a result of these variations, namely P323L, D614G and RG203KR, were also associated with a more severe COVID-19 clinical presentation [8]. Moreover, the D614G mutation, a defining component of the variant of concern B.1.1.7, is also likely to affect immune responses to the S protein. In addition to the high frequency of this substitution in the study, we further identified other variations that might affect T and B cell epitopes, albeit at lower rates.
Throughout the pandemic, the availability of virus genome sequences and powerful online tools has enabled nearly real-time monitoring of SARS-CoV-2 molecular epidemiology [8,17]. Previous reports on SARS-CoV-2 genetic diversity have described particular virus lineages and clades, mostly in overall agreement but lacking a uniform nomenclature [8,29]. The size of the accumulating sequence data further warrants more practical approaches than standard phylogenetic reconstruction to indicate phylogeographic relationships. Here, we adopted a previously reported mutation-annotated reference strategy to describe the intraspecific phylogeny of SARS-CoV-2. We observed that the majority of the SARS-CoV-2 genomes from Turkey belong to haplogroup A (98.3%), with the main subhaplotype diversification into A2 (Figure 3). SARS-CoV-2 haplogroup A isolates constitute the ancestral node and the predominant clade across the world. They are frequently represented in isolates from Europe (97%), Africa (93%) and Asia (77%), but are less frequent in South America (68%) and North America (53%) [8]. Among global haplogroup A subclades, A2 and A2a appear as the majority, with phylogeographic inferences indicating a European origin. We observed haplogroup B at a much lower frequency, with the B4a subhaplotype representing the majority within this group (Figure 3). Haplogroup B viruses have been identified on all continents, with higher prevalence in North America (47%), South America (32%), Asia and Oceania (23%), and with all major and minor subclades present in Asia [8]. The haplogroup B viruses were likely introduced into Turkey by travel to endemic regions, and further local spread was presumably prevented by isolation. Our analyses employing highly diversified subclades of each main haplogroup failed to identify any evidence for recombination among local SARS-CoV-2 genomes.
Overall, the findings on virus genome diversity in Turkey suggest several introductions originating from multiple sources and subsequent local adaptation, also noted in previous reports using smaller datasets [23].
The emergence and rapid spread of SARS-CoV-2 variants have raised significant concern, due to their potential for enhanced transmissibility, altered clinical progression and escape from the protective immune response induced by previous infection or widely available vaccines [20]. Dubbed variants of concern (VOCs), these viruses exhibit a wide array of amino acid changes accumulated in several regions of the virus genome including the spike protein [20][21][22]. The rapid spread of particular VOCs in several countries during fall 2020 called for more stringent public health measures as well as targeted monitoring, which has also been initiated and is currently ongoing in Turkey. We detected three major VOCs in the study group, with increased prevalence of B.1.1.7 and B.1.351 in the recently dated dataset (Table 3).
Moreover, the detection of P.1 in this group suggests not only an elevated prevalence but also a broader repertoire of variants in the population. These findings justify the efforts to identify and monitor known and potentially emerging virus variants. Particular limitations of this study need to be addressed. An important issue is the heterogeneity in the temporal and spatial distribution of the samples employed for genome sequencing, which suggests the lack of an organized sampling strategy for screening. In addition, missing demographic and location data in many instances prevented further evaluations. Therefore, it is not possible to assess whether the current dataset fully represents the epidemiology and diversity of circulating viruses in Turkey. A continuous and organized surveillance strategy, in conjunction with local transmission dynamics and infection epidemiology, will provide a better understanding of SARS-CoV-2 molecular epidemiology in Turkey.
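As an illustration of how screening of spike sequences for known variants might be organized within such a surveillance strategy, the following sketch matches observed spike amino acid substitutions against simplified signatures of the three VOCs discussed above. This is an assumption-laden example: the signatures are deliberately partial (established lineage assignment relies on many more sites across the genome), the name `screen_variant` is hypothetical, and the source does not state that variants were identified in this manner.

```python
# Simplified, partial spike signatures. Assumption: a real lineage call
# uses many more sites, including deletions and non-spike mutations.
VOC_SIGNATURES = {
    "B.1.1.7": {"N501Y", "A570D", "P681H"},
    "B.1.351": {"K417N", "E484K", "N501Y"},
    "P.1": {"K417T", "E484K", "N501Y"},
}


def screen_variant(substitutions):
    """Return the VOC names whose full signature is present in the input."""
    observed = set(substitutions)
    return sorted(
        name
        for name, signature in VOC_SIGNATURES.items()
        if signature <= observed  # every signature site must be observed
    )


# Hypothetical spike substitution lists for three sequences.
print(screen_variant(["D614G", "N501Y", "A570D", "P681H"]))
print(screen_variant(["D614G", "K417T", "E484K", "N501Y"]))
print(screen_variant(["D614G"]))
```

Requiring the complete signature keeps the sketch conservative: a sequence carrying only N501Y, which is shared by all three variants, is not assigned to any of them.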
In conclusion, in this analysis of complete and partial SARS-CoV-2 genome sequences covering almost the first year since emergence, we described the main variations associated with epidemiology and immune response, with the observation of an increased incidence of VOCs in Turkey. With the ongoing pandemic and accelerated vaccination campaigns, such investigations should be performed periodically for precise screening and coordination of control measures.

Declarations
Funding: No funding was received for the study.
Conflicts of interest/Competing interests: The authors have no conflicts of interest to declare with third parties.
Availability of data and material (data transparency): All data used in the manuscript are available online or provided in the submission.
Ethics approval: The study and its findings are based on virus genome sequences obtained from online sources; therefore, no institutional ethics board approval was required or sought. The study was approved by the COVID-19

Figure 1
Temporal distribution of the total variation counts in the study. The Spearman test was used to calculate the correlation coefficient. Red lines indicate 95% confidence intervals. A time-related, statistically significant increase in total variation count per genome was observed (r=0.776, p<0.001).

Figure 2
Distribution of the frequently-observed variations in complete SARS-CoV-2 genome dataset.

Figure 3
Distribution of the SARS-CoV-2 haplogroups in the study group. Branch-delineating variations and genome frequencies are indicated.

Figure 4
Distribution of the frequently-observed variations in spike region dataset.
