Concentration of inverted repeats along human DNA

Abstract This work aims to describe the observed enrichment of inverted repeats in the human genome, and to identify and describe, with detailed length profiles, the regions with significant and relevant enrichment in inverted repeats. The enrichment is assessed and tested with a recently proposed z-score-based measure. We simulate a genome using an order-7 Markov model trained on the real genome. The simulated genome is used to establish the critical values that serve as decision thresholds for identifying the regions with significantly enriched concentrations. Several human genome regions are highly enriched in inverted repeats, and this is observed in all human chromosomes. The distribution of inverted repeat lengths varies along the genome. The majority of the regions with severely exaggerated enrichment contain mainly short inverted repeats. There are also regions with regular peaks along the inverted repeat length distribution (periodic regularities) and, less frequently, regions with exaggerated enrichment for long lengths. However, adjacent regions tend to have similar distributions.


Introduction
Non-B DNA structures play important roles in biological processes (see, for example, [1][2][3][4][5][6]). This work focuses on the lengths of potential hairpin/cruciform non-B DNA structures. Various studies in the literature argue that cruciform structures are a common DNA feature important for regulating biological processes and that genomes contain a remarkable number of inverted repeats in a non-random distribution (see [7] for a review).
Cruciform structures are formed by inverted repeats, which are composed of a single-stranded sequence of nucleotides followed downstream by its reverse complement. There are several procedures and computational approaches to study inverted repeats and to describe the regional variation of inverted repeat lengths [8][9][10].
In this work we have developed a simulation scenario in order to identify the regions with exaggerated enrichment, both globally (over all lengths) and by length class. We also present an exploratory analysis of inverted repeat enrichment in the human genome, both globally (over the whole genome) and in each chromosome separately.

Methods
This work uses the human genome (GRCh38 assembly sequences) and searches for features beyond the already well-known repetition structures published in the literature. Thus, we used the pre-masked sequences available from the UCSC Genome Browser webpage [11], in which the repeats reported by RepeatMasker [12] and Tandem Repeats Finder [13] are masked with N symbols. Unknown or ambiguous nucleotides are also usually coded with N symbols. All these ambiguous nucleotides are considered separators that split the sequence into a set of unambiguous subsequences.
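The splitting step can be illustrated with a minimal sketch. The function name `split_unambiguous` is hypothetical, and the sketch assumes an upper-case, hard-masked sequence in which every symbol outside {A, C, G, T} acts as a separator; it is not the pipeline used in this work.

```python
import re

def split_unambiguous(seq):
    """Yield (start, subsequence) for each maximal run of A/C/G/T symbols.

    Assumption: the input is upper case and any other symbol (such as N)
    is treated as a separator.
    """
    for match in re.finditer(r"[ACGT]+", seq):
        yield match.start(), match.group()

# Two runs of N split this toy sequence into three unambiguous pieces.
pieces = list(split_unambiguous("ACGNNNTTANG"))
print(pieces)
```

Keeping the start offset of each piece allows positions found later to be mapped back to chromosome coordinates.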
Inverted repeats are nucleotide sequences that can form self-complementary pairings between their two halves, e.g. CCTTACGnnnnnnCGTAAGG, where {A, C, G, T} is the DNA alphabet and nnnnnn represents a sequence of nucleotides of known length.
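As a small illustration of this definition, the sketch below checks that the second half of the example is the reverse complement of the first. The helper names `reverse_complement` and `is_inverted_repeat` are hypothetical and introduced only for this example.

```python
# Complement table for the DNA alphabet {A, C, G, T}.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]

def is_inverted_repeat(seq, k):
    """True if the last k symbols are the reverse complement of the first k."""
    return len(seq) >= 2 * k and seq[-k:] == reverse_complement(seq[:k])

# The example from the text: two arms of 7 nucleotides around a 6-symbol spacer.
example = "CCTTACG" + "nnnnnn" + "CGTAAGG"
print(reverse_complement("CCTTACG"))
print(is_inverted_repeat(example, 7))
```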
In this work, we analyse the distribution of the distances between reverse complement sequences with k = 7 nucleotides, which is similar to the distribution of the lengths of the inverted repeats with arms of 7 nucleotides. We study this distribution along the genome by dividing the complete genome into successive windows of $10^5$ nucleotides. For all words of length k, we compute the frequency distribution of the distances, m(d), between occurrences of each word and all succeeding reverse complements at distances between k and 4000.
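A minimal sketch of how m(d) could be counted on one unambiguous subsequence follows. It assumes that the distance d is the difference between the start positions of the two words, so that d = k corresponds to adjacent words; the function name `distance_counts` and the position-index approach are illustrative choices, not necessarily those used in this work.

```python
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def distance_counts(seq, k=7, d_max=4000):
    """Count m(d): pairs (word, later reverse complement) whose starts are d apart."""
    positions = defaultdict(list)  # word -> sorted list of start positions
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    m = Counter()
    for word, starts in positions.items():
        # .get avoids inserting new keys while iterating over the dictionary
        rc_starts = positions.get(word.translate(COMPLEMENT)[::-1], [])
        for i in starts:
            lo = bisect_left(rc_starts, i + k)       # smallest allowed start (d = k)
            hi = bisect_right(rc_starts, i + d_max)  # largest allowed start (d = d_max)
            for j in rc_starts[lo:hi]:
                m[j - i] += 1
    return m

# One pair of 7-nucleotide arms whose starts are 13 positions apart.
print(distance_counts("CCTTACG" + "AAAAAA" + "CGTAAGG"))
```

Indexing the start positions of each word keeps the search for succeeding reverse complements to a binary search per occurrence, which matters for windows of $10^5$ nucleotides.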

Measuring the concentration of inverted repeats
In order to evaluate the behaviour of the observed values m(d), the inverted repeat cumulative frequencies are compared to the corresponding expected values obtained from a Markov chain reference model of order 7. We assumed a Markov model of order k to estimate p(d), since the occurrences of reverse complements cannot be considered independent.

Expected values under higher order
We use a z-score, as proposed recently [14], to measure the deviation of the observed values from the expected values obtained from the Markov model. In order to measure the concentration of inverted repeats over a set of successive distances, we compute the sum of all T values between two bounds $d_1$ and $d_2$, $S_{[d_1,d_2]} = \sum_{d=d_1}^{d_2} T(d)$, with T(d) a z-score adjusted to account for the effect of the presence of ambiguous symbols in the sequence [14].
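The sketch below illustrates the two ingredients of the measure under simplifying assumptions. It uses the plain z-score implied by a binomial model with n(d) possible word pairs and probability p(d), and it does not reproduce the adjustment for ambiguous symbols described in [14]. The function names `z_score` and `s_measure` are hypothetical.

```python
from math import sqrt

def z_score(m_d, n_d, p_d):
    """Plain binomial z-score: (observed - expected) / standard deviation.

    Assumption: this is the unadjusted form; the adjusted T(d) of [14]
    is not reproduced here.
    """
    return (m_d - n_d * p_d) / sqrt(n_d * p_d * (1 - p_d))

def s_measure(scores, d1, d2):
    """Sum the per-distance scores over d1 <= d <= d2; `scores` maps d -> score."""
    return sum(scores[d] for d in range(d1, d2 + 1))

# Toy numbers: 60 observed against 50 expected, with standard deviation 5.
print(z_score(60, 100, 0.5))
print(s_measure({7: 1.0, 8: 2.0, 9: -0.5}, 7, 8))
```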

Simulation study:
A control scenario simulation was developed and run to evaluate the results of the S measure under controlled conditions. Control scenario conditions:
- 24 sequences, each with the same size as one of the human chromosomes;
- the same number of ambiguous symbols (Ns), in the same positions as in the human chromosomes;
- the DNA sequences were generated by a Markov model of order 7;
- the probabilities of the words and the transition matrices of the Markov model were estimated from each chromosome sequence.
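The conditions above can be sketched as follows. This is a minimal illustration and not the simulation code used in this work: the function names are hypothetical, the handling of contexts never seen in training (restarting from a randomly drawn observed context) is an assumption, and the sequence is generated in one pass before the Ns are copied over from the template.

```python
import random
import re
from collections import Counter, defaultdict

def train_markov(subseqs, order=7):
    """Count (context -> next symbol) transitions over unambiguous subsequences."""
    counts = defaultdict(Counter)
    for s in subseqs:
        for i in range(len(s) - order):
            counts[s[i:i + order]][s[i + order]] += 1
    return counts

def simulate(counts, length, order=7, rng=random):
    """Generate `length` symbols, starting from a context drawn by its frequency."""
    contexts = list(counts)
    weights = [sum(c.values()) for c in counts.values()]
    out = list(rng.choices(contexts, weights)[0])
    while len(out) < length:
        nxt = counts.get("".join(out[-order:]))
        if not nxt:
            # Assumption: an unseen context restarts from a random observed context.
            out.extend(rng.choices(contexts, weights)[0])
            continue
        out.append(rng.choices(list(nxt), list(nxt.values()))[0])
    return "".join(out[:length])

def simulate_like(template, order=7, rng=random):
    """Simulated sequence with the same length and N positions as `template`."""
    subseqs = [s for s in re.split("N+", template) if len(s) > order]
    counts = train_markov(subseqs, order)
    sim = simulate(counts, len(template), order, rng)
    return "".join("N" if t == "N" else s for t, s in zip(template, sim))

# Toy template with a low order so that the example runs instantly.
template = "ACGTACGTACGTNNNACGTACGTAC"
print(simulate_like(template, order=2, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes a simulated chromosome reproducible.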
We compute the critical values from the empirical distribution of the simulated genome, assuming a significance level of 5%. The critical values (cv) are the 0.95 quantiles of the S values of all windows in each chromosome (or globally). Windows with S values surpassing the critical value are considered significantly enriched.
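The decision rule can be sketched as below. The quantile definition is not specified in the text, so the nearest-rank estimator used here is an assumption, and the function names are hypothetical.

```python
from math import ceil

def empirical_quantile(values, q=0.95):
    """Nearest-rank empirical quantile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, ceil(q * len(ordered)))
    return ordered[rank - 1]

def enriched_windows(real_s, simulated_s, q=0.95):
    """Return the critical value and the indices of real windows exceeding it."""
    cv = empirical_quantile(simulated_s, q)
    return cv, [i for i, s in enumerate(real_s) if s > cv]

# Toy data: simulated S values 1..100 and four real windows.
cv, hits = enriched_windows([90, 96, 95, 200], list(range(1, 101)))
print(cv, hits)
```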

Data analysis
The data analysis is based on M(d) and $S_{[d_1,d_2]}$. It is divided into 3 parts:
- comparison of inverted repeat enrichment between chromosomes;
- analysis of inverted repeat enrichment as a function of inverted repeat length;
- analysis of inverted repeat enrichment as a function of position along each chromosome.
Inverted repeat lengths were grouped into nine classes: the global class $I_t$, covering all considered lengths and measured by $S_{[7,4000]}$, and eight length classes $I_1$ through $I_8$.

Results and discussion

Comparison of inverted repeats enrichment between chromosomes
Figure 1 shows the boxplots of $S_{[7,4000]}$ for all chromosomes and all considered inverted repeat lengths (length class $I_t$), both for the human genome and the control scenario. The distributions of the S values in the human genome are reasonably similar across chromosomes: all distributions show positive mean, positive skew and similar dispersion. The distributions of the control scenario are clearly different from those of the human genome: all distributions have null mean, are symmetric and have a comparatively much lower dispersion.
Table 1 shows the percentage of windows that were considered to have significantly enriched concentration of inverted repeats in each chromosome and for each inverted repeat length class. The percentage of enriched windows for all considered inverted repeat lengths (class $I_t$) in the complete genome is quite large (66.3%). This confirms the strong positive skew of the S distribution previously observed.

Analysis of inverted repeats enrichment as a function of inverted repeats length
Table 1 also shows the percentage of windows with significantly enriched concentration of inverted repeats in each chromosome, for each length class ($I_1$ through $I_8$). The percentage decreases as the inverted repeat length increases.
Figure 2 shows the boxplots of S values for each length class in the human genome and in the control scenario.
The human genome reveals some regions with very high and significant enrichment in all length classes. The control scenario shows that the dispersion of S decreases as the inverted repeat length increases. This reveals that the S measure is sensitive to the inverted repeat length. This limitation of the measure does not compromise our analysis, since critical values were obtained from the control scenario separately for each length class.
Figure 3 shows the variations of S in the different length classes along chromosome X.
Windows with similar enrichment seem to form regional clusters along the genome.

Analysis of the inverted repeats enrichment as a function of position along each chromosome
Figure 4 shows the absolute frequencies of occurrence of each inverted repeat length in three different windows of chromosome X. The top and bottom plots pertain to the windows with the highest and lowest $S_{[7,4000]}$ values in that chromosome, respectively. The middle plot pertains to the window with the highest $S_{[2001,2500]}$ value. The Supplementary Material contains the absolute frequency plots for windows selected according to the same criteria in every chromosome.
The selected windows display very different distributions. The distributions in the top and middle plots both present frequency values much higher than the expected values. The bottom plot represents a rare example of a window with negative S values. The periodic regularities seen in the middle plot were previously identified and studied in [9].
Table 1: Percentage of windows with significantly enriched concentration of inverted repeats for each length class. Column "wins" shows the total number of windows in each chromosome. Row "Global %" shows the percentages for the complete genome. Row "Mean cv" shows the weighted mean of the critical values.

Conclusions
The analysis carried out in this work revealed several human genome regions with highly enriched occurrence of inverted repeats in all human chromosomes. The enrichment in inverted repeat concentration is not uniform along the genome and depends on the repeat lengths, being more prominent for short lengths.
Even though we removed well-known repetitive sequences, we still found regions with atypically enriched concentration of inverted repeats. Further studies are needed to understand the reasons for this phenomenon, which may involve analysing the genomic word composition of these regions.
Markov chain for DNA sequences: Let $M(d)$ be the random variable that represents the total number of inverted repeat occurrences at distance $d$ in a genomic region of length $L$, and $n(d)$ the corresponding total number of possible word pairs at distance $d$, where $d = k, k + 1, \ldots, 4000$ and $d = k$ means that the two words (of length $k$) are in adjacent positions. Let $p(d)$ be the probability of occurrence of inverted repeats at distance $d$. If we assume independence between trials, $M(d)$ follows a binomial distribution, $M(d) \sim B(n(d), p(d))$, with expected value $n(d)\,p(d)$ and standard deviation $\sqrt{n(d)\,p(d)\,(1 - p(d))}$.

Figure 3: Heatmap of the S values for the inverted repeat length classes $I_1, I_2, \ldots, I_8$ in chromosome X. Black shows enrichment and red shows reduction of the frequency of inverted repeats.