Published by De Gruyter, November 25, 2014

Markovianness and conditional independence in annotated bacterial DNA

Andrew Hart and Servet Martínez

Abstract

We explore the probabilistic structure of DNA in a number of bacterial genomes and conclude that a form of Markovianness is present at the boundaries between coding and non-coding regions, that is, the sequence of START and STOP codons annotated for the bacterial genome. This sequence is shown to satisfy a conditional independence property which allows its governing Markov chain to be uniquely identified from the abundances of START and STOP codons. Furthermore, we show that the annotated sequence of STARTs and STOPs complies with Chargaff’s second parity rule.

2010 MSC: 62G10; 62M07; 62P10; 92D20

Corresponding author: Andrew Hart, Centro de Modelamiento Matemático, Universidad de Chile, Beauchef 851, Santiago, Chile, e-mail:

Acknowledgments

This work was supported by the Center for Mathematical Modeling’s (CMM) CONICYT Basal program PFB 03, and INRIA-Chile’s Communication and Information Research and Innovation Center (CIRIC) Natural-Resource Management Program. The authors extend their thanks to the CMM Mathomics Laboratory for advice.

Appendices

A Bacterial genomes

For the sake of completeness and to support the main article text, these appendices contain additional data analyses and technical discussion, but this extra material is not essential for following the article.

The 13 bacterial DNA sequences we have examined are freely available in GenBank (via the NCBI ftp server). We present a list of the name, GenBank accession number and version of each sequence in Table 7. Table 8 displays the number of annotated START/STOP codons on the primary and complementary strands of each chromosome, as well as the total in the duplex as a whole.

Table 7

List of 13 bacterial chromosomes studied and their GenBank accession numbers.

No.  Chromosome                                                       Accession.Version
1    Escherichia coli str. K-12 substr. MG1655                        NC_000913.2
2    Helicobacter pylori 26695 chromosome                             NC_000915.1
3    Staphylococcus aureus subsp. aureus MRSA252 chromosome           NC_002952.2
4    Leptospira interrogans serovar Lai str. 56601 chromosome I       NC_004342.2
5    Leptospira interrogans serovar Lai str. 56601 chromosome II      NC_004343.2
6    Streptococcus pneumoniae ATCC 700669                             NC_011900.1
7    Bacillus subtilis subsp. spizizenii str. W23 chromosome          NC_014479.1
8    Vibrio cholerae O1 str. 2010EL-1786 chromosome 1                 NC_016445.1
9    Vibrio cholerae O1 str. 2010EL-1786 chromosome 2                 NC_016446.1
10   Propionibacterium acnes TypeIa2 P.acn33 chromosome               NC_016516.1
11   Salmonella enterica subsp. enterica serovar Typhi str. P-stx-12  NC_016832.1
12   Yersinia pestis D182038 chromosome                               NC_017160.1
13   Mycobacterium tuberculosis 7199-99                               NC_020089.1

Table 8

Numbers of annotated START/STOP codons on the primary and complementary strands of 13 bacterial chromosomes.

            Number of START/STOP codons
Chromosome  Primary strand  Complementary strand  Total
NC_000913   4058            4284                  8342
NC_000915   1528            1606                  3134
NC_002952   2560            2730                  5290
NC_004342   3626            3192                  6818
NC_004343   320             266                   586
NC_011900   1910            2070                  3980
NC_014479   3844            4276                  8120
NC_016445   2750            2860                  5610
NC_016446   1162            918                   2080
NC_016516   2234            2232                  4466
NC_016832   4806            4574                  9380
NC_017160   3430            3810                  7240
NC_020089   4006            3962                  7968

The last column displays the total number of START/STOP codons annotated on the duplex.

B Measuring deviation from Markovianness

As mentioned at the end of Section 2, even though we have shown Markovianness of the annotated sequence of START and STOP codons, we can illustrate this phenomenon in a less rigorous way. We have found statistics which are sensitive to deviations from Markovianness in sequences over a finite set of symbols. We begin by describing one of these measures and demonstrate it using simulated Markovian and non-Markovian data. Then, we shall compare this measure for annotated START/STOP codons in bacterial DNA sequences with the same measure applied to simulations of Markovian and non-Markovian sequences possessing similar statistical properties to those derived from the annotation data.

Let $(X_t : t = 0, 1, \ldots)$ be a sequence of symbols in $I$, where $I$ is the set of START/STOP codons for a bacterial genome.

If the sequence $(X_t)$ has the Markov property, then

$$\mathbb{P}(X_{t+1}=k \mid X_t=j, X_{t-1}=i, X_{t-2}=i_{t-2}, \ldots, X_0=i_0) = \mathbb{P}(X_{t+1}=k \mid X_t=j), \tag{8}$$

for all integers $t>0$ and symbols $i_0, \ldots, i_{t-2}, i, j, k \in I$. Multiplying both sides of (8) by $\mathbb{P}(X_t=j, X_{t-1}=i, X_{t-2}=i_{t-2}, \ldots, X_0=i_0)$ and summing over $i_0, i_1, \ldots, i_{t-2} \in I$, it can be seen that the Markov property implies

$$\mathbb{P}(X_{t-1}=i, X_t=j, X_{t+1}=k) = \mathbb{P}(X_{t+1}=k \mid X_t=j, X_{t-1}=i)\,\mathbb{P}(X_t=j, X_{t-1}=i) = \frac{\mathbb{P}(X_{t-1}=i, X_t=j)\,\mathbb{P}(X_t=j, X_{t+1}=k)}{\mathbb{P}(X_t=j)}.$$

Assume that $(X_t)$ is started according to a stationary distribution $\pi$. Then the above quantity does not depend on $t$ and we can write it in the more compact form

$$\mathbb{P}([ijk]) = \frac{\mathbb{P}([ij])\,\mathbb{P}([jk])}{\mathbb{P}([j])},$$

where $[i]$, $[ij]$ and $[ijk]$ denote the cylinder sets of length one, two and three symbols respectively. Therefore, when $(X_t)$ is a Markovian sequence, $M_3(i,j,k)=0$ for all $i, j, k \in I$, where

$$M_3(i,j,k) = \mathbb{P}([ijk]) - \frac{\mathbb{P}([ij])\,\mathbb{P}([jk])}{\mathbb{P}([j])}.$$
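The identity above can be checked numerically on a small example. The following is a minimal sketch, not the authors' code: the transition matrix `P_TOY` contains made-up values for two START and two STOP codons, and the stationary distribution is approximated by a lazy power iteration (an implementation choice made here because an alternating START/STOP chain is periodic).

```python
from itertools import product

# Hypothetical toy transition matrix (values are illustrative only):
# START codons move to STOP codons and vice versa.
P_TOY = {
    "ATG": {"TAA": 0.7, "TGA": 0.3},
    "GTG": {"TAA": 0.4, "TGA": 0.6},
    "TAA": {"ATG": 0.9, "GTG": 0.1},
    "TGA": {"ATG": 0.8, "GTG": 0.2},
}

def stationary_distribution(P, iterations=2000):
    """Approximate the stationary distribution of P.

    Uses the lazy update pi <- (pi + pi P) / 2, which has the same
    stationary distribution as P but avoids periodic oscillation.
    """
    states = list(P)
    pi = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):
        step = {k: sum(pi[j] * P[j].get(k, 0.0) for j in states) for k in states}
        pi = {k: 0.5 * pi[k] + 0.5 * step[k] for k in states}
    return pi

def m3_exact(P):
    """Return M3(i, j, k) for a stationary Markov chain with matrix P."""
    states = list(P)
    pi = stationary_distribution(P)
    # Cylinder probabilities of length two under stationarity.
    p2 = {(i, j): pi[i] * P[i].get(j, 0.0) for i, j in product(states, repeat=2)}
    m3 = {}
    for i, j, k in product(states, repeat=3):
        p3 = pi[i] * P[i].get(j, 0.0) * P[j].get(k, 0.0)
        m3[(i, j, k)] = p3 - p2[(i, j)] * p2[(j, k)] / pi[j]
    return m3
```

Every entry returned by `m3_exact(P_TOY)` is zero up to floating-point rounding, as expected for a Markov chain started from its stationary distribution.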

Note that the converse is not true, that is, $M_3(i,j,k)=0$ for all $i, j, k \in I$ does not imply that $(X_t)$ is Markovian. We claim that in spite of this limitation it is still worthwhile considering $M_3(i,j,k)$. Firstly, since genes are generally very long sequences of codons, symbols in $(X_t)$ separated by three or more lags, e.g., $X_t$ and $X_{t-3}$, not only abut different coding sequences, but may correspond to distant loci in the original DNA sequence. Given that it is not unreasonable to suppose that a greater separation between loci on a strand is generally accompanied by a reduction in dependence between them, it follows that the dependence between symbols in $(X_t)$ is likely to decrease as the distance that separates them increases. Secondly, we will also repeat the present analysis using a function of four consecutive symbols in $(X_t)$ analogous to $M_3(i,j,k)$.

It is straightforward to estimate the quantities $M_3(i,j,k)$ for a sequence by counting the occurrences of single codons, pairs of codons and groups of three codons. If $N_i$, $N_{ij}$ and $N_{ijk}$ denote the numbers of $i$, $ij$ and $ijk$, respectively, then $M_3(i,j,k)$ can be estimated by

$$\hat{M}_3(i,j,k) = \frac{N_{ijk}}{n} - \frac{N_{ij}N_{jk}}{nN_j},$$

where $n$ is the length of the sequence. For purposes of calculating $N_i$, $N_{ij}$ and $N_{ijk}$, we treat the sequence $(X_t)$ as circular so that $\sum_{i\in I} N_i = \sum_{i,j\in I} N_{ij} = \sum_{i,j,k\in I} N_{ijk} = n$. This also means that $N_{ij} = \sum_{k\in I} N_{ijk}$ and $N_i = \sum_{j\in I} N_{ij}$.
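The estimator can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the authors' code: the function names are hypothetical, and the symbol set is taken to be the symbols that actually occur in the sequence, so that every $N_j$ is positive.

```python
from collections import Counter
from itertools import product

def circular_ngram_counts(seq, k):
    """Count k-grams of seq, wrapping around the end (circular sequence)."""
    n = len(seq)
    return Counter(tuple(seq[(t + s) % n] for s in range(k)) for t in range(n))

def m3_hat(seq):
    """Return {(i, j, k): N_ijk/n - N_ij*N_jk/(n*N_j)} for all symbol triples."""
    n = len(seq)
    symbols = sorted(set(seq))  # assumption: I = symbols present in seq
    n1 = circular_ngram_counts(seq, 1)
    n2 = circular_ngram_counts(seq, 2)
    n3 = circular_ngram_counts(seq, 3)
    estimates = {}
    for i, j, k in product(symbols, repeat=3):
        expected = n2[(i, j)] * n2[(j, k)] / (n * n1[(j,)])
        estimates[(i, j, k)] = n3[(i, j, k)] / n - expected
    return estimates
```

Because the counts are circular, the single, pair and triple counts each sum to $n$, which is what makes the estimated deviations sum to zero.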

Now, $\hat{M}_3 = (\hat{M}_3(i,j,k) : i, j, k \in I)$ is a collection of $|I|^3$ values, each of which is the deviation of the corresponding cylinder $[ijk]$ from Markovianness. Note that, because sequences of START/STOP codons alternate between START codons and STOP codons, many elements of $M_3$ and $\hat{M}_3$ will be zero. For example, all but the last bacterial sequence listed in Table 7 have three START codons {ATG, GTG, TTG} and three STOP codons {TAA, TAG, TGA}. The last bacterium, Mycobacterium tuberculosis 7199-99, employs an extra START codon, CTG. Thus, $|I|=6$ in general and $\hat{M}_3$ will have 216 elements, of which at least 162 will be zero. The mean of $\hat{M}_3$ is

$$\bar{\hat{M}}_3 = \frac{1}{n}\sum_{i,j,k\in I}\frac{N_{ijk}}{n} - \frac{1}{n}\sum_{i,j,k\in I}\frac{N_{ij}N_{jk}}{nN_j} = \frac{n}{n^2} - \frac{1}{n}\sum_{i,j\in I}\frac{N_{ij}N_j}{nN_j} = \frac{1}{n} - \frac{1}{n}\sum_{i,j\in I}\frac{N_{ij}}{n} = \frac{1}{n} - \frac{1}{n} = 0.$$
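The count of structural zeros quoted above (162 of 216 for triplets) follows from the alternation of STARTs and STOPs: any word in which two consecutive symbols are of the same kind cannot occur. A small sketch that enumerates these words is given below; the function name is hypothetical.

```python
from itertools import product

STARTS = ("ATG", "GTG", "TTG")
STOPS = ("TAA", "TAG", "TGA")

def count_structural_zeros(starts, stops, k):
    """Return (total words of length k, words that break START/STOP alternation)."""
    kind = {s: "start" for s in starts}
    kind.update({s: "stop" for s in stops})
    total = 0
    alternating = 0
    for word in product(starts + stops, repeat=k):
        total += 1
        # A word can occur only if consecutive symbols are of different kinds.
        if all(kind[a] != kind[b] for a, b in zip(word, word[1:])):
            alternating += 1
    return total, total - alternating
```

With three START and three STOP codons, `count_structural_zeros(STARTS, STOPS, 3)` gives `(216, 162)`, and the same call with `k=4` gives `(1296, 1134)`.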

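Now, the sample standard deviation of $\hat{M}_3$ provides a statistic that is responsive to departures from Markovianness:

$$S_3 = \sigma(\hat{M}_3) = \sqrt{\frac{1}{|I|^3-1}\sum_{i,j,k\in I}\bigl(\hat{M}_3(i,j,k) - \bar{\hat{M}}_3\bigr)^2} = \sqrt{\frac{1}{|I|^3-1}\sum_{i,j,k\in I}\hat{M}_3^2(i,j,k)}.$$

This is because $S_3=0$ if and only if $\hat{M}_3(i,j,k)=0$ for all $i, j, k \in I$.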
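As a sketch of how the statistic might be computed once the deviations are available, the helper below takes a mapping from symbol words to estimated deviations (cells missing from the mapping are treated as zero) and the alphabet size $|I|$. The function name and the sparse-dictionary input are assumptions made for illustration.

```python
import math

def deviation_sd(deviations, alphabet_size, order=3):
    """Sample standard deviation of the |I|**order deviations.

    Relies on the fact that the deviations have mean zero, so the sum of
    squared deviations is divided by |I|**order - 1 directly.
    """
    cells = alphabet_size ** order
    sum_of_squares = sum(v * v for v in deviations.values())
    return math.sqrt(sum_of_squares / (cells - 1))
```

The function returns zero exactly when every supplied deviation is zero, matching the equivalence stated above.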

To illustrate this, consider Figure 2, which displays a kernel density estimate for $\hat{M}_3$ in three cases. The first case shows the density of $\hat{M}_3$ for the sequence of START/STOP codons annotated on the primary strand of the Escherichia coli K-12 genome. Let us denote this sequence by X1. The second case shows the density estimated for a sequence X2 of START/STOP codons simulated from a Markov chain using a transition matrix estimated from X1. The idea is that X1 and X2 be statistically the same for single codons and pairs of consecutive codons so that $\hat{M}_3$ only highlights the kind of mechanism, Markovian or non-Markovian, driving the process. In the third case, a latent AR(2) process was simulated using the following scheme.

Figure 2 Empirical illustration of the efficacy of using the standard deviation as a measure of how far a sequence deviates from the Markov property. The top figure pertains to the START/STOP codons annotated on the primary strand of E. coli K-12. The deviations are marked on the x-axis while the curve represents a kernel density estimate of the deviations. The middle plot illustrates the same thing using a simulated Markov chain. The bottom plot was produced using non-Markovian simulations of latent AR(2) processes.

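Let $(Z_t : t = 0, 1, \ldots)$ be an AR(2) process with autoregressive coefficients $\lambda_1$ and $\lambda_2$, that is:

$$Z_t = \lambda_1 Z_{t-1} + \lambda_2 Z_{t-2} + \varepsilon_t,$$

where the innovations $\varepsilon_t$ are independently and identically distributed normal random variables with mean 0 and variance $\sigma^2$. The process $Z$ is stationary if and only if the parameters satisfy the conditions

$$\lambda_2 > -1, \quad \lambda_2 + \lambda_1 < 1 \quad\text{and}\quad \lambda_2 - \lambda_1 < 1.$$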

Note that Z is a Markov chain if and only if λ2=0 and an i.i.d. process if and only if λ2=λ1=0.
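A minimal sketch of simulating such a process is shown below. The zero initial values and the burn-in period used to approximate a stationary start are assumptions of this sketch; the text does not say how the simulations were initialised.

```python
import random

def is_stationary_ar2(lam1, lam2):
    """Check the three stationarity conditions for AR(2) coefficients."""
    return lam2 > -1 and lam2 + lam1 < 1 and lam2 - lam1 < 1

def simulate_ar2(n, lam1, lam2, sigma=1.0, burn_in=500, seed=None):
    """Simulate n values of Z_t = lam1*Z_{t-1} + lam2*Z_{t-2} + eps_t."""
    rng = random.Random(seed)
    z_prev1 = z_prev2 = 0.0  # assumed starting values, discarded via burn-in
    values = []
    for t in range(burn_in + n):
        z = lam1 * z_prev1 + lam2 * z_prev2 + rng.gauss(0.0, sigma)
        if t >= burn_in:
            values.append(z)
        z_prev2, z_prev1 = z_prev1, z
    return values
```

For example, `simulate_ar2(4058, -0.2, 0.4, seed=1)` produces a stationary, non-Markovian series, since $\lambda_2 \neq 0$.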

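Next, for a finite sequence $(Z_t : t = 0, 1, \ldots, n)$, let $q_Z(p)$ denote the quantile function, that is,

$$q_Z(p) = \max\Bigl\{z : \Bigl(\frac{1}{n}\sum_{t=0}^{n-1}\mathbf{1}\{Z_t \le z\}\Bigr) \le p\Bigr\}.$$

Finally, we define the stochastic process $(Y_t : t = 0, 1, \ldots, n)$. To do this, we require that the symbols in $I$ are ordered in some way. The order does not matter; we merely need to be able to say for $i, j \in I$ that either $i$ comes before $j$ or $j$ comes before $i$. Let $\underline{i}$ denote the symbol in $I$ that comes before all others in $I$. The latent AR(2) process is then defined as

$$Y_t = \begin{cases} \underline{i} & \text{if } Z_t \le q_Z(\pi_{\underline{i}}), \\ i & \text{if } q_Z\bigl(\sum_{j<i}\pi_j\bigr) < Z_t \le q_Z\bigl(\sum_{j\le i}\pi_j\bigr). \end{cases}$$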

Due to the way Y has been constructed, π is its invariant state distribution. Also, Y will be Markovian if and only if Z is Markovian (equivalently, λ2=0).
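The construction of the latent process can be sketched with ranks instead of explicit quantile thresholds. This rank-based variant is an assumption of the sketch: each value of the series receives the symbol whose cumulative-probability interval contains its empirical CDF value, which is intended to reproduce the thresholding rule above up to ties and rounding. The function name is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

def latent_symbols(z, ordered_symbols, pi):
    """Map a real-valued series z to symbols with target frequencies pi.

    ordered_symbols and pi are parallel lists; pi should sum to one.
    """
    n = len(z)
    cumulative = list(accumulate(pi))
    order = sorted(range(n), key=z.__getitem__)  # time indices sorted by value
    y = [None] * n
    for rank, t in enumerate(order, start=1):
        u = rank / n  # empirical CDF evaluated at z[t]
        # First symbol whose cumulative probability reaches u; the small
        # tolerance guards against floating-point rounding in the sums.
        idx = bisect_left(cumulative, u - 1e-12)
        y[t] = ordered_symbols[min(idx, len(ordered_symbols) - 1)]
    return y
```

Applied to an AR(2) series, the output inherits the serial dependence of the series while its symbol frequencies match `pi` as closely as the sequence length allows.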

To obtain a non-Markovian sequence, we simulated a sequence X3 of START and STOP codons from the latent AR(2) process described above. In order to obtain a non-Markovian sequence with the same distribution of symbols as X1, we set λ1=–0.2 and λ2=0.4, and estimated π from X1.

In Figure 2, the densities of the deviations $\hat{M}_3$ for the sequences derived from E. coli and the Markov chain simulation are fairly similar. Their statistics $S_3$ are also comparable. In contrast, the density of $\hat{M}_3$ for the non-Markovian simulation has much longer tails and exhibits much greater dispersion. The x-axis of the third plot in the figure has been truncated to the interval [–0.005, 0.005] to maintain clarity and allow for easy comparison with the other two densities. All three graphs have been plotted on the same scale also for this reason. Prior to truncation, the density of $\hat{M}_3$ for the non-Markovian sample spanned the interval [–0.019, 0.0327] and 14 data points are omitted by the truncation. The measure $S_3$ for the simulated latent AR(2) process is almost an order of magnitude larger than it is for the other two cases.
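The Markovian comparison sequence X2 is obtained by simulating a Markov chain from a transition matrix. A minimal sketch of that step follows; the nested-dictionary representation of the matrix and the explicit choice of starting symbol are assumptions made here.

```python
import random

def simulate_markov_chain(P, start, n, seed=None):
    """Simulate n symbols from transition matrix P, beginning at `start`.

    P maps each symbol to a dict {next_symbol: probability}.
    """
    rng = random.Random(seed)
    seq = [start]
    while len(seq) < n:
        row = P[seq[-1]]
        symbols = list(row)
        weights = [row[s] for s in symbols]
        seq.append(rng.choices(symbols, weights=weights, k=1)[0])
    return seq
```

When `P` only allows START-to-STOP and STOP-to-START transitions, the simulated sequence alternates between the two kinds of codon, as the annotated sequences do.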

Table 9 displays the value of S3 computed on the primary strand of the 13 bacterial DNA sequences. The second column shows S3 derived from genome annotation data. The third and fourth columns show the measure of deviation from Markovianness as applied to Markovian and non-Markovian sequences respectively simulated as described above. For each of these columns, a sequence of the same length as the annotated START/STOP codons was simulated 1000 times and the mean value of S3 over the 1000 replications is shown in the table. For the non-Markovian case, the autoregressive parameters λ1 and λ2 were selected uniformly at random from the set of values that give rise to a stationary AR(2) process for each simulation. It is quite evident that the values of S3 for the annotation data and the Markovian simulations are of the same order of magnitude while the non-Markovian simulations result in values of S3 that are from twice to an order of magnitude greater. The final column in the table shows the length of the sequence of START and STOP codons annotated for each of the DNA sequences. There appears to be no relationship between the length of the START/STOP codon sequence and any of the measures of deviation from Markovianness calculated, so S3 does not seem to be overly sensitive to this length. Performing the same analysis on the complementary strands yields similar results which are shown in Table 10.
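One way to select autoregressive parameters uniformly at random from the stationary region is rejection sampling from its bounding box. The text does not state which sampling method was used, so the sketch below is only one possible reading; the function name is hypothetical.

```python
import random

def sample_stationary_ar2(rng):
    """Draw (lam1, lam2) uniformly from the AR(2) stationarity triangle.

    Candidates are drawn uniformly from the box [-2, 2] x [-1, 1] and
    rejected until they satisfy the three stationarity conditions.
    """
    while True:
        lam1 = rng.uniform(-2.0, 2.0)
        lam2 = rng.uniform(-1.0, 1.0)
        if lam2 > -1 and lam2 + lam1 < 1 and lam2 - lam1 < 1:
            return lam1, lam2
```

The triangle covers half of the bounding box, so on average two candidates are needed per accepted pair.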

Table 9

A measure of Markovianness based on a codon-triplet analysis applied to the annotated START and STOP codons on the primary strand sequences of bacterial genomes, together with Markovian and non-Markovian simulations.

Chromosome  Genome    Measure from                                      Number of
                      Markovian simulations  Non-Markovian simulations  STARTs and STOPs
NC_000913   0.000483  0.000379               0.003977                   4058
NC_000915   0.000617  0.000798               0.003590                   1528
NC_002952   0.000654  0.000487               0.003731                   2560
NC_004342   0.000695  0.000527               0.003230                   3626
NC_004343   0.002731  0.001805               0.004268                   320
NC_011900   0.000801  0.000579               0.003776                   1910
NC_014479   0.000373  0.000495               0.003509                   3844
NC_016445   0.000715  0.000543               0.003598                   2750
NC_016446   0.000560  0.000836               0.003738                   1162
NC_016516   0.000882  0.000513               0.003983                   2234
NC_016832   0.001453  0.000419               0.003632                   4806
NC_017160   0.000592  0.000511               0.003502                   3430
NC_020089   0.000483  0.000460               0.002453                   4006

Table 10

A measure of Markovianness based on a codon-triplet analysis applied to the annotated START and STOP codons on the complementary strand sequences of bacterial genomes, together with Markovian and non-Markovian simulations.

Chromosome  Genome    Measure from                                      Number of
                      Markovian simulations  Non-Markovian simulations  STARTs and STOPs
NC_000913   0.000786  0.000353               0.004120                   4284
NC_000915   0.001364  0.000766               0.003741                   1606
NC_002952   0.000748  0.000455               0.003839                   2730
NC_004342   0.000736  0.000571               0.003334                   3192
NC_004343   0.001676  0.001921               0.004303                   266
NC_011900   0.000587  0.000588               0.003635                   2070
NC_014479   0.000521  0.000468               0.003460                   4276
NC_016445   0.001121  0.000526               0.003552                   2860
NC_016446   0.000750  0.000976               0.003687                   918
NC_016516   0.000949  0.000528               0.003908                   2232
NC_016832   0.000705  0.000417               0.003508                   4574
NC_017160   0.000644  0.000497               0.003606                   3810
NC_020089   0.000509  0.000461               0.002546                   3962

As commented above, $M_3(i,j,k)=0$ for all $i, j, k \in I$ (equivalently, the standard deviation of $M_3$ being zero) is necessary for $(X_t)$ to be Markovian, but it does not imply Markovianness. This is because the Markov property holds for histories of arbitrary length, not merely lengths of one or two. In order to compensate in part for this weakness in our measure of deviation from Markovianness, we can also consider Markovianness in terms of codon quadruplets. In this case, $[ijkl]$ is the cylinder set for quadruplet $ijkl$. We define

$$M_4(i,j,k,l) = \mathbb{P}([ijkl]) - \frac{\mathbb{P}([ijk])\,\mathbb{P}([kl])}{\mathbb{P}([k])}, \qquad i, j, k, l \in I.$$

Now, if $(X_t)$ is Markovian, then $M_4(i,j,k,l)=0$ for all $i, j, k, l \in I$. We note that for most of the bacteria we examined, the alternating nature of their sequences of START and STOP codons means that a minimum of 1134 elements of $M_4 = (M_4(i,j,k,l) : i, j, k, l \in I)$ will be zero, regardless of whether or not the sequence of STARTs and STOPs is Markovian. In a manner similar to the case for $M_3$, we can estimate $M_4(i,j,k,l)$ by

$$\hat{M}_4(i,j,k,l) = \frac{N_{ijkl}}{n} - \frac{N_{ijk}N_{kl}}{nN_k},$$

where $N_{ijkl}$ is the number of times $ijkl$ appears in $(X_t)$, which is once again treated as circular. Furthermore, the mean $\bar{\hat{M}}_4 = 0$ and

$$S_4 = \sigma(\hat{M}_4) = \sqrt{\frac{1}{|I|^4-1}\sum_{i,j,k,l\in I}\hat{M}_4(i,j,k,l)^2}$$

constitutes a measure of deviation from Markovianness in terms of codon quadruplets analogously to S3 for codon triplets.
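The triplet and quadruplet statistics share one pattern, which the following sketch exploits: for a word of a given order, the observed frequency is compared with the product of its prefix count and its final pair count. The function names are hypothetical, and only cells that can be non-zero are stored; all other cells contribute zero to the sum of squares.

```python
import math
from collections import Counter

def circular_counts(seq, k):
    """Count k-grams of seq, treating the sequence as circular."""
    n = len(seq)
    return Counter(tuple(seq[(t + s) % n] for s in range(k)) for t in range(n))

def deviations(seq, order):
    """Estimated deviations: order=3 gives M3-hat, order=4 gives M4-hat."""
    n = len(seq)
    words = circular_counts(seq, order)
    prefixes = circular_counts(seq, order - 1)
    pairs = circular_counts(seq, 2)
    singles = circular_counts(seq, 1)
    dev = {}
    for head, n_head in prefixes.items():
        for (a, b), n_pair in pairs.items():
            if a != head[-1]:
                continue  # the pair must overlap the last symbol of the prefix
            word = head + (b,)
            dev[word] = words[word] / n - n_head * n_pair / (n * singles[(a,)])
    return dev

def s_statistic(seq, order, alphabet_size):
    """Sample standard deviation of the deviations, using their zero mean."""
    dev = deviations(seq, order)
    cells = alphabet_size ** order
    return math.sqrt(sum(v * v for v in dev.values()) / (cells - 1))
```

For instance, `s_statistic(seq, 4, 6)` would give $S_4$ for a sequence over the usual six START/STOP codons.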

We repeated the experiments for codon triplets shown in Tables 9 and 10, but using S4 instead of S3 as the measure of deviation from Markovianness. The results for the primary strand of each chromosome are displayed in Table 11 while those for each complementary strand appear in Table 12. As was the case for the experiments shown in Tables 9 and 10, both sets of results are consistent and provide a way of visualising the Markovian nature of the sequence of START/STOP codons.

Table 11

A measure of Markovianness based on a codon-quadruplet analysis applied to the annotated START and STOP codons on the primary strand sequences of bacterial genomes, together with Markovian and non-Markovian simulations.

Chromosome  Genome    Measure from                                      Number of
                      Markovian simulations  Non-Markovian simulations  STARTs and STOPs
NC_000913   0.000199  0.000188               0.001399                   4058
NC_000915   0.000408  0.000404               0.001213                   1528
NC_002952   0.000339  0.000253               0.001236                   2560
NC_004342   0.000338  0.000266               0.001084                   3626
NC_004343   0.001170  0.000904               0.001566                   320
NC_011900   0.000380  0.000280               0.001268                   1910
NC_014479   0.000249  0.000251               0.001054                   3844
NC_016445   0.000322  0.000270               0.001208                   2750
NC_016446   0.000397  0.000425               0.001228                   1162
NC_016516   0.000325  0.000271               0.001427                   2234
NC_016832   0.000513  0.000206               0.001218                   4806
NC_017160   0.000297  0.000257               0.001101                   3430
NC_020089   0.000240  0.000212               0.000706                   4006

Table 12

A measure of Markovianness based on a codon-quadruplet analysis applied to the annotated START and STOP codons on the complementary strand sequences of bacterial genomes, together with Markovian and non-Markovian simulations.

Chromosome  Genome    Measure from                                      Number of
                      Markovian simulations  Non-Markovian simulations  STARTs and STOPs
NC_000913   0.000307  0.000179               0.001372                   4284
NC_000915   0.000500  0.000378               0.001157                   1606
NC_002952   0.000259  0.000243               0.001195                   2730
NC_004342   0.000369  0.000287               0.001067                   3192
NC_004343   0.000857  0.000973               0.001650                   266
NC_011900   0.000302  0.000282               0.001251                   2070
NC_014479   0.000244  0.000234               0.001092                   4276
NC_016445   0.000478  0.000266               0.001160                   2860
NC_016446   0.000443  0.000485               0.001274                   918
NC_016516   0.000398  0.000274               0.001367                   2232
NC_016832   0.000395  0.000216               0.001165                   4574
NC_017160   0.000274  0.000244               0.001182                   3810
NC_020089   0.000220  0.000213               0.000731                   3962

C Transition matrices for the annotated START/STOP sequences

Here we display the transition matrices estimated for the sequence of START/STOP codons on the primary and complementary strands of the 13 bacterial chromosomes studied.
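For orientation, a transition matrix of this kind can be estimated from a sequence of codons by normalising pair counts row by row. The sketch below uses the ratio $N_{ij}/N_i$ with circular counting by default; the text does not spell out the estimator, so this choice and the function name are assumptions.

```python
from collections import Counter, defaultdict

def estimate_transition_matrix(seq, circular=True):
    """Estimate P[i][j] = N_ij / N_i from consecutive pairs in seq."""
    n = len(seq)
    last = n if circular else n - 1  # the circular version wraps the final symbol
    pair_counts = Counter((seq[t], seq[(t + 1) % n]) for t in range(last))
    row_totals = Counter()
    for (i, _), count in pair_counts.items():
        row_totals[i] += count
    P = defaultdict(dict)
    for (i, j), count in pair_counts.items():
        P[i][j] = count / row_totals[i]
    return dict(P)
```

Transitions that never occur are simply absent from the returned dictionary, which corresponds to the zero entries in the matrices below.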

1. Escherichia coli str. K-12 substr. MG1655, complete genome

In the estimated START/STOP transition matrices for the primary and complementary strands of E. coli K-12, there appear two anomalous entries in the top-left corner of each matrix. Inspection of the annotation (NC_000913.2) available from GenBank reveals that the 603rd gene on the primary strand spans loci 1204594–1205365 relative to the 5′ end. It starts with a GTG codon and finishes with an ATG codon.

Similarly, the two non-zero elements in the top-left corner of the transition matrix estimated for the complementary strand are explained by the 473rd gene on the complementary strand. This gene spans loci 1077648–1077866 relative to the 5′ end of the complementary strand. It starts with an ATG codon and finishes with a GTG codon.

Primary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0005  0.0000  0.0000  0.6481  0.0784  0.2729
GTG   0.0063  0.0000  0.0000  0.6266  0.0823  0.2848
TTG   0.0000  0.0000  0.0000  0.6389  0.0833  0.2778
TAA   0.8994  0.0831  0.0175  0.0000  0.0000  0.0000
TAG   0.8875  0.0938  0.0187  0.0000  0.0000  0.0000
TGA   0.9209  0.0612  0.0180  0.0000  0.0000  0.0000

Complementary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0005  0.0000  0.6572  0.0556  0.2867
GTG   0.0061  0.0000  0.0000  0.5521  0.0982  0.3436
TTG   0.0000  0.0000  0.0000  0.5405  0.1081  0.3514
TAA   0.9135  0.0707  0.0159  0.0000  0.0000  0.0000
TAG   0.8984  0.0781  0.0234  0.0000  0.0000  0.0000
TGA   0.8946  0.0863  0.0192  0.0000  0.0000  0.0000

2. Helicobacter pylori 26695 chromosome, complete genome

Primary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.5563  0.1688  0.2749
GTG   0.0000  0.0000  0.0000  0.4750  0.2000  0.3250
TTG   0.0000  0.0000  0.0000  0.5323  0.1935  0.2742
TAA   0.8106  0.1103  0.0791  0.0000  0.0000  0.0000
TAG   0.8120  0.0977  0.0902  0.0000  0.0000  0.0000
TGA   0.8224  0.0981  0.0794  0.0000  0.0000  0.0000

Complementary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.5852  0.1493  0.2655
GTG   0.0000  0.0000  0.0000  0.4583  0.1944  0.3472
TTG   0.0000  0.0000  0.0000  0.5000  0.2500  0.2500
TAA   0.8374  0.0879  0.0747  0.0000  0.0000  0.0000
TAG   0.7692  0.1077  0.1231  0.0000  0.0000  0.0000
TGA   0.8349  0.0826  0.0826  0.0000  0.0000  0.0000

3. Staphylococcus aureus subsp. aureus MRSA252 chromosome, complete

Primary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.7486  0.1434  0.1080
GTG   0.0000  0.0000  0.0000  0.7184  0.1748  0.1068
TTG   0.0000  0.0000  0.0000  0.7023  0.1221  0.1756
TAA   0.8219  0.0780  0.1001  0.0000  0.0000  0.0000
TAG   0.7880  0.0924  0.1196  0.0000  0.0000  0.0000
TGA   0.8231  0.0816  0.0952  0.0000  0.0000  0.0000

Complementary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.7276  0.1601  0.1123
GTG   0.0000  0.0000  0.0000  0.7767  0.1262  0.0971
TTG   0.0000  0.0000  0.0000  0.6460  0.1858  0.1681
TAA   0.8483  0.0748  0.0768  0.0000  0.0000  0.0000
TAG   0.8165  0.0872  0.0963  0.0000  0.0000  0.0000
TGA   0.8354  0.0633  0.1013  0.0000  0.0000  0.0000

4. Leptospira interrogans serovar Lai str. 56601 chromosome I

Primary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.5717  0.1278  0.3004
GTG   0.0000  0.0000  0.0000  0.6371  0.1694  0.1935
TTG   0.0000  0.0000  0.0000  0.5338  0.1673  0.2989
TAA   0.7892  0.0609  0.1499  0.0000  0.0000  0.0000
TAG   0.7500  0.0968  0.1532  0.0000  0.0000  0.0000
TGA   0.7646  0.0697  0.1657  0.0000  0.0000  0.0000

Complementary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.5846  0.1230  0.2923
GTG   0.0000  0.0000  0.0000  0.5000  0.1944  0.3056
TTG   0.0000  0.0000  0.0000  0.5704  0.1480  0.2816
TAA   0.7663  0.0598  0.1739  0.0000  0.0000  0.0000
TAG   0.7299  0.0806  0.1896  0.0000  0.0000  0.0000
TGA   0.7570  0.0774  0.1656  0.0000  0.0000  0.0000

5. Leptospira interrogans serovar Lai str. 56601 chromosome II

Primary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.5372  0.1405  0.3223
GTG   0.0000  0.0000  0.0000  0.6000  0.0000  0.4000
TTG   0.0000  0.0000  0.0000  0.6207  0.1034  0.2759
TAA   0.7303  0.0562  0.2135  0.0000  0.0000  0.0000
TAG   0.8500  0.1000  0.0500  0.0000  0.0000  0.0000
TGA   0.7647  0.0588  0.1765  0.0000  0.0000  0.0000

Complementary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.6186  0.1237  0.2577
GTG   0.0000  0.0000  0.0000  0.1111  0.2222  0.6667
TTG   0.0000  0.0000  0.0000  0.6667  0.1481  0.1852
TAA   0.7089  0.0506  0.2405  0.0000  0.0000  0.0000
TAG   0.6667  0.1111  0.2222  0.0000  0.0000  0.0000
TGA   0.8056  0.0833  0.1111  0.0000  0.0000  0.0000

6. Streptococcus pneumoniae ATCC 700669, complete genome

Primary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.6375  0.2071  0.1554
GTG   0.0000  0.0000  0.0000  0.6667  0.2308  0.1026
TTG   0.0000  0.0000  0.0000  0.6383  0.2340  0.1277
TAA   0.9049  0.0443  0.0508  0.0000  0.0000  0.0000
TAG   0.9200  0.0300  0.0500  0.0000  0.0000  0.0000
TGA   0.9172  0.0414  0.0414  0.0000  0.0000  0.0000

Complementary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.5979  0.2349  0.1672
GTG   0.0000  0.0000  0.0000  0.6000  0.2400  0.1600
TTG   0.0000  0.0000  0.0000  0.6750  0.2250  0.1000
TAA   0.9100  0.0418  0.0482  0.0000  0.0000  0.0000
TAG   0.9012  0.0782  0.0206  0.0000  0.0000  0.0000
TGA   0.9412  0.0294  0.0294  0.0000  0.0000  0.0000

7. Bacillus subtilis subsp. spizizenii str. W23 chromosome, complete

Primary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.6508  0.1247  0.2244
GTG   0.0000  0.0000  0.0000  0.6051  0.1385  0.2564
TTG   0.0000  0.0000  0.0000  0.5714  0.1389  0.2897
TAA   0.7602  0.1047  0.1350  0.0000  0.0000  0.0000
TAG   0.7683  0.1057  0.1260  0.0000  0.0000  0.0000
TGA   0.7863  0.0903  0.1233  0.0000  0.0000  0.0000

Complementary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.6388  0.1370  0.2243
GTG   0.0000  0.0000  0.0000  0.5870  0.1902  0.2228
TTG   0.0000  0.0000  0.0000  0.6064  0.1596  0.2340
TAA   0.7780  0.0913  0.1307  0.0000  0.0000  0.0000
TAG   0.7864  0.0809  0.1327  0.0000  0.0000  0.0000
TGA   0.7905  0.0747  0.1349  0.0000  0.0000  0.0000

8. Vibrio cholerae O1 str. 2010EL-1786 chromosome 1, complete

Primary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.6291  0.1627  0.2083
GTG   0.0000  0.0000  0.0000  0.4963  0.2296  0.2741
TTG   0.0000  0.0000  0.0000  0.5128  0.2308  0.2564
TAA   0.8508  0.0955  0.0537  0.0000  0.0000  0.0000
TAG   0.8109  0.1134  0.0756  0.0000  0.0000  0.0000
TGA   0.8562  0.0936  0.0502  0.0000  0.0000  0.0000

Complementary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.6472  0.1689  0.1839
GTG   0.0000  0.0000  0.0000  0.6000  0.2154  0.1846
TTG   0.0000  0.0000  0.0000  0.5192  0.1731  0.3077
TAA   0.8377  0.0916  0.0706  0.0000  0.0000  0.0000
TAG   0.8387  0.1008  0.0605  0.0000  0.0000  0.0000
TGA   0.8297  0.0797  0.0906  0.0000  0.0000  0.0000

9. Vibrio cholerae O1 str. 2010EL-1786 chromosome 2, complete

Primary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.6527  0.1715  0.1757
GTG   0.0000  0.0000  0.0000  0.5789  0.2105  0.2105
TTG   0.0000  0.0000  0.0000  0.3913  0.2609  0.3478
TAA   0.8072  0.1019  0.0909  0.0000  0.0000  0.0000
TAG   0.8491  0.0660  0.0849  0.0000  0.0000  0.0000
TGA   0.8482  0.1161  0.0357  0.0000  0.0000  0.0000

Complementary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.5891  0.1680  0.2429
GTG   0.0000  0.0000  0.0000  0.3913  0.2174  0.3913
TTG   0.0000  0.0000  0.0000  0.2653  0.2449  0.4898
TAA   0.8480  0.0640  0.0880  0.0000  0.0000  0.0000
TAG   0.8537  0.0366  0.1098  0.0000  0.0000  0.0000
TGA   0.8268  0.0315  0.1417  0.0000  0.0000  0.0000

10. Propionibacterium acnes TypeIA2 P.acn33 chromosome, complete

Primary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.1058  0.0579  0.8363
GTG   0.0000  0.0000  0.0000  0.1199  0.0959  0.7842
TTG   0.0000  0.0000  0.0000  0.2581  0.0323  0.7097
TAA   0.6850  0.2913  0.0236  0.0000  0.0000  0.0000
TAG   0.5467  0.4000  0.0533  0.0000  0.0000  0.0000
TGA   0.7279  0.2459  0.0262  0.0000  0.0000  0.0000

Complementary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.1098  0.0600  0.8301
GTG   0.0000  0.0000  0.0000  0.1174  0.0772  0.8054
TTG   0.0000  0.0000  0.0000  0.1143  0.1429  0.7429
TAA   0.6880  0.2640  0.0480  0.0000  0.0000  0.0000
TAG   0.6133  0.3333  0.0533  0.0000  0.0000  0.0000
TGA   0.7107  0.2620  0.0273  0.0000  0.0000  0.0000

11. Salmonella enterica subsp. enterica serovar Typhi str. P-stx-12

Primary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.5693  0.1013  0.3294
GTG   0.0000  0.0000  0.0000  0.4891  0.1397  0.3712
TTG   0.0000  0.0000  0.0000  0.5649  0.0992  0.3359
TAA   0.8540  0.0860  0.0600  0.0000  0.0000  0.0000
TAG   0.8413  0.1151  0.0437  0.0000  0.0000  0.0000
TGA   0.8466  0.1047  0.0486  0.0000  0.0000  0.0000

Complementary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.6074  0.1017  0.2909
GTG   0.0000  0.0000  0.0000  0.6144  0.0932  0.2924
TTG   0.0000  0.0000  0.0000  0.5594  0.1189  0.3217
TAA   0.8374  0.1055  0.0571  0.0000  0.0000  0.0000
TAG   0.8283  0.0944  0.0773  0.0000  0.0000  0.0000
TGA   0.8299  0.1015  0.0687  0.0000  0.0000  0.0000

12. Yersinia pestis D182038 chromosome, complete genome

Primary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.5631  0.1409  0.2960
GTG   0.0000  0.0000  0.0000  0.5172  0.1724  0.3103
TTG   0.0000  0.0000  0.0000  0.6230  0.1311  0.2459
TAA   0.8383  0.0881  0.0736  0.0000  0.0000  0.0000
TAG   0.7886  0.1220  0.0894  0.0000  0.0000  0.0000
TGA   0.8254  0.1171  0.0575  0.0000  0.0000  0.0000

Complementary strand
      ATG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.5433  0.1508  0.3059
GTG   0.0000  0.0000  0.0000  0.5222  0.1556  0.3222
TTG   0.0000  0.0000  0.0000  0.7218  0.0752  0.2030
TAA   0.8379  0.0825  0.0796  0.0000  0.0000  0.0000
TAG   0.8489  0.1043  0.0468  0.0000  0.0000  0.0000
TGA   0.8252  0.1119  0.0629  0.0000  0.0000  0.0000

13. Mycobacterium tuberculosis 7199-99 complete genome

Primary strand
      ATG     CTG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.0000  0.1513  0.3059  0.5428
CTG   0.0000  0.0000  0.0000  0.0000  0.0000  0.1429  0.8571
GTG   0.0000  0.0000  0.0000  0.0000  0.1612  0.2701  0.5687
TTG   0.0000  0.0000  0.0000  0.0000  0.2091  0.2455  0.5455
TAA   0.6254  0.0032  0.3206  0.0508  0.0000  0.0000  0.0000
TAG   0.5852  0.0069  0.3546  0.0534  0.0000  0.0000  0.0000
TGA   0.6134  0.0018  0.3279  0.0569  0.0000  0.0000  0.0000

Complementary strand
      ATG     CTG     GTG     TTG     TAA     TAG     TGA
ATG   0.0000  0.0000  0.0000  0.0000  0.1672  0.3045  0.5283
CTG   0.0000  0.0000  0.0000  0.0000  0.1250  0.1250  0.7500
GTG   0.0000  0.0000  0.0000  0.0000  0.1322  0.3098  0.5580
TTG   0.0000  0.0000  0.0000  0.0000  0.1333  0.2667  0.6000
TAA   0.6414  0.0066  0.3059  0.0461  0.0000  0.0000  0.0000
TAG   0.6030  0.0066  0.3355  0.0548  0.0000  0.0000  0.0000
TGA   0.5991  0.0019  0.3591  0.0400  0.0000  0.0000  0.0000

Published Online: 2014-11-25
Published in Print: 2014-12-1

©2014 by De Gruyter

Downloaded on 28.3.2024 from https://www.degruyter.com/document/doi/10.1515/sagmb-2014-0002/html