Abstract
We explore the probabilistic structure of DNA in a number of bacterial genomes and conclude that a form of Markovianness is present at the boundaries between coding and non-coding regions, that is, the sequence of START and STOP codons annotated for the bacterial genome. This sequence is shown to satisfy a conditional independence property which allows its governing Markov chain to be uniquely identified from the abundances of START and STOP codons. Furthermore, we show that the annotated sequence of STARTs and STOPs complies with Chargaff’s second parity rule.
Acknowledgments
This work was supported by the Center for Mathematical Modeling’s (CMM) CONICYT Basal program PFB 03, and INRIA-Chile’s Communication and Information Research and Innovation Center (CIRIC) Natural-Resource Management Program. The authors extend their thanks to the CMM Mathomics Laboratory for advice.
Appendices
A Bacterial genomes
For the sake of completeness and to support the main article text, these appendices contain additional data analyses and technical discussion, but this extra material is not essential for following the article.
The 13 bacterial DNA sequences we have examined are freely available in GenBank (via the NCBI ftp server). We present a list of the name, GenBank accession number and version of each sequence in Table 7. Table 8 displays the number of annotated START/STOP codons on the primary and complementary strands of each chromosome, as well as the total in the duplex as a whole.
No. | Chromosome | Accession.Version |
---|---|---|
1 | Escherichia coli str. K-12 substr. MG1655 | NC_000913.2 |
2 | Helicobacter pylori 26695 chromosome | NC_000915.1 |
3 | Staphylococcus aureus subsp. aureus MRSA252 chromosome | NC_002952.2 |
4 | Leptospira interrogans serovar Lai str. 56601 chromosome I | NC_004342.2 |
5 | Leptospira interrogans serovar Lai str. 56601 chromosome II | NC_004343.2 |
6 | Streptococcus pneumoniae ATCC 700669 | NC_011900.1 |
7 | Bacillus subtilis subsp. spizizenii str. W23 chromosome | NC_014479.1 |
8 | Vibrio cholerae O1 str. 2010EL-1786 chromosome 1 | NC_016445.1 |
9 | Vibrio cholerae O1 str. 2010EL-1786 chromosome 2 | NC_016446.1 |
10 | Propionibacterium acnes TypeIa2 P.acn33 chromosome | NC_016516.1 |
11 | Salmonella enterica subsp. enterica serovar Typhi str. P-stx-12 | NC_016832.1 |
12 | Yersinia pestis D182038 chromosome | NC_017160.1 |
13 | Mycobacterium tuberculosis 7199-99 | NC_020089.1 |
Chromosome | START/STOP codons, primary strand | START/STOP codons, complementary strand | Total |
---|---|---|---|
NC_000913 | 4058 | 4284 | 8342 |
NC_000915 | 1528 | 1606 | 3134 |
NC_002952 | 2560 | 2730 | 5290 |
NC_004342 | 3626 | 3192 | 6818 |
NC_004343 | 320 | 266 | 586 |
NC_011900 | 1910 | 2070 | 3980 |
NC_014479 | 3844 | 4276 | 8120 |
NC_016445 | 2750 | 2860 | 5610 |
NC_016446 | 1162 | 918 | 2080 |
NC_016516 | 2234 | 2232 | 4466 |
NC_016832 | 4806 | 4574 | 9380 |
NC_017160 | 3430 | 3810 | 7240 |
NC_020089 | 4006 | 3962 | 7968 |
The last column displays the total number of START/STOP codons annotated on the duplex.
B Measuring deviation from Markovianness
As mentioned at the end of Section 2, although we have shown that the annotated sequence of START and STOP codons is Markovian, we can also illustrate this phenomenon in a less rigorous way. We have found statistics that are sensitive to deviations from Markovianness in sequences over a finite set of symbols. We begin by describing one of these measures and demonstrating it on simulated Markovian and non-Markovian data. We then compare the measure computed on annotated START/STOP codons in bacterial DNA sequences with the same measure applied to simulated Markovian and non-Markovian sequences whose statistical properties are similar to those derived from the annotation data.
Let (Xt:t=0, 1, …) be a sequence of symbols in I, where I is the set of START/STOP codons for a bacterial genome.
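If the sequence (Xt) has the Markov property, then

$$P(X_{t+1}=k \mid X_t=j,\, X_{t-1}=i,\, X_{t-2}=i_{t-2},\, \ldots,\, X_0=i_0) = P(X_{t+1}=k \mid X_t=j) \qquad (8)$$

for all integers t>0 and symbols i0, …, it–2, i, j, k∈I. Multiplying both sides of (8) by P(Xt=j, Xt–1=i, Xt–2=it–2, …, X0=i0) and summing over i0, i1, …, it–2∈I, it can be seen that the Markov property implies

$$P(X_{t-1}=i,\, X_t=j,\, X_{t+1}=k) - \frac{P(X_{t-1}=i,\, X_t=j)\,P(X_t=j,\, X_{t+1}=k)}{P(X_t=j)} = 0.$$

Assume that (Xt) is started according to a stationary distribution π. Then the above quantity does not depend on t and we can write it in the more compact form

$$\pi[ijk] - \frac{\pi[ij]\,\pi[jk]}{\pi[j]} = 0,$$

where [i], [ij] and [ijk] denote the cylinder sets of length one, two and three symbols respectively. Therefore, when (Xt) is a Markovian sequence, M3(i, j, k)=0, for all i, j, k∈I, where

$$M_3(i,j,k) = \pi[ijk] - \frac{\pi[ij]\,\pi[jk]}{\pi[j]}.$$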
Note that the converse is not true, that is, M3(i, j, k)=0 for all i, j, k∈I does not imply that (Xt) is Markovian. We claim that, in spite of this limitation, it is still worthwhile considering M3(i, j, k). Firstly, since genes are generally very long sequences of codons, symbols in (Xt) separated by three or more lags, e.g., Xt and Xt–3, not only abut different coding sequences, but may correspond to distant loci in the original DNA sequence. Given that it is reasonable to suppose that a greater separation between loci on a strand is generally accompanied by a reduction in dependence between them, the dependence between symbols in (Xt) is likely to decrease as the distance that separates them increases. Secondly, we will also repeat the present analysis using a function of four consecutive symbols in (Xt) analogous to M3(i, j, k).
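It is straightforward to estimate the quantities M3(i, j, k) for a sequence by counting the occurrences of single codons, pairs of codons and groups of three codons. If Ni, Nij and Nijk denote the numbers of occurrences of i, ij and ijk, respectively, then M3(i, j, k) can be estimated by

$$\hat{M}_3(i,j,k) = \frac{N_{ijk}}{n} - \frac{N_{ij}\,N_{jk}}{n\,N_j},$$

where n is the length of the sequence. For purposes of calculating Ni, Nij and Nijk, we treat the sequence (Xt) as circular so that ∑i∈I Ni = ∑i,j∈I Nij = ∑i,j,k∈I Nijk = n. This also means that Nij = ∑k∈I Nijk and Ni = ∑j∈I Nij.

Now,

$$\sum_{i,j,k\in I}\hat{M}_3(i,j,k) = 1 - \sum_{j\in I}\frac{N_j}{n} = 0,$$

so the sample mean of the estimates is zero. The sample standard deviation of $\hat{M}_3 = (\hat{M}_3(i,j,k) : i,j,k\in I)$, which we denote by S3, therefore provides a measure of deviation from Markovianness. This is because S3=0 if and only if $\hat{M}_3(i,j,k)=0$ for all i, j, k∈I.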
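The counting procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the estimator takes the form N_ijk/n − N_ij·N_jk/(n·N_j), that S3 uses the usual n−1 denominator over all triples of observed symbols, and that the sequence contains at least two distinct symbols. The function names `m3_hat` and `s3` are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def m3_hat(seq):
    """Estimate M3(i, j, k) for all triples of symbols seen in seq.

    The sequence is treated as circular, so the single, pair and
    triple counts each sum to len(seq).
    """
    n = len(seq)
    symbols = sorted(set(seq))
    n1 = Counter(seq)
    n2 = Counter((seq[t], seq[(t + 1) % n]) for t in range(n))
    n3 = Counter((seq[t], seq[(t + 1) % n], seq[(t + 2) % n]) for t in range(n))
    # Assumed form of the estimator: N_ijk/n - N_ij * N_jk / (n * N_j).
    return {
        (i, j, k): n3[(i, j, k)] / n - n2[(i, j)] * n2[(j, k)] / (n * n1[j])
        for i, j, k in product(symbols, repeat=3)
    }


def s3(seq):
    """Sample standard deviation of the estimated deviations."""
    values = list(m3_hat(seq).values())
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
```

For a strictly alternating two-symbol sequence such as `["ATG", "TAA"] * 10`, every estimated deviation vanishes, so `s3` returns zero. For any sequence, the estimated deviations sum to zero because of the circular counting.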
To illustrate this, consider Figure 2, which displays a kernel density estimate for the estimated deviations $\hat{M}_3(i,j,k)$.
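Let (Zt:t=0, 1, …) be an AR(2) process with autoregressive coefficients λ1 and λ2, that is:

$$Z_t = \lambda_1 Z_{t-1} + \lambda_2 Z_{t-2} + \varepsilon_t,$$

where the innovations εt are independently and identically distributed normal random variables with mean 0 and variance σ2. The process Z is stationary if and only if the parameters satisfy the conditions

$$\lambda_1 + \lambda_2 < 1, \qquad \lambda_2 - \lambda_1 < 1, \qquad |\lambda_2| < 1.$$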
Note that Z is a Markov chain if and only if λ2=0 and an i.i.d. process if and only if λ2=λ1=0.
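Next, for a finite sequence (Zt:t=0, 1, …, n), let qZ(p) denote the quantile function, that is,

$$q_Z(p) = \inf\Big\{z : \tfrac{1}{n+1}\,\#\{t : Z_t \le z\} \ge p\Big\}, \qquad 0 < p \le 1.$$

Finally, we define the stochastic process (Yt:t=0, 1, …, n). To do this, we require that the symbols in I are ordered in some way. The order does not matter; we merely need to be able to say for i, j∈I that either i comes before j or j comes before i. Let i1 denote the symbol in I that comes before all others in I. Then, the process Y driven by the latent AR(2) process Z is defined as

$$Y_t = \begin{cases} i_1, & Z_t \le q_Z(\pi_{i_1}),\\[2pt] i, & q_Z\Big(\sum_{j \prec i}\pi_j\Big) < Z_t \le q_Z\Big(\sum_{j \preceq i}\pi_j\Big), \quad i \ne i_1, \end{cases}$$

where j ≺ i means that j comes before i, j ⪯ i means that j ≺ i or j = i, and π = (πi : i∈I) is the desired distribution of symbols.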
Due to the way Y has been constructed, π is its invariant state distribution. Also, Y will be Markovian if and only if Z is Markovian (equivalently, λ2=0).
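The construction can be illustrated with a short NumPy sketch. It is a sketch under stated assumptions, not the authors' implementation: the function names, the burn-in length, the default σ=1 and the use of `numpy.quantile` (which interpolates linearly between order statistics) as the empirical quantile function are all choices made here for illustration.

```python
import numpy as np


def is_stationary(lam1, lam2):
    """Standard stationarity conditions for an AR(2) process."""
    return lam1 + lam2 < 1 and lam2 - lam1 < 1 and abs(lam2) < 1


def simulate_latent_ar2(pi, lam1, lam2, n, sigma=1.0, burn_in=500, seed=0):
    """Simulate n symbols whose empirical distribution matches pi.

    pi is a dict mapping symbols to probabilities; its insertion
    order is used as the ordering of the symbols.
    """
    rng = np.random.default_rng(seed)
    total = n + burn_in
    eps = rng.normal(0.0, sigma, size=total)
    z = np.zeros(total)
    for t in range(2, total):
        z[t] = lam1 * z[t - 1] + lam2 * z[t - 2] + eps[t]
    z = z[burn_in:]  # discard the burn-in so the start-up values do not matter

    symbols = list(pi)
    # Interior cumulative probabilities and the matching quantiles of z.
    cum = np.cumsum([pi[s] for s in symbols])[:-1]
    cuts = np.quantile(z, cum)
    # Symbol index m is assigned when cuts[m-1] < z <= cuts[m].
    idx = np.searchsorted(cuts, z, side="left")
    return [symbols[m] for m in idx]
```

For example, `simulate_latent_ar2({"ATG": 0.5, "TAA": 0.3, "TGA": 0.2}, -0.2, 0.4, 5000)` uses the values λ1=–0.2 and λ2=0.4 quoted in the text with a made-up distribution π. Because the thresholds are quantiles of the simulated Z itself, the symbol frequencies match π up to rounding.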
To obtain a non-Markovian sequence, we simulated a sequence X3 of START and STOP codons from the latent AR(2) process described above. In order to obtain a non-Markovian sequence with the same distribution of symbols as X1, we set λ1=–0.2 and λ2=0.4, and estimated π from X1.
In Figure 2, the densities of the deviations
Table 9 displays the value of S3 computed on the primary strand of the 13 bacterial DNA sequences. The second column shows S3 derived from genome annotation data. The third and fourth columns show the measure of deviation from Markovianness as applied to Markovian and non-Markovian sequences respectively simulated as described above. For each of these columns, a sequence of the same length as the annotated START/STOP codons was simulated 1000 times and the mean value of S3 over the 1000 replications is shown in the table. For the non-Markovian case, the autoregressive parameters λ1 and λ2 were selected uniformly at random from the set of values that give rise to a stationary AR(2) process for each simulation. It is quite evident that the values of S3 for the annotation data and the Markovian simulations are of the same order of magnitude while the non-Markovian simulations result in values of S3 that are from twice to an order of magnitude greater. The final column in the table shows the length of the sequence of START and STOP codons annotated for each of the DNA sequences. There appears to be no relationship between the length of the START/STOP codon sequence and any of the measures of deviation from Markovianness calculated, so S3 does not seem to be overly sensitive to this length. Performing the same analysis on the complementary strands yields similar results which are shown in Table 10.
Chromosome | S3, genome annotation | S3, Markovian simulations | S3, non-Markovian simulations | Number of STARTs and STOPs |
---|---|---|---|---|
NC_000913 | 0.000483 | 0.000379 | 0.003977 | 4058 |
NC_000915 | 0.000617 | 0.000798 | 0.003590 | 1528 |
NC_002952 | 0.000654 | 0.000487 | 0.003731 | 2560 |
NC_004342 | 0.000695 | 0.000527 | 0.003230 | 3626 |
NC_004343 | 0.002731 | 0.001805 | 0.004268 | 320 |
NC_011900 | 0.000801 | 0.000579 | 0.003776 | 1910 |
NC_014479 | 0.000373 | 0.000495 | 0.003509 | 3844 |
NC_016445 | 0.000715 | 0.000543 | 0.003598 | 2750 |
NC_016446 | 0.000560 | 0.000836 | 0.003738 | 1162 |
NC_016516 | 0.000882 | 0.000513 | 0.003983 | 2234 |
NC_016832 | 0.001453 | 0.000419 | 0.003632 | 4806 |
NC_017160 | 0.000592 | 0.000511 | 0.003502 | 3430 |
NC_020089 | 0.000483 | 0.000460 | 0.002453 | 4006 |
Chromosome | S3, genome annotation | S3, Markovian simulations | S3, non-Markovian simulations | Number of STARTs and STOPs |
---|---|---|---|---|
NC_000913 | 0.000786 | 0.000353 | 0.004120 | 4284 |
NC_000915 | 0.001364 | 0.000766 | 0.003741 | 1606 |
NC_002952 | 0.000748 | 0.000455 | 0.003839 | 2730 |
NC_004342 | 0.000736 | 0.000571 | 0.003334 | 3192 |
NC_004343 | 0.001676 | 0.001921 | 0.004303 | 266 |
NC_011900 | 0.000587 | 0.000588 | 0.003635 | 2070 |
NC_014479 | 0.000521 | 0.000468 | 0.003460 | 4276 |
NC_016445 | 0.001121 | 0.000526 | 0.003552 | 2860 |
NC_016446 | 0.000750 | 0.000976 | 0.003687 | 918 |
NC_016516 | 0.000949 | 0.000528 | 0.003908 | 2232 |
NC_016832 | 0.000705 | 0.000417 | 0.003508 | 4574 |
NC_017160 | 0.000644 | 0.000497 | 0.003606 | 3810 |
NC_020089 | 0.000509 | 0.000461 | 0.002546 | 3962 |
As commented above, M3(i, j, k)=0 for all i, j, k∈I (equivalently, a zero standard deviation of M3) is necessary for (Xt) to be Markovian, but it does not imply Markovianness. This is because the Markov property holds for histories of arbitrary length, not merely lengths of one or two. In order to compensate in part for this weakness in our measure of deviation from Markovianness, we can also consider Markovianness in terms of codon quadruplets. In this case, [ijkl] is the cylinder set for quadruplet ijkl. We define

$$M_4(i,j,k,l) = \pi[ijkl] - \frac{\pi[ijk]\,\pi[kl]}{\pi[k]}.$$

Now, if (Xt) is Markovian, then M4(i, j, k, l)=0 for all i, j, k, l∈I. We note that for most of the bacteria we examined, the alternating nature of their sequences of START and STOP codons means that a minimum of 1134 elements of M4=(M4(i, j, k, l):i, j, k, l∈I) will be zero, regardless of whether or not the sequence of STARTs and STOPs is Markovian. In a manner similar to the case for M3, we can estimate M4(i, j, k, l) by

$$\hat{M}_4(i,j,k,l) = \frac{N_{ijkl}}{n} - \frac{N_{ijk}\,N_{kl}}{n\,N_k},$$

where Nijkl is the number of times ijkl appears in (Xt), which is once again treated as circular. Furthermore, the mean of the estimates $\hat{M}_4(i,j,k,l)$ is zero, so their sample standard deviation, S4, constitutes a measure of deviation from Markovianness in terms of codon quadruplets analogously to S3 for codon triplets.
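The figure of 1134 structural zeros can be checked by enumeration. The sketch below assumes the six-symbol alphabet (three START and three STOP codons) that applies to most of the genomes studied here. It counts the quadruplets that cannot occur in a strictly alternating START/STOP sequence. The names are illustrative only.

```python
from itertools import product

STARTS = ("ATG", "GTG", "TTG")
STOPS = ("TAA", "TAG", "TGA")
SYMBOLS = STARTS + STOPS


def alternates(quad):
    """True if consecutive codons in quad switch between START and STOP."""
    kinds = [codon in STARTS for codon in quad]
    return all(a != b for a, b in zip(kinds, kinds[1:]))


n_total = len(SYMBOLS) ** 4                      # 6^4 quadruplets in all
n_alternating = sum(alternates(q) for q in product(SYMBOLS, repeat=4))
n_structural_zeros = n_total - n_alternating     # quadruplets that never occur
print(n_structural_zeros)  # prints 1134
```

Of the 1296 possible quadruplets, only 2 × 3^4 = 162 alternate between STARTs and STOPs, which leaves 1296 − 162 = 1134 quadruplets that never appear.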
We repeated the experiments for codon triplets shown in Tables 9 and 10, but using S4 instead of S3 as the measure of deviation from Markovianness. The results for the primary strand of each chromosome are displayed in Table 11 while those for each complementary strand appear in Table 12. As was the case for the experiments shown in Tables 9 and 10, both sets of results are consistent and provide a way of visualising the Markovian nature of the sequence of START/STOP codons.
Chromosome | S4, genome annotation | S4, Markovian simulations | S4, non-Markovian simulations | Number of STARTs and STOPs |
---|---|---|---|---|
NC_000913 | 0.000199 | 0.000188 | 0.001399 | 4058 |
NC_000915 | 0.000408 | 0.000404 | 0.001213 | 1528 |
NC_002952 | 0.000339 | 0.000253 | 0.001236 | 2560 |
NC_004342 | 0.000338 | 0.000266 | 0.001084 | 3626 |
NC_004343 | 0.001170 | 0.000904 | 0.001566 | 320 |
NC_011900 | 0.000380 | 0.000280 | 0.001268 | 1910 |
NC_014479 | 0.000249 | 0.000251 | 0.001054 | 3844 |
NC_016445 | 0.000322 | 0.000270 | 0.001208 | 2750 |
NC_016446 | 0.000397 | 0.000425 | 0.001228 | 1162 |
NC_016516 | 0.000325 | 0.000271 | 0.001427 | 2234 |
NC_016832 | 0.000513 | 0.000206 | 0.001218 | 4806 |
NC_017160 | 0.000297 | 0.000257 | 0.001101 | 3430 |
NC_020089 | 0.000240 | 0.000212 | 0.000706 | 4006 |
Chromosome | S4, genome annotation | S4, Markovian simulations | S4, non-Markovian simulations | Number of STARTs and STOPs |
---|---|---|---|---|
NC_000913 | 0.000307 | 0.000179 | 0.001372 | 4284 |
NC_000915 | 0.000500 | 0.000378 | 0.001157 | 1606 |
NC_002952 | 0.000259 | 0.000243 | 0.001195 | 2730 |
NC_004342 | 0.000369 | 0.000287 | 0.001067 | 3192 |
NC_004343 | 0.000857 | 0.000973 | 0.001650 | 266 |
NC_011900 | 0.000302 | 0.000282 | 0.001251 | 2070 |
NC_014479 | 0.000244 | 0.000234 | 0.001092 | 4276 |
NC_016445 | 0.000478 | 0.000266 | 0.001160 | 2860 |
NC_016446 | 0.000443 | 0.000485 | 0.001274 | 918 |
NC_016516 | 0.000398 | 0.000274 | 0.001367 | 2232 |
NC_016832 | 0.000395 | 0.000216 | 0.001165 | 4574 |
NC_017160 | 0.000274 | 0.000244 | 0.001182 | 3810 |
NC_020089 | 0.000220 | 0.000213 | 0.000731 | 3962 |
C Transition matrices for the annotated START/STOP sequences
Here we display the transition matrices estimated for the sequence of START/STOP codons on the primary and complementary strands of the 13 bacterial chromosomes studied.
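For readers who wish to reproduce matrices of this kind, the following sketch shows one common way to estimate a transition matrix: count consecutive pairs and normalise each row. Whether the sequence was treated as circular when the matrices below were estimated is not stated here, so the `circular` flag, like the function name, is an assumption made for illustration.

```python
from collections import Counter


def estimate_transition_matrix(seq, symbols=None, circular=False):
    """Row-normalised counts of consecutive pairs in seq.

    Returns a nested dict: matrix[i][j] is the estimated probability
    of moving from symbol i to symbol j.
    """
    if symbols is None:
        symbols = sorted(set(seq))
    pairs = list(zip(seq, seq[1:]))
    if circular and seq:
        pairs.append((seq[-1], seq[0]))  # close the loop
    counts = Counter(pairs)
    matrix = {}
    for i in symbols:
        row_total = sum(counts[(i, j)] for j in symbols)
        matrix[i] = {
            j: (counts[(i, j)] / row_total if row_total else 0.0)
            for j in symbols
        }
    return matrix
```

Applied to the toy sequence `["ATG", "TAA", "ATG", "TGA", "GTG", "TAA"]`, the ATG row puts probability 0.5 on each of TAA and TGA. The START-to-START and STOP-to-STOP blocks are zero, as in the matrices that follow.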
1. Escherichia coli str. K-12 substr. MG1655, complete genome
In the estimated START/STOP transition matrices for the primary and complementary strands of E. coli K-12, two anomalous entries appear in the top-left corner of each matrix. Inspection of the annotation (NC_000913.2) available from GenBank reveals that the 603rd gene on the primary strand spans loci 1204594–1205365 relative to the 5′ end. It starts with a GTG codon and finishes with an ATG codon.
Similarly, the two non-zero elements in the top-left corner of the transition matrix estimated for the complementary strand are explained by the 473rd gene on the complementary strand. This gene spans loci 1077648–1077866 relative to the 5′ end of the complementary strand. It starts with an ATG codon and finishes with a GTG codon.
Primary strand | ATG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|
ATG | 0.0005 | 0.0000 | 0.0000 | 0.6481 | 0.0784 | 0.2729 |
GTG | 0.0063 | 0.0000 | 0.0000 | 0.6266 | 0.0823 | 0.2848 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.6389 | 0.0833 | 0.2778 |
TAA | 0.8994 | 0.0831 | 0.0175 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.8875 | 0.0938 | 0.0187 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.9209 | 0.0612 | 0.0180 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0005 | 0.0000 | 0.6572 | 0.0556 | 0.2867 |
GTG | 0.0061 | 0.0000 | 0.0000 | 0.5521 | 0.0982 | 0.3436 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.5405 | 0.1081 | 0.3514 |
TAA | 0.9135 | 0.0707 | 0.0159 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.8984 | 0.0781 | 0.0234 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8946 | 0.0863 | 0.0192 | 0.0000 | 0.0000 | 0.0000 |
2. Helicobacter pylori 26695 chromosome, complete genome
Primary strand | ATG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|
ATG | 0.0000 | 0.0000 | 0.0000 | 0.5563 | 0.1688 | 0.2749 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.4750 | 0.2000 | 0.3250 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.5323 | 0.1935 | 0.2742 |
TAA | 0.8106 | 0.1103 | 0.0791 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.8120 | 0.0977 | 0.0902 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8224 | 0.0981 | 0.0794 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0000 | 0.0000 | 0.5852 | 0.1493 | 0.2655 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.4583 | 0.1944 | 0.3472 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 0.2500 | 0.2500 |
TAA | 0.8374 | 0.0879 | 0.0747 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.7692 | 0.1077 | 0.1231 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8349 | 0.0826 | 0.0826 | 0.0000 | 0.0000 | 0.0000 |
3. Staphylococcus aureus subsp. aureus MRSA252 chromosome, complete
Primary strand | ATG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|
ATG | 0.0000 | 0.0000 | 0.0000 | 0.7486 | 0.1434 | 0.1080 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.7184 | 0.1748 | 0.1068 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.7023 | 0.1221 | 0.1756 |
TAA | 0.8219 | 0.0780 | 0.1001 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.7880 | 0.0924 | 0.1196 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8231 | 0.0816 | 0.0952 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0000 | 0.0000 | 0.7276 | 0.1601 | 0.1123 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.7767 | 0.1262 | 0.0971 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.6460 | 0.1858 | 0.1681 |
TAA | 0.8483 | 0.0748 | 0.0768 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.8165 | 0.0872 | 0.0963 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8354 | 0.0633 | 0.1013 | 0.0000 | 0.0000 | 0.0000 |
4. Leptospira interrogans serovar Lai str. 56601 chromosome I
Primary strand | ATG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|
ATG | 0.0000 | 0.0000 | 0.0000 | 0.5717 | 0.1278 | 0.3004 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.6371 | 0.1694 | 0.1935 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.5338 | 0.1673 | 0.2989 |
TAA | 0.7892 | 0.0609 | 0.1499 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.7500 | 0.0968 | 0.1532 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.7646 | 0.0697 | 0.1657 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0000 | 0.0000 | 0.5846 | 0.1230 | 0.2923 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 0.1944 | 0.3056 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.5704 | 0.1480 | 0.2816 |
TAA | 0.7663 | 0.0598 | 0.1739 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.7299 | 0.0806 | 0.1896 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.7570 | 0.0774 | 0.1656 | 0.0000 | 0.0000 | 0.0000 |
5. Leptospira interrogans serovar Lai str. 56601 chromosome II
Primary strand | ATG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|
ATG | 0.0000 | 0.0000 | 0.0000 | 0.5372 | 0.1405 | 0.3223 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.6000 | 0.0000 | 0.4000 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.6207 | 0.1034 | 0.2759 |
TAA | 0.7303 | 0.0562 | 0.2135 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.8500 | 0.1000 | 0.0500 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.7647 | 0.0588 | 0.1765 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0000 | 0.0000 | 0.6186 | 0.1237 | 0.2577 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.1111 | 0.2222 | 0.6667 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.6667 | 0.1481 | 0.1852 |
TAA | 0.7089 | 0.0506 | 0.2405 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.6667 | 0.1111 | 0.2222 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8056 | 0.0833 | 0.1111 | 0.0000 | 0.0000 | 0.0000 |
6. Streptococcus pneumoniae ATCC 700669, complete genome
Primary strand | ATG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|
ATG | 0.0000 | 0.0000 | 0.0000 | 0.6375 | 0.2071 | 0.1554 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.6667 | 0.2308 | 0.1026 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.6383 | 0.2340 | 0.1277 |
TAA | 0.9049 | 0.0443 | 0.0508 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.9200 | 0.0300 | 0.0500 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.9172 | 0.0414 | 0.0414 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0000 | 0.0000 | 0.5979 | 0.2349 | 0.1672 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.6000 | 0.2400 | 0.1600 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.6750 | 0.2250 | 0.1000 |
TAA | 0.9100 | 0.0418 | 0.0482 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.9012 | 0.0782 | 0.0206 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.9412 | 0.0294 | 0.0294 | 0.0000 | 0.0000 | 0.0000 |
7. Bacillus subtilis subsp. spizizenii str. W23 chromosome, complete
Primary strand | ATG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|
ATG | 0.0000 | 0.0000 | 0.0000 | 0.6508 | 0.1247 | 0.2244 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.6051 | 0.1385 | 0.2564 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.5714 | 0.1389 | 0.2897 |
TAA | 0.7602 | 0.1047 | 0.1350 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.7683 | 0.1057 | 0.1260 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.7863 | 0.0903 | 0.1233 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0000 | 0.0000 | 0.6388 | 0.1370 | 0.2243 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.5870 | 0.1902 | 0.2228 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.6064 | 0.1596 | 0.2340 |
TAA | 0.7780 | 0.0913 | 0.1307 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.7864 | 0.0809 | 0.1327 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.7905 | 0.0747 | 0.1349 | 0.0000 | 0.0000 | 0.0000 |
8. Vibrio cholerae O1 str. 2010EL-1786 chromosome 1, complete
Primary strand | ATG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|
ATG | 0.0000 | 0.0000 | 0.0000 | 0.6291 | 0.1627 | 0.2083 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.4963 | 0.2296 | 0.2741 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.5128 | 0.2308 | 0.2564 |
TAA | 0.8508 | 0.0955 | 0.0537 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.8109 | 0.1134 | 0.0756 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8562 | 0.0936 | 0.0502 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0000 | 0.0000 | 0.6472 | 0.1689 | 0.1839 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.6000 | 0.2154 | 0.1846 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.5192 | 0.1731 | 0.3077 |
TAA | 0.8377 | 0.0916 | 0.0706 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.8387 | 0.1008 | 0.0605 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8297 | 0.0797 | 0.0906 | 0.0000 | 0.0000 | 0.0000 |
9. Vibrio cholerae O1 str. 2010EL-1786 chromosome 2, complete
Primary strand | ATG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|
ATG | 0.0000 | 0.0000 | 0.0000 | 0.6527 | 0.1715 | 0.1757 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.5789 | 0.2105 | 0.2105 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.3913 | 0.2609 | 0.3478 |
TAA | 0.8072 | 0.1019 | 0.0909 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.8491 | 0.0660 | 0.0849 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8482 | 0.1161 | 0.0357 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0000 | 0.0000 | 0.5891 | 0.1680 | 0.2429 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.3913 | 0.2174 | 0.3913 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.2653 | 0.2449 | 0.4898 |
TAA | 0.8480 | 0.0640 | 0.0880 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.8537 | 0.0366 | 0.1098 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8268 | 0.0315 | 0.1417 | 0.0000 | 0.0000 | 0.0000 |
10. Propionibacterium acnes TypeIA2 P.acn33 chromosome, complete
Primary strand | ATG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|
ATG | 0.0000 | 0.0000 | 0.0000 | 0.1058 | 0.0579 | 0.8363 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.1199 | 0.0959 | 0.7842 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.2581 | 0.0323 | 0.7097 |
TAA | 0.6850 | 0.2913 | 0.0236 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.5467 | 0.4000 | 0.0533 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.7279 | 0.2459 | 0.0262 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0000 | 0.0000 | 0.1098 | 0.0600 | 0.8301 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.1174 | 0.0772 | 0.8054 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.1143 | 0.1429 | 0.7429 |
TAA | 0.6880 | 0.2640 | 0.0480 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.6133 | 0.3333 | 0.0533 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.7107 | 0.2620 | 0.0273 | 0.0000 | 0.0000 | 0.0000 |
11. Salmonella enterica subsp. enterica serovar Typhi str. P-stx-12
Primary strand | ATG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|
ATG | 0.0000 | 0.0000 | 0.0000 | 0.5693 | 0.1013 | 0.3294 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.4891 | 0.1397 | 0.3712 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.5649 | 0.0992 | 0.3359 |
TAA | 0.8540 | 0.0860 | 0.0600 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.8413 | 0.1151 | 0.0437 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8466 | 0.1047 | 0.0486 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0000 | 0.0000 | 0.6074 | 0.1017 | 0.2909 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.6144 | 0.0932 | 0.2924 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.5594 | 0.1189 | 0.3217 |
TAA | 0.8374 | 0.1055 | 0.0571 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.8283 | 0.0944 | 0.0773 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8299 | 0.1015 | 0.0687 | 0.0000 | 0.0000 | 0.0000 |
12. Yersinia pestis D182038 chromosome, complete genome
Primary strand | ATG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|
ATG | 0.0000 | 0.0000 | 0.0000 | 0.5631 | 0.1409 | 0.2960 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.5172 | 0.1724 | 0.3103 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.6230 | 0.1311 | 0.2459 |
TAA | 0.8383 | 0.0881 | 0.0736 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.7886 | 0.1220 | 0.0894 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8254 | 0.1171 | 0.0575 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0000 | 0.0000 | 0.5433 | 0.1508 | 0.3059 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.5222 | 0.1556 | 0.3222 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.7218 | 0.0752 | 0.2030 |
TAA | 0.8379 | 0.0825 | 0.0796 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.8489 | 0.1043 | 0.0468 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.8252 | 0.1119 | 0.0629 | 0.0000 | 0.0000 | 0.0000 |
13. Mycobacterium tuberculosis 7199-99 complete genome
Primary strand | ATG | CTG | GTG | TTG | TAA | TAG | TGA |
---|---|---|---|---|---|---|---|
ATG | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.1513 | 0.3059 | 0.5428 |
CTG | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.1429 | 0.8571 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.1612 | 0.2701 | 0.5687 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.2091 | 0.2455 | 0.5455 |
TAA | 0.6254 | 0.0032 | 0.3206 | 0.0508 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.5852 | 0.0069 | 0.3546 | 0.0534 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.6134 | 0.0018 | 0.3279 | 0.0569 | 0.0000 | 0.0000 | 0.0000 |
Complementary strand | ATG | CTG | GTG | TTG | TAA | TAG | TGA |
ATG | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.1672 | 0.3045 | 0.5283 |
CTG | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.1250 | 0.1250 | 0.7500 |
GTG | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.1322 | 0.3098 | 0.5580 |
TTG | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.1333 | 0.2667 | 0.6000 |
TAA | 0.6414 | 0.0066 | 0.3059 | 0.0461 | 0.0000 | 0.0000 | 0.0000 |
TAG | 0.6030 | 0.0066 | 0.3355 | 0.0548 | 0.0000 | 0.0000 | 0.0000 |
TGA | 0.5991 | 0.0019 | 0.3591 | 0.0400 | 0.0000 | 0.0000 | 0.0000 |
©2014 by De Gruyter