Article
Search for Dispersed Repeats in Bacterial Genomes Using an
Iterative Procedure
Eugene Korotkov 1,*, Yulia Suvorova 1, Dimitry Kostenko 1 and Maria Korotkova 2
Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Bld. 2,
33 Leninsky Ave., 119071 Moscow, Russia; suvorovay@gmail.com (Y.S.); dk0stenko@yandex.ru (D.K.)
2 Moscow Engineering Physics Institute, National Research Nuclear University MEPhI,
31 Kashirskoye Shosse, 115409 Moscow, Russia; bioinf@rambler.ru
* Correspondence: katrin2@biengi.ac.ru; Tel.: +7-926-724-8271
1
Abstract: We have developed a de novo method for the identification of dispersed repeats based on
the use of random position-weight matrices (PWMs) and an iterative procedure (IP). The created
algorithm (IP method) allows detection of dispersed repeats for which the average number of
substitutions between any two repeats per nucleotide (x) is less than or equal to 1.5. We have shown
that all previously developed methods and algorithms (RED, RECON, and some others) can only
find dispersed repeats for x ≤ 1.0. We applied the IP method to find dispersed repeats in the genomes
of E. coli and nine other bacterial species. We identify three families of approximately 1.09 × 106, 0.64
× 106, and 0.58 × 106 DNA bases, respectively, constituting almost 50% of the complete E. coli genome.
The length of the repeats is in the range of 400 to 600 bp. Other analyzed bacterial genomes contain
one to three families of dispersed repeats with a total number of 103 to 6 × 103 copies. The existence
of such highly divergent repeats could be associated with the presence of a single-type triplet
periodicity in various genes or with the packing of bacterial DNA into a nucleoid.
Keywords: dispersed repeats; bacteria; genome; dynamic programming; iteration
Citation: Korotkov, E.; Suvorova, Y.;
Kostenko, D.; Korotkova, M. Search
for Dispersed Repeats in Bacterial
Genomes Using an Iterative
Procedure. Int. J. Mol. Sci. 2023, 24,
10964. https://doi.org/10.3390/
ijms241310964
Academic Editors: Zhiping Liu, Han
Zhang and Junwei Han
Received: 11 June 2023
Revised: 27 June 2023
Accepted: 28 June 2023
Published: 30 June 2023
Copyright: © 2023 by the authors.
Licensee MDPI, Basel, Switzerland.
This article is an open access article
distributed under the terms and
conditions of the Creative Commons
Attribution
(CC
BY)
license
(https://creativecommons.org/license
s/by/4.0/).
1. Introduction
Dispersed repetitive DNA sequences constitute a significant portion of existing
genomes. In the human genome, more than one third is occupied by dispersed repeats,
which are primarily copies of transposable elements [1], whereas in other organisms, the
proportion could be much higher [2,3]. Accurate localization of dispersed repeats in the
sequenced genome can help determine the functional significance and evolutionary origin
of genomic sequences. Considering that the genomes of many eukaryotic organisms have
already been sequenced and those of all living organisms are expected to be sequenced in
the next decade according to the Earth BioGenome Project [4], the search for dispersed
repeats in various genomes emerges as an important task of bioinformatics.
At present, many mathematical methods and algorithms have been developed for
the identification of dispersed repeats [5]. Most of the programs contain two independent
parts: the first (A) generates a dispersed repeat sequence, whereas the second (B) searches
the genome under study for sequences similar to those generated in part A. Part A can be
of two types: A1 uses a library of already published dispersed repeats, whereas A2 uses
dispersed repeats created de novo by various mathematical methods. The dispersed
repeats of the library or those created de novo are used in part B to search for similar
sequences in the genome.
Programs such as RepeatMasker [6], Censor [7], and MaskerAid [8] are A1-based.
The mathematical methods implemented in these programs detect repeats in the genomes
by comparison with Repbase [9], a database of repetitive sequences. Thus, the detection
of new repeats by these methods is limited to the ones existing in the library.
Int. J. Mol. Sci. 2023, 24, 10964. https://doi.org/10.3390/ijms241310964
www.mdpi.com/journal/ijms
Int. J. Mol. Sci. 2023, 24, 10964
2 of 17
Programs such as RED [10], Recon [11], PILER [12], RepeatScout [13], and
RepeatFinder with REPUter [14] are A2 based (reviewed in [5]). These methods create de
novo sequences of dispersed repeats using similarity search, word counting, and
signature-based approaches. The effectiveness of the A2-based methods in the discovery
of new families of dispersed repeats and/or analysis of new genomes is superior to that of
A1-based methods.
Many similar search programs such as BLAST [15], FASTA [16], MEGA [17], and
nHMMER [18] are based on part B. The use of nHMMER allows the search for dispersed
repeats by multiple alignment, which seems to be preferable to searches organized by
consensus sequence or individual repeats. COFFEE [19] or MUSCLE [20] can be used to
construct multiple alignments for nHMMER as these methods allow calculation of
statistically significant multiple alignments with up to x < 2.4, where x is the average
number of substitutions per nucleotide when comparing dispersed repeats of the same
family [21]. However, for this purpose it is better to use MAHDS [21] which allows for the
discovery of a statistically significant multiple alignment with x < 4.4, thus providing
detection of more divergent copies of dispersed repeat families.
According to functional significance and structural features, all existing dispersed
repeats identified in eukaryotic genomes can be roughly divided into five classes: short
interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs), long
terminal repeats (LTRs), DNA transposons, and others [1]. Prokaryotic genomes, which
are composed mainly of coding sequences, contain significantly fewer dispersed repeats.
Thus, in the E. coli genome the average distance between genes is about 118 nucleotides
[22] and the proportion of dispersed repeats is only 0.7%. Palindromic sequences with a
length of 40 bp (called REP, BIME, or PU) represent the largest class of repeats in the E.
coli genome. The longest repeated sequences in E. coli K-12 are five Rhs elements ranging
from 5.7 to 9.6 kb [22]. The E. coli K-12 genome also contains small transposable elements
called insertion sequences (ISs). However, dispersed repeats with a length of more than
300 bases and a large number of copies have not been detected in the E. coli genome.
An important question arises as to whether it is possible to identify all dispersed
repeats existing in eukaryotic or prokaryotic genomes. The problem is that repeats can
accumulate a large number of mutations, such as base substitutions or insertions and
deletions (indels). As a result, the similarity between repeats from the same family can
become extremely weak and such repeats can be missed. Therefore, effective methods to
identify highly divergent repeats in the genome should be developed.
The search for dispersed repeats can be affected by the limitations of the methods
based on part A. The performance of the A1-based methods obviously depends on the
volume of created repeat libraries, which in turn is determined by the data obtained by
experimental or theoretical studies, while the effectiveness of the A2-based methods is
limited by the possibility to find dispersed repeats de novo. Let us consider a simple
example of searching for any two repeats of length L1 in a sequence of length L. To perform
this task, we select two windows of length L1 in the sequence of length L and compare
their sequences. The first window runs from the beginning to the end of sequence L, and
the second runs from the end of the first sequence to the end of sequence L. As a result,
the number of comparisons between the two sequences (N) is proportional to L2/2. In this
case, the probability of detecting random similarity between two sequences of length L1
(p) should be no more than k/N ≈ 0.01. To exclude noise in the form of random pairwise
similarities, we take k = 0.01, which means that the probability of finding random
similarity is 1%. From probability p, we can estimate the level of statistically significant
similarity. Let us first calculate the argument of normal distribution Z0 for which P(Z > Z0)
= p. Based on Z0, we can estimate the number of identical bases s0 for which the similarity
of two sequences of length L1 is considered statistically insignificant (or noise):
s0 =
s + z0 L1t (1 − t ) , where t is the probability of random similarity between two
nucleotides taken as 0.25. In the case of a bacterial genome, L = 4 × 106. If repeat length L1
Int. J. Mol. Sci. 2023, 24, 10964
3 of 17
is 300, then probability p is approximately 10−15, which corresponds to Z0 ≈ 8.0; thus, s0 =
135, which corresponds to 45% similarity and x ~ 0.55 (formula 11 in [23]). This means that
if a family of dispersed repeats has accumulated many mutations and their similarity is
less than 45%, it is very difficult to detect such a family de novo in a 4 × 106-base long
sequence by A1 methods. A similar x value is shown in [5]. It should be noted that, in
reality, x may be smaller because in addition to base substitutions, indels could be present
in dispersed repeats. The use of word counting and signature-based methodologies to
search for dispersed repeats cannot significantly improve the situation, because at x = 0.75
sequence similarity is already at the level of random noise (~25%) and word frequencies
may differ insignificantly from those expected for random sequences.
In this study, we showed that the known de novo methods could find dispersed
repeats with x ≤ 1.00, whereas part B-based methods such as nHMMER could do the same
with x ≤ 3.0 using a previously created multiple alignment, indicating that the main
constraints on the identification of dispersed repeats are related to part A. Thus, a family
of unknown dispersed repeats in the genome which has accumulated a large number of
base substitutions and indels (x > 1.0) is unlikely to be detected by A2-based methods.
Consequently, the application of part B-based methods is impossible because there is no
sequence, consensus, or multiple alignment to search for the dispersed repeats of such a
family in the genome. Therefore, highly divergent repeat families (x > 1.0), which could
be present in already sequenced genomes, cannot be detected because of the limitations
of A2-based methods.
Here, we report a method for identifying dispersed repeats based on the use of
random positional weight matrices (PWMs) and an iterative procedure (IP). The
developed algorithm allows for the detection of dispersed repeats with x ≤ 1.5, which
significantly exceeds the capacity of all modern methods based on part A2 (x ≤ 1.0). We
applied the new IP method to search for dispersed repeats in the genome of E. coli and
nine other bacterial species and showed that the bacterial genomes contained families of
dispersed repeats with copy numbers >103 and lengths of 400–600 bases.
We chose bacterial genomes for this analysis because they lack long dispersed repeats
(>300 b.p.) with copy numbers greater than 103 and therefore the results obtained cannot
be related to known families of dispersed repeats. Additionally, bacterial genomes are
significantly shorter than eukaryotic genomes and their analysis does not require very
high computational power. At the same time, weakly similar dispersed repeats in bacterial
repeats may be present due to the stacking of bacterial DNA in the nucleoid [24]. Possible
functional significance and the origin of the dispersed repeats are discussed.
2. Results
2.1. Using Model Sequences for the IP Method
We generated a random sequence Stest with a length of 4 × 106 bases and randomly
inserted it into sequences from set Q(x), which contained 103 sequences with x
substitutions relative to each other. Overall, 500 sequences from set Q(x) were introduced
into sequence Stest in the forward orientation and 500 in the reverse orientation. Then, the
IP method was applied to identify the repeats. The results shown in Table 1 (where
columns 1–4 show the number of repeats found in the first four families) indicate that the
repeats were detected in the first and second families (columns 1 and 2), whereas the other
two families (columns 3 and 4) contained sequences combined randomly. The level of
random noise was 145 ± 35 sequences in all four generated repeat families. The results in
Table 1 indicate that IP clearly separated forward and reverse sequences into two families
and consistently detected two repeat families with x up to 1.5.
Int. J. Mol. Sci. 2023, 24, 10964
4 of 17
Table 1. Numbers of repeats identified in the four created repeat families using the IP method.
x
1
2
3
4
0
508
490
51
71
0.5
503
502
55
62
0.75
507
496
95
75
1.0
505
502
102
113
1.25
506
501
85
92
1.5
501
483
112
101
1.75
166
152
139
124
2.0
125
144
138
132
4.0
114
127
132
90
A model sequence of 4 × 106 bases containing 500 repeats in the forward and 500 repeats in the
reversed orientation was used; x is the average number of substitutions per nucleotide between
family members.
The search for sequences such as the PWM (Section 4.3) was performed in only one
direction. Therefore, forward and reverse repeats could create two separate families; as a
result, to conclusively determine the number of the identified families, we needed to check
for possible association of the two PWM families into a single family by taking into
account inversion and complementarity. Such analysis indicated that the similarity of the
PWMs of the three dispersed repeat families was statistically insignificant, indicating that
the identified families cannot be combined. Below (Section 2.3), we show that the reason
for this is the association of the families of dispersed repeat families found in the E. coli
genome with triplet periodicity, which in different families is not similar with regard to
inversion and complementarity.
2.2. Comparison of the IP Method with RED, RECON, RepeatMasker, BLAST, and nHMMER
A previous study has indicated that the known methods of de novo search for
dispersed repeats cannot detect them if the number of mutations in the dispersed repeats
that originate from a common ancestor exceeds 25–30 [5], which corresponds to x ≤ 0.6. Our
calculations (Section 1) showed that by using pairwise comparison, such a search is possible
at x ≤ 0.55, which roughly corresponds to the result obtained in [5]. We chose RED (which
uses K-mers) [10] as one of the popular algorithms for finding repeats and applied it to
identify 103 artificial repeats of 600 bases randomly scattered across a sequence of 4 × 106
bases. A total of nine sequences were created, for which dispersed repeats had different
numbers of mutations relative to each other, i.e., x = 0–2.0. Figure 1 shows that RED found
100%, 40%, and practically 0% of dispersed repeats with x ≤ 1.0, x = 1.25, and x ≥ 1.5,
respectively, indicating that RED could reliably identify dispersed repeats only with x ≤ 1.0.
At the same time, the application of the IP method to the nine artificial sequences resulted
in the detection of dispersed repeats with x ≤ 1.5 (Figure 1), which is a better result than that
of RED.
We also examined the performance of RECON [11] in finding families of dispersed
repeats with different degrees of divergence. The results indicate that, similar to RED,
RECON could find dispersed repeats only with x ≤ 1.0. However, at x = 1.00, RECON,
instead of finding just one repeat family, detected 108 families containing an overall number
of 825 repeats, which is an incorrect result. Therefore, RECON could detect one family of
repeats only with x ≤ 0.8.
Int. J. Mol. Sci. 2023, 24, 10964
5 of 17
Figure 1. Comparative performance of the IP and RED methods in search for dispersed repeats in
artificial sequences. The search was performed in an artificial sequence of 4 × 106 bases containing 103
repeats (each of 600 bases but with different x), which were randomly inserted from set Q(x) containing
repeats with the same number of random mutations relative to original sequence Sm as well as indels
at random positions. Black and white circles indicate the IP and RED methods, respectively; x is the
average number of substitutions per nucleotide and N is the number of repeats identified.
In a previously study, the same analysis was performed using the RepeatMasker
program, and the results indicate that it finds almost 100% of dispersed repeats for x ≤ 0.5
but 25% and less than 5% for x = 0.75 and for x = 1.0 [25].
Many of the de novo methods mentioned in [5] use the BLAST program to compare
sequences with themselves and then perform the assembly of dispersed repeats from the
found similarities. Therefore, it was interesting to analyze the capability of BLAST to
search for dispersed repeats in the same artificial sequence S(x) (4 × 106 bases) carrying 103
randomly inserted 600 base repeats from set Q(x), which had the same number of random
mutations relative to original sequence Sm, two indels at random positions, and x
substitutions per nucleotide between repeats. The word_size parameter in BLAST was
chosen to be 4, which gave the best result. To evaluate the effectiveness of BLAST, we
randomly created 100 sequences of S(x) containing repeats from different Q(x) (E_value
was chosen to be 100). E-value is the number of expected hits of similar score that could
be found just by chance. Table 2 shows the numbers of average repeats per sequence S(x)
for each x. The results indicate that BLAST could find pairwise similarity for x ≤ 1.0 but
failed to do so for x > 1.0.
Table 2. Numbers of dispersed repeats identified by the BLAST, nHMMER, and IP methods.
x
0.1
0.25
0.5
0.75
1.0
1.25
1.5
2.0
2.5
3.0
4.0
20.0
BLAST
1000 1000 1001 1000 1000
2.9
1.4
1.1
2.0
0.2
1.1
1.2
nHMMER 1004 1002 1004 1006 1002 1002 1004 1003 992 668
21
0
IP
1068 1006 1003 1002 1005 1002 1004 1003 1065 907 221
80
Model sequence S(x) of 4 × 106 bases containing 103 dispersed repeats was used; x is the average
number of base substitutions per nucleotide between two dispersed repeats.
Thus, previous findings with the de novo A2-based methods [5] and the results of the
present analysis suggest that the currently used methods can identify dispersed repeats
with x ≤ 1.0 but skip those with x > 1.0. In contrast, the IP method can identify repeats with
x ≤ 1.5, i.e., those missed by the other methods.
Int. J. Mol. Sci. 2023, 24, 10964
6 of 17
We also compared the performance of the IP method with that of nHMMER, which
is one of the best part B-based methods to find dispersed repeats with already known
multiple alignments. In this test, we aligned all 103 sequences from set Q(x), and since the
placement of indels in these sequences with respect to Sm was known, it was not difficult
to construct a multiple alignment, which was used by nHMMER to create a hidden
Markov model and search for dispersed repeats in sequence S(x).
We also searched for dispersed repeats in sequence S(x) using the IP method. To
correctly compare the IP method with nHMMER in search for repeats with known
multiple alignments, we used Sm as a sequence from the library to create a PWM (Figure
2, step 4), when indices i and j were limited to 1 (i.e., one cycle of dispersed repeat search).
Figure 2. Diagram of the IP algorithm used in this study to search for de novo dispersed repeat.
The created PWM had a length of 600 bases and 16 rows and was filled in as:
PWM(n,i) = PWM(n,i) + 1 for all i from 2 to 600 (here, n = sm(i−1) + 4sm(i) and i is the column
number). Then, for i = 1 n = sm(600) + 4sm(1).
Int. J. Mol. Sci. 2023, 24, 10964
7 of 17
The results shown in Table 2 indicate that the IP method, similar to nHMMER, could
find dispersed repeats using a known alignment. Both methods perform reliably up to x
≤ 3.0; however, the IP method was slightly more efficient at x = 3.0.
2.3. Search for Dispersed Repeats in the E. coli Genome Using the IP Method
The escherichia_coli_str_k_12sbstr_mg1655_gca_000005845.ASM584v2.49 sequence
was obtained from http://bacteria.ensembl.org/index.html/, accessed on 1 January 2023.
To search for dispersed repeats in the E. coli genome with the IP method, we used length
L1 = 600 bases (Section 4.1). The results shown in Figure 3 indicate that, for a random
sequence, the IP method generated families of 145 ± 35 repeats, because the iterative
algorithm (Figure 2) can always capture a certain number of sequences and build PWMs.
In the case of the E. coli genome, the respective numbers of dispersed repeats found in the
three families were 2239, 1170, and 1024. The volume of the other repeat families was close
to random, and the probability of finding families of such volume in a random sequence
of the same length as the genome is extremely low. The coordinates of the found repeat
families and their alignment with the PWM (sequence S2, Sections 4.3 and 4.5) are shown
in Supplementary Materials in additional files fam1.txt, fam2.txt, and fam3.txt, and the
PWMs created for these families are shown in files pwm1.txt, pwm2.txt, and pwm3.txt.
3000
2500
2000
1500
N
1000
500
0
0
5
10
15
20
Group number
Figure 3. The number of repeats in the groups created for the E. coli genome. Black circles on a
continuous
curve
indicate
the
groups
for
the
escherichia_coli_str_k_12_substr_mg1655_gca_000005845 genome; white circles indicate the size of
the groups created for the same genome, in which the codons in coding sequences are mixed
randomly; black circles on a discontinuous curve indicate the size of the groups created for a random
sequence of the same length as the escherichia_coli_str_k_12_substr_mg1655_gca_000005845
genome. N is the number of repeats identified.
We also constructed an artificial sequence based on the E. coli genome, where the
codons in each gene were randomly shuffled to ensure that any similarity of the coding
sequences was absent but where the triplet periodicity was preserved because the first,
second, and third codon positions did not change [26]. In Figure 3, the volume of the
created families for this sequence is indicated by white circles. The results show that if the
codons in the genes were conserved, dispersed repeat families of sufficiently large
Int. J. Mol. Sci. 2023, 24, 10964
8 of 17
volumes could still be created, indicating that the identified repeat families were
associated with the triplet periodicity of coding sequences [27].
Next, we created an artificial sequence containing the E. coli genome in which all noncoding sequences were randomly mixed and used it to search for dispersed repeat families
with the IP method. Figure 4 shows that, in this case, we could still find families of
dispersed repeats, indicating that the non-coding regions in the E. coli genome do not
contain a significant number of dispersed repeats. This result was verified by calculating
the proportion of non-coding regions in dispersed repeats of families 1, 2, and 3 (Table 3),
which confirmed that most of the found repeat families were not associated with noncoding sequences.
1800
1600
1400
1200
1000
N
800
600
400
200
0
0
5
10
15
20
25
Group number
Figure 4. The number of repeats in the coding sequences of the E. coli genome. Black circles indicate
groups created for the escherichia_coli_str_k_12_substr_mg1655_gca_000005845 genome, in which
all non-coding sequences were randomly mixed; white circles indicate the size of the groups created
for the randomly mixed genome. N is the number of repeats identified.
Next, we analyzed the length distribution of the repeats in each of the three families
(Figure 5). The repeats of the first family (485 bases) were slightly shorter than those of the
second and third families (548 and 564 bases, respectively).
Table 3. Distribution of the found repeats from families 1, 2, and 3 (Figure 3) according to the
proportion of non-coding regions.
Families
1
2
3
Proportion of Non-Coding Sequences
0.0–0.1 0.1–0.2 0.2–0.3 0.3–0.4 0.4–0.5 0.5–0.6 0.6–0.7 0.7–0.8 0.8–0.9 0.9–1.0
1956
129
58
36
22
7
3
2
1
25
918
97
60
32
13
11
4
4
3
28
709
116
72
43
29
15
8
2
3
27
Int. J. Mol. Sci. 2023, 24, 10964
9 of 17
800
600
400
N
200
0
350
400
450
500
550
600
650
Length
Figure
5.
Length
distribution
of
dispersed
repeats
found
in
the
escherichia_coli_str_k_12_substr_mg1655_gca_000005845 genome. Repeats of families 1, 2, and 3
are indicated by black circles on a continuous line, white circles, and black circles on a discontinuous
line, respectively.
2.4. Triplet Periodicity of Dispersed Repeat Families in the E. coli Genome
Next, we investigated the triplet periodicity in the sequences of the three found
repeat families. For this, we filled in matrix M(3,4) for each repeat sequence: s1(i)) =
M(f(s2(i)) + 1, s1(i)) + 1 for i from 1 to L, where L is the repeat length, s1(i) is a sequence
element of the found repeat (sequence S1, Section 4.3), and s2(i) is a column sequence
element of the PWM (sequence S2, Section 4.3). We calculated function f(s2(i)) =
s2(i)−3int((s2(i)−0.1)/3.0) and found that it was 1, 2, or 3. If s1(i) or s2(i) were equal to zero
(deletion), then 1 was not added to M(f(s2(i)) but added to i. After determining matrix M,
we calculated mutual information I as:
=
I
3
4
3
4
∑∑ m(i, j ) ln m(i, j ) − ∑ X (i) ln X (i) − ∑ Y ( j ) ln Y ( j ) + L ln L
=i 1 =j 1
=i 1
=j 1
4
3
j =1
i =1
(1)
where m(i,j) is an element of matrix M(3,4), X (i ) = ∑ m(i, j ) , Y ( j ) = ∑ m(i, j ) , and
3
4
L = ∑∑ m(i, j ) .
=i 1 =j 1
Then, we calculated the argument of normal distribution Z = (4I)0.5 − (11.0)0.5 for each
sequence in the repeat family and plotted Z distribution for all family members. Figure 6
shows the distribution for the first repeat family. For randomly shuffled sequences, the Z
distribution was close to normal (black circles), but for repeat sequences without alignment
with the PWM, the distribution was markedly shifted to the right (white circles), indicating
that the coding sequences had triplet periodicity [26]. The Z distribution for sequences
aligned with the PWM was shifted even more to the right compared with the two
distributions mentioned above (black circles on a dashed line; Figure 6), indicating that the
alignment of repeats with the PWM results in a clearer triplet periodicity.
Int. J. Mol. Sci. 2023, 24, 10964
10 of 17
Figure 6. Distribution of the dispersed repeats from the first group according to the level of triplet
periodicity. Here, X is the argument of normal distribution indicating the level of statistical
significance of triplet periodicity. Black circles on a continuous line show triplet periodicity
calculated for randomly mixed sequences from the first group; white circles show triplet periodicity
for the first group sequences found in the escherichia_coli_str_k_12_substr_mg1655_gca_000005845
genome without alignment with the PWM for this family (sequences without indels as they are in
the genome); black circles on a dashed line show the level of triplet periodicity in the sequences that
are a part of the alignments with the first group PWM (sequences with indels).
We combined all matrices M from all sequences for each family into one matrix, created
matrices M1, M2, and M3, and converted each matrix element into a normal distribution
argument using normal approximation for binomial distribution. For this, we calculated
partial sums of X(i) and Y(j) as we did in Figure 6 and determined probabilities p(i,j) =
X(i)Y(j)/L2 and the argument of normal distribution zk(i,j) = {mk(i,j)-Lp(i,j)}/{Lp(i,j)(1.0-p(i,j)}0.5
(where k indicates the number of the dispersed repeat family). The resulting matrices are
shown in Table 4; column numbers are s2(i)mod(3), where s2(i) is the column sequence
element of the PWM (sequence S2, Section 4.2). It should be noted that the column numbers
in matrices M1, M2, and M3 are unrelated to the reading frame in the genes because there are
indels in sequences S1 and S2. From the matrices, we could conclude that the nucleotides
were distributed extremely unevenly across the matrix positions. Thus, the first group of
dispersed repeats contained more than expected nucleotides C and G in the first position, C
in the second position, and A and T in the third position.
Table 4. Matrices with dimensions 3 × 4 (A, B and C) which contain the normal distribution argument
obtained by normal approximation of binomial distribution for each cell of matrices M1, M2 and M3.
A
B
C
DNA
Bases
1
2
3
1
2
3
1
2
3
A
−48.7
−3.5
52.2
9.2
22.8
−32.0
24.7
3.9
−28.6
T
−9.8
−27.4
37.3
−40.4
7.7
32.7
−19.1
28.0
−8.8
C
17.2
37.1
−54.4
−20.4
−22.1
42.6
−6.8
−18.7
25.6
G
37.2
−8.6
−28.5
49.3
−7.13
−42.2
1.4
−12.1
10.7
Columns represent positions in a period of three bases; A, B, and C are matrices for sequences from
the first, second, and third groups of dispersed repeats obtained for the
escherichia_coli_str_k_12_substr_mg1655_gca_000005845 genome.
Int. J. Mol. Sci. 2023, 24, 10964
11 of 17
2.5. Comparison with Nucleoid-Associated Protein-Binding Sites
It is known that the spatial structure of bacterial genomes, including that of E. coli, is
maintained by so-called nucleoid-associated proteins (NAPs). By binding to DNA, these
proteins help stabilize and compact DNA and can also play regulatory functions. Several
such proteins, with different properties and DNA specificity, are known currently [24].
Experimental binding maps have been constructed for some NAPs using chip-seq
methods. It is possible that the DNA regions with remote similarity identified in this study
may be the binding sites for various NAPs. To test this hypothesis, we compared the
intervals found here with the binding sites of some NAPs (such as Fis, H-NS, and Ihf)
whose binding site coordinates were obtained from [28,29]. The intersection of these
coordinates was determined using the bedtools program [30]. The results reveal that there
was no statistical difference between the numbers of intersections of NAP-binding sites
with the found repeat families and with randomly located dispersed repeats of these
families, indicating the absence of a statistically significant intersection of the found
intervals with NAP-binding sites.
2.6. Search for the Families of Dispersed Repeats in the Genomes of Other Bacteria
To confirm that dispersed repeat families exist not only in E. coli but also in other
bacteria, we applied the IP method to search for dispersed repeats in the genomes of the
following bacterial species: Azotobacter vinelandii, Bacillus subtilis, Clostridium tetani,
Methylococcus capsulatus, Mycobacterium tuberculosis, Shigella sonnei, Treponema pallidum,
and
Yersinia
pestis
(genome
sequences
were
obtained
from
http://bacteria.ensembl.org/index.html). The results shown in Table 5 indicate that in all
the analyzed bacterial genomes, 1–2 repeat families could be detected at a statistically
significant level. The least number of repeats were identified in the genome of T. pallidum,
which may be due to its small size.
Table 5. Sizes of the first four groups dispersed repeats found in nine bacterial genomes.
Bacteria
Azotobacter vinelandii
Bacillus subtilis
Clostridium tetani
Methylococcus capsulatus
Mycobacterium tuberculosis
Shigella sonnei
Treponema pallidum
Xanthomonas campestris
Yersinia pestis
1
4565
2563
1605
2489
3343
2606
590
4622
1953
2
1357
768
640
375
1152
645
273
1348
43
3
322
340
168
280
299
519
83
359
35
4
178
305
111
95
103
358
46
75
43
Genome Size
5.3 × 106
4.2 × 106
2.8 × 106
3.3 × 106
4.4 × 106
5.0 × 106
1.1 × 106
5.1 × 106
4.8 × 106
3. Discussion
In this study, we developed a new IP method and applied it to search for the families
of dispersed repeats in the E. coli genome. As a result, we could identify three respective
families of approximately 1.09 × 106, 0.64 × 106, and 0.58 × 106 DNA bases (2.3 × 106 bases
in total), constituting almost 50% of the complete E. coli genome. Such extensive repeat
families could not be detected in the E. coli genome via the RED, RECON, or
Repeat_masker programs, but could be detected via the IP method, which could find de
novo repeat families with x ≤ 1.5, whereas all other programs found them with x ≤ 1.0.
It should be noted that in search of the genomes containing 5 × 105–1.1 × 107 DNA
bases, the level of false positives in each family was 145 ± 35 repeats. Such level of noise is
due to the fact that, at the initial step of the iterative procedure, the random matrix can
always find weak similarities (Z > 3.0) with some sequences (Figure 2). After creating a
new PWM based on these similarities, the Z value for these sequences increases. In total,
Int. J. Mol. Sci. 2023, 24, 10964
12 of 17
the iterative procedure can randomly include about 145 sequences in the PWM for which
Z would be over 5.0.
A legitimate question arises regarding the origin of such highly divergent families of
repeats and their functional significance. Our results (Section 2.4) indicate that each repeat
has a similar triplet periodicity, which can account for the similarity of these sequences
and classify them as one family. The emergence of triplet periodicity is partially related to
the use of the same synonymous codons [31,32]. Therefore, the origin of repeat families
could be associated with gene segments in which the same synonymous codons are used.
In this case, dispersed repeats of the same family may be found in genes with similar
transcriptional activity [33], whereas those with different activities could contain distinct
families of dispersed repeats.
In Section 2.5 we found no correlation between the detected dispersed repeats and
the binding sites of some of the proteins involved in nucleoid formation. Despite this, we
cannot completely reject the assumption that the identified repeats contribute to DNA
stacking in the nucleoid. Dispersed repeats create a certain markup of the bacterial
genome that may contribute to the spatial self-organization of bacterial DNA.
Since dispersed repeats exist not only in the genome of E. coli but also in those of
many other bacterial species (Table 5), it is also possible that the detected families of
repeats could be involved in the creation of the liquid crystal structure within bacterial
DNA through interactions between repeats within a family [34–36].
The IP method can be used to search for dispersed repeats in any DNA sequences,
including those from eukaryotic organisms. The main limitation is that the analyzed
sequence must be longer than 5 × 105 bp. This limitation is due to the fact that the IP
method uses an iterative procedure, meaning that at smaller lengths it is not possible to
start because there will be no hits with Z > 3.0 (Figure 2). At the same time, for eukaryotic
chromosomes with lengths more than 2 × 107 bp the computation time could be too long.
Therefore, the present version of IP can be used to find dispersed repeats in parts of
eukaryotic chromosomes <2 × 107 bp. The dispersed repeats found could then be used in
nHMMER to search for IP-detected repeats in the whole genome.
The
IP
method
is
currently
available
on
the
server
at:
http://victoria.biengi.ac.ru/shddr, accessed on, which is open for use. The search time for
dispersed repeats in the E. coli genome was just over five days, and we plan to increase
the capacity of this computational system as the number of users grows. If necessary, we
will also increase the volume of the computer cluster and shorten the search time for
dispersed repeats in a prokaryotic genome to about an hour or less.
4. Materials and Methods
Figure 2 illustrates the algorithm used in this work to search for dispersed repeats de
novo. The algorithm is iterative and can be divided into six steps explained in detail
below.
4.1. Calculation of the Random Matrix
To search for a family of dispersed repeats in sequence S with length L, we created a
random PWM with 16 rows and L1 columns (step 1, Figure 2), in which L1 was the
maximum repeat length that could be identified in sequence S using local alignment. The
16
L1
created matrix was then transformed so that sum R 2 = ∑∑ pwm(i, j ) 2 had constant
=i 1 =j 1
value R02 for all matrices used below to find the similarity between the PWM and
sequence S (Section 4.2) (the procedure of matrix transformation is described in detail in
16
L1
[37]). For these matrices, sum K = ∑∑ pwm(i, j ) p1 (i ) p2 ( j ) was also kept equal to K0. In
=i 1 =j 1
this formula, pwm(i,j) is the element of the PWM on row i and column j, p2(j) = 1/L1 and
p1(i) = f(k)f(l), where f(k) and f(l) are the probabilities of encountering nucleotides of types
Int. J. Mol. Sci. 2023, 24, 10964
13 of 17
k and l, respectively, in the analyzed sequence S (Section 4.2) (k and l could be A, T, C, or
G and pair kl formed index i).
In the present study, we used L1 = 600, K0 = −1.0, and R02 = 300 L10.5 ; assuming that K0 =
−1.0 permits the accurate determination of the start and end points of the local alignment
[25] between the PWM and sequence S1 (Section 4.2). Thereafter, for local alignment we
used the PWM with only these parameters.
4.2. Calculation of Fmax and σ for the PWM
To calculate Fmax and σ, we randomly shuffled sequence S (step 2, Figure 2). After
choosing t = 1, in sequence S we selected a window (sequence S1) with the beginning at
point t and end at point t + L1 + 50. If letter N occurs in sequence S1, then 10 should be
added to t and sequence S1 should be created again.
Let Fmax be the maximum value of the similarity function after the local alignment
between the PWM and sequence S1, performed by taking into account the correlation of
neighboring bases in sequence S1 [38], and whose elements we have denoted as s1(i) for i
from 1 to L1. Briefly, we first recoded the entire sequence S1, in which DNA bases became
A = 1, T = 2, C = 3, and G = 4, and then created sequence S2, in which elements s2(j) = j for j
from 1 to L1 and which contained the column numbers of the PWM. Then, similarity
function F was calculated as:
0
F (i − 1, j − 1) + pwm(n, s ( j ))
2
F (i, j ) = max
F
i
j
pwm
n
s
j
−
−
+
(
1,
1)
(
,
(
))
2
x
Fy (i − 1, j − 1) + pwm(n, s2 ( j ))
(2)
F (i − 1, j ) − d
Fx (i, j ) = max
Fx (i − 1, j ) − e
(3)
F (i, j − 1) − d
Fy (i, j ) = max
Fy (i, j − 1) − e
(4)
Initial conditions were: F(0,0) = F(i,0) = F(0,i) = 0.0 and n = s1(k) + 4(s1(i)−1)), where I
and j each ranged from 2 to L1; for i = 1, n = s1(1) and for j = 1, n = s1(i). This choice for the
initial n values had little effect on the final alignment results. By considering variable n,
we took into account the correlation of neighboring nucleotides in sequence S1. To
calculate n, we should find previous position k, which had already been included in the
alignment and which had been calculated as previously described (Section 2.4 and
Equation (7) in [38]). Here, we used d = 35.0 and e = 3.5.
First, we calculated matrix F and its maximum value Fmax for t = 1 and then added 10
bases to t and repeated the calculation up to L-L1-49. As a result, we obtained vector
Fmax(t) and used it to calculate mean Fmax and σ for the used PWM. Together with matrix
F, we filled in the matrix of inverse transitions, where each cell (i,j) had the coordinates of
the cell or cells of matrix F from which we reached point (i,j). Then, we found coordinates
(imax,jmax) for Fmax and those for F(i0,j0) = 0 by backtracking. Thus, we obtained the local
alignment of the PWM with sequence S1 and its coordinates (i0, imax).
Int. J. Mol. Sci. 2023, 24, 10964
14 of 17
4.3. Search for Similarities to the PWM in Sequence S
In sequence S, which was searched for dispersed repeats, we determined vector
Fmax(t) (t = 1, 11, ..., L-L1-49) (step 3, Figure 2) using the PWM from Section 4.2. For each
point t in sequence S, we calculated the coordinates of the beginning and end of local
alignment y0(t) = imax and ymax (t) = imax and then searched for local maxima in vector Fmax(t),
which was found at position t if Fmax(t+i) < Fmax(t) for all i from t-L1-49 to t + L1 + 49. Next,
we selected the local maxima for which Z = (Fmax− Fmax )/σ ≥ 5.0 and denoted the number
of local maxima found in sequence S as Nlm. For all found local maxima, the average Z was
calculated as: Z =
k = Nlm
∑ Z (k ) / N
k =1
lm
.
Below, we show that the threshold value of Z > 5.0 provides about 6% of false
positives for the first family of dispersed repeats found in the E. coli genome by this
method, and that for the other two families the number of false positives was about 15%.
As a result, we constructed local alignment of sequences S1 and S2 (columnar sequence of
the PWM matrix) for each selected local maximum.
4.4. Creating a New PWM Based on the Found Similarities
Based on the obtained local maxima, we created a new PWM (Figure 2, step 4). For
this, we used all local alignments found near the local maxima selected in Section 4.3, all
of which had Z ≥ 5.0. These local alignments contained fragments of sequences S1 and S2
(Section 4.2); the former representing a nucleotide sequence in the numeric code and the
latter representing PWM column numbers. Using sequences S1 and S2, we filled in
frequency matrix M(16, 600) as L1 = 600 (Section 4.1). M(n,s2(i)) = M(n,s2(i)) + 1 for all i from
2 to L1 (n = s1(i−1) + 4s1(i)). Then, we calculated matrix of normal arguments M1(16,800) as:
M 1 ( i, j ) =
=
p ( i, j ) x ( i ) y ( j ) / ( L − 1) 2
where
M ( i, j ) − ( L − 1) p (i, j )
L1
,
x(i ) = ∑ M (i, j )
j =1
16
(5)
( N − 1) p (i, j )(1 − p (i, j ))
16
,
y ( j ) = ∑ M (i, j )
,
and
i =1
L1
N = ∑∑ M (i, j ) . After the transformation of matrix M1 as described in Section 4.1, we
=i 1 =j 1
obtained a PWM which could be used in Section 4.2.
4.5. Selection of the PWM to Find the Greatest Number of Similarities with Sequence S
To find a PWM(j) with the maximum value of Z , the procedures described in Sections
4.2–4.4 were repeated i times (i = 1–20; step 5, Figure 2). The aim of these iterations was to
find a PWM(i) with the maximum i value. The search was performed for i =1, 2, ..., 20,
denoted as imax. As a result of iterations i = 1, 2, ..., 20, we memorized PWM(j) = PWM(imax),
all alignments found for imax, their coordinates in sequences S1 and S2 (Section 4.3), and Z for
each alignment. Then, the procedures described in Sections 4.1–4.5 were repeated j times.
4.6. Creating a Family of Dispersed Repeats
The procedures in Sections 4.1–4.5 were repeated 50 times, which means that index j
varied from 1 to 50. Then, we chose the jmax at which the maximum value (imax) was obtained
(step 6, Figure 2) and obtained the first family of dispersed repeats. Thus, for each repeat
family, we created PWM(jmax) and all the alignments found for jmax, obtained their
coordinates in sequences S1 and S2 (Section 2.3), and determined Z for each alignment.
After creating the first family of repeats, we replaced the sequences of the found
repeats in S1 with N, repeated the calculations described in Sections 4.1–4.6, and
constructed the next family of dispersed repeats.
5. Conclusions
Int. J. Mol. Sci. 2023, 24, 10964
15 of 17
We have developed a new mathematical method that allows identification of
dispersed repeats with the average number of substitutions per nucleotide x ≤ 1.5, which
is higher than that for any currently existing program. We have shown that all previously
developed methods and algorithms (RED, RECON, and some others) can only find
dispersed repeats for x ≤ 1.0. The new IP method has made it possible to detect families of
dispersed repeats in bacterial genomes which have not been previously reported. We
identify three families of approximately 1.09 × 106, 0.64 × 106, and 0.58 × 106 DNA bases,
respectively, constituting almost 50% of the complete E. coli genome. The length of the
repeats is in the range of 400 to 600 bp. Other analyzed bacterial genomes contain one to
three families of dispersed repeats with a total number of 103 to 6 × 103 copies. The
existence of such highly divergent repeats could be associated with the presence of a
single-type triplet periodicity in various genes or with the packing of bacterial DNA into
a nucleoid. The method can also be applied for the search of dispersed repeats in
eukaryotic genomes. We have created a web site for the analysis of bacterial genomes,
where users can enter a genome sequence and obtain the result in a reasonable time.
Supplementary Materials: The supporting information
https://www.mdpi.com/article/10.3390/ijms241310964/s1.
can
be
downloaded
at:
Author Contributions: Conceptualization, E.K.; methodology, E.K.; software, D.K. and M.K.;
validation, Y.S.; formal analysis, Y.S.; investigation, E.K., Y.S. and D.K.; resources, D.K.; data
curation, D.K.; writing—original draft preparation, E.K.; writing—review and editing, E.K. and Y.S.;
visualization, D.K.; supervision, E.K. and M.K.; project administration, E.K.; funding acquisition,
Y.S. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding
Institutional Review Board Statement: Not applicable
Informed Consent Statement: Not applicable
Data Availability Statement: The search for dispersed repeats in bacterial genomes can be
performed on the website: http://victoria.biengi.ac.ru/shddr, accessed on 29 June 2023.
Conflicts of Interest: The authors declare no conflict of interest.
References
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
Smit, A.F.A. The origin of interspersed repeats in the human genome. Curr. Opin. Genet. Dev. 1996, 6, 743–748.
https://doi.org/10.1016/S0959-437X(96)80030-X.
Mayer, K.F.X.; Waugh, R.; Langridge, P.; Close, T.J.; Wise, R.P.; Graner, A.; Matsumoto, T.; Sato, K.; Schulman, A.; Ariyadasa,
R.; et al. A physical, genetic and functional sequence assembly of the barley genome. Nature 2012, 491, 711–716.
https://doi.org/10.1038/NATURE11543.
Meyer, A.; Schloissnig, S.; Franchini, P.; Du, K.; Woltering, J.M.; Irisarri, I.; Wong, W.Y.; Nowoshilow, S.; Kneitz, S.; Kawaguchi,
A.; et al. Giant lungfish genome elucidates the conquest of land by vertebrates. Nature 2021, 590, 284–289.
https://doi.org/10.1038/S41586-021-03198-8.
Gupta, P.K. Earth Biogenome Project: Present status and future plans: (Trends in Genetics 38:8 p: 811-820, 2022). Trends Genet.
2023, 39, 167. https://doi.org/10.1016/J.TIG.2022.08.001.
Storer, J.M.; Hubley, R.; Rosen, J.; Smit, A.F.A. Methodologies for the De novo Discovery of Transposable Element Families.
Genes 2022, 13, 709. https://doi.org/10.3390/GENES13040709.
Tempel, S. Using and understanding repeatMasker. Methods Mol. Biol. 2012, 859, 29–51. https://doi.org/10.1007/978-1-61779-6036_2.
Jurka, J.; Klonowski, P.; Dagman, V.; Pelton, P. CENSOR—A program for identification and elimination of repetitive elements
from DNA sequences. Comput. Chem. 1996, 20, 119–121.
Bedell, J.A.; Korf, I.; Gish, W. MaskerAid : A performance enhancement to RepeatMasker. Bioinformatics 2000, 16, 1040–1041.
https://doi.org/10.1093/bioinformatics/16.11.1040.
Bao, W.; Kojima, K.K.; Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 2015,
6, 11. https://doi.org/10.1186/s13100-015-0041-9.
Girgis, H.Z. Red: An intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale. BMC Bioinform. 2015,
16, 227.
Bao, Z.; Eddy, S.R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 2002, 12,
1269–1276. https://doi.org/10.1101/gr.88502.
Int. J. Mol. Sci. 2023, 24, 10964
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
33.
34.
35.
36.
16 of 17
Edgar, R.C.; Myers, E.W. PILER: Identification and classification of genomic repeats. Bioinformatics 2005, 21 (Suppl. S1), i152–
i158. https://doi.org/10.1093/BIOINFORMATICS/BTI1003.
Price, A.L.; Jones, N.C.; Pevzner, P.A. De novo identification of repeat families in large genomes. Bioinformatics 2005, 21 (Suppl.
S1), i351–i358. https://doi.org/10.1093/bioinformatics/bti1018.
Volfovsky, N.; Haas, B.J.; Salzberg, S.L. A clustering method for repeat analysis in DNA sequences. Genome Biol. 2001, 2, 0027.1.
https://doi.org/10.1186/GB-2001-2-8-RESEARCH0027.
Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.J. Gapped BLAST and PSI-BLAST: A
new
generation
of
protein
database
search
programs.
Nucleic
Acids
Res.
1997,
25,
3389–3402.
https://doi.org/10.1093/nar/25.17.3389.
Mount, D.W. Using a FASTA Sequence Database Similarity Search. CSH Protoc. 2007, 2007, pdb.top16.
Tamura, K.; Peterson, D.; Peterson, N.; Stecher, G.; Nei, M.; Kumar, S. MEGA5: Molecular evolutionary genetics analysis using
maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol. Biol. Evol. 2011, 28, 2731–2739.
https://doi.org/10.1093/MOLBEV/MSR121.
Wheeler, T.J.; Eddy, S.R. Nhmmer: DNA homology search with profile HMMs. Bioinformatics 2013, 29, 2487–2489.
https://doi.org/10.1093/bioinformatics/btt403.
Notredame, C.; Higgins, D.G.; Heringa, J. T-coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol.
Biol. 2000, 302, 205–217. https://doi.org/10.1006/JMBI.2000.4042.
Edgar, R.C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32, 1792–
1797. https://doi.org/10.1093/nar/gkh340.
Korotkov, E.V.; Suvorova, Y.M.; Kostenko, D.O.; Korotkova, M.A. Multiple alignment of promoter sequences from the
arabidopsis thaliana l. Genome. Genes 2021, 12, 135. https://doi.org/10.3390/genes12020135.
Blattner, F.R.; Plunkett, G.; Bloch, C.A.; Perna, N.T.; Burland, V.; Riley, M.; Collado-Vides, J.; Glasner, J.D.; Rode, C.K.; Mayhew,
G.F.; et al. The complete genome sequence of Escherichia coli K-12. Science 1997, 277, 1453–1462.
https://doi.org/10.1126/SCIENCE.277.5331.1453.
Kostenko, D.O.; Korotkov, E.V.; Kostenko, D.O.; Korotkov, E.V. Application of the MAHDS Method for Multiple Alignment of
Highly Diverged Amino Acid Sequences. Int. J. Mol. Sci. 2022, 23, 3764. https://doi.org/10.3390/IJMS23073764.
Verma, S.C.; Qian, Z.; Adhya, S.L. Architecture of the Escherichia coli nucleoid. PLoS Genet. 2019, 15, e1008456.
https://doi.org/10.1371/JOURNAL.PGEN.1008456.
Suvorova, Y.M.; Kamionskaya, A.M.; Korotkov, E.V. Search for SINE repeats in the rice genome using correlation-based
position weight matrices. BMC Bioinform. 2021, 22, 42. https://doi.org/10.1186/s12859-021-03977-0.
Frenkel, F.E.E.; Korotkov, E.V. V Classification analysis of triplet periodicity in protein-coding regions of genes. Gene 2008, 421,
52–60. https://doi.org/10.1016/j.gene.2008.06.012.
Suvorova, Y.M.; Korotkov, E.V. Study of triplet periodicity differences inside and between genomes. Stat. Appl. Genet. Mol. Biol.
2015, 14, 113–123. https://doi.org/10.1515/sagmb-2013-0063.
Kahramanoglou, C.; Seshasayee, A.S.N.; Prieto, A.I.; Ibberson, D.; Schmidt, S.; Zimmermann, J.; Benes, V.; Fraser, G.M.;
Luscombe, N.M. Direct and indirect effects of H-NS and Fis on global gene expression control in Escherichia coli. Nucleic Acids
Res. 2011, 39, 2073–2091. https://doi.org/10.1093/NAR/GKQ934.
Prieto, A.I.; Kahramanoglou, C.; Ali, R.M.; Fraser, G.M.; Seshasayee, A.S.N.; Luscombe, N.M. Genomic analysis of DNA binding
and gene regulation by homologous nucleoid-associated proteins IHF and HU in Escherichia coli K12. Nucleic Acids Res. 2012,
40, 3524–3537. https://doi.org/10.1093/NAR/GKR1236.
Quinlan, A.R.; Hall, I.M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26, 841–842.
https://doi.org/10.1093/bioinformatics/btq033.
Trotta, E. The 3-Base Periodicity and Codon Usage of Coding Sequences Are Correlated with Gene Expression at the Level of
Transcription Elongation. PLoS ONE 2011, 6, e21590. https://doi.org/10.1371/JOURNAL.PONE.0021590.
Sánchez, J.; López-Villaseñor, I. A simple model to explain three-base periodicity in coding DNA. FEBS Lett. 2006, 580, 6413–
6422. https://doi.org/10.1016/J.FEBSLET.2006.10.056.
Großmann, P.; Lück, A.; Kaleta, C. Model-based genome-wide determination of RNA chain elongation rates in Escherichia coli.
Sci. Rep. 2017, 7, 1–11. https://doi.org/10.1038/s41598-017-17408-9.
Yevdokimov, Y.M.; Salyanov, V.I.; Nechipurenko, Y.D.; Skuridin, S.G.; Zakharov, M.A.; Spener, F.; Palumbo, M. Molecular
Constructions (Superstructures) with Adjustable Properties Based on Double-Stranded Nucleic Acids. Mol. Biol. 2003, 37, 293–
306. https://doi.org/10.1023/A:1023358008003/METRICS.
Yevdokimov, Y.M.; Salyanov, V.I.; Skuridin, S.G. From liquid crystals to DNA nanoconstructions. Mol. Biol. 2009, 43, 284–300.
https://doi.org/10.1134/S0026893309020113/METRICS.
Skuridin, S.G.; Vereshchagin, F.V.; Salyanov, V.I.; Chulkov, D.P.; Kompanets, O.N.; Yevdokimov, Y.M. Ordering of doublestranded DNA molecules in a cholesteric liquid-crystalline phase and in dispersion particles of this phase. Mol. Biol. 2016, 50,
783–790. https://doi.org/10.1134/S0026893316040129/METRICS.
Int. J. Mol. Sci. 2023, 24, 10964
37.
38.
17 of 17
Pugacheva, V.; Korotkov, A.; Korotkov, E. Search of latent periodicity in amino acid sequences by means of genetic algorithm
and dynamic programming. Stat. Appl. Genet. Mol. Biol. 2016, 15, 381–400. https://doi.org/10.1515/sagmb-2015-0079.
Korotkov, E.V.; Suvorova, Y.M.; Nezhdanova, A.V.; Gaidukova, S.E.; Yakovleva, I.V.; Kamionskaya, A.M.; Korotkova, M.A.
Mathematical Algorithm for Identification of Eukaryotic Promoter Sequences. Symmetry 2021, 13, 917.
https://doi.org/10.3390/SYM13060917.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury
to people or property resulting from any ideas, methods, instructions or products referred to in the content.