Assignment 1
1- Write a program that outputs a list of indices of ALL occurrences of a substring in a sequence.
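One possible approach to problem 1 is sketched below. It assumes 0-based indices and that overlapping occurrences should be reported; the function name `find_all` is illustrative, and the EcoRI motif is used only as a sample substring.

```python
def find_all(sequence, substring):
    """Return the 0-based start index of every occurrence, overlaps included."""
    if not substring:
        raise ValueError("substring must not be empty")
    indices = []
    start = sequence.find(substring)
    while start != -1:
        indices.append(start)
        # Resume one position later so overlapping matches are also found.
        start = sequence.find(substring, start + 1)
    return indices


dna = "ACTGATCGAATTCGTATAGTAGAATTCTATCATACAGAATTCTATATATCGATGCGGAATTCTTCAT"
print(find_all(dna, "GAATTC"))  # [7, 21, 36, 56]
print(find_all("AAAA", "AA"))   # [0, 1, 2]
```

If 1-based positions are wanted instead, add 1 to each index before printing.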
2- The sequence below contains recognition sites for the EcoRI restriction enzyme, which cuts at the motif
G*AATTC (the asterisk marks the position of the cut). Write a program that calculates and outputs the
sizes of all the fragments produced when the DNA sequence is digested with EcoRI.
ACTGATCGAATTCGTATAGTAGAATTCTATCATACAGAATTCTATATATCGATGCGGAATTCTTCAT
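A minimal sketch of the digest, assuming a linear (not circular) molecule and a single strand: every occurrence of GAATTC is located, the cut is placed one base into the motif (after the G), and the sequence is sliced at those positions. The function name `digest` and its parameters are illustrative.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Split a linear sequence at every occurrence of a recognition site.

    cut_offset is the number of bases of the site left on the upstream
    fragment (1 for G*AATTC).
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    # Fragment boundaries: start of sequence, each cut, end of sequence.
    boundaries = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(boundaries, boundaries[1:])]


dna = "ACTGATCGAATTCGTATAGTAGAATTCTATCATACAGAATTCTATATATCGATGCGGAATTCTTCAT"
fragments = digest(dna)
for fragment in fragments:
    print(len(fragment), fragment)
# Fragment sizes: 8, 14, 15, 20, 10
```

The fragment sizes add up to the length of the input (67), which is a quick sanity check on the cut positions.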
3- Write a program that asks for a DNA sequence and translates it into protein. When a codon does not
code for an amino acid (e.g., a stop codon), use “*”. Ignore the extra bases if the sequence length is not a
multiple of 3. Use the file “table.txt” for the translation.
"CCGGAACCGACCATTGATGAG"
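The translation step can be sketched as follows. The format of “table.txt” is not given here, so this sketch does not read it; as a stand-in it builds the standard genetic code in memory (an assumption), and the names `CODON_TABLE` and `translate` are illustrative. Any codon missing from the table, such as one containing N, is also written as “*”.

```python
from itertools import product

# Stand-in for table.txt (assumption): the standard genetic code, listed in
# TCAG order for each codon position. "*" marks the stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, table=CODON_TABLE):
    """Translate DNA codon by codon, ignoring trailing bases."""
    dna = dna.strip().upper()
    usable = len(dna) - len(dna) % 3  # drop the incomplete last codon
    return "".join(
        table.get(dna[i:i + 3], "*") for i in range(0, usable, 3)
    )


print(translate("CCGGAACCGACCATTGATGAG"))  # PEPTIDE
```

To prompt the user as the problem asks, the call could be wrapped as `translate(input("DNA sequence: "))`; a dictionary loaded from “table.txt” can be passed as the `table` argument once its format is known.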
4- Write a program to find the longest common subsequence (not a substring) in a list of sequences.
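A hedged sketch for problem 4: a memoized recursion over one position per sequence. If all current characters agree, that character is taken; otherwise one sequence at a time is advanced and the longest result is kept. The running time grows with the product of the sequence lengths, and the recursion depth grows with their sum, so this is only practical for a few short sequences. The names `longest_common_subsequence` and `is_subsequence`, and the sample sequences, are illustrative.

```python
from functools import lru_cache


def longest_common_subsequence(sequences):
    """Return one longest subsequence common to all given sequences."""
    if not sequences:
        return ""
    seqs = tuple(sequences)

    @lru_cache(maxsize=None)
    def solve(positions):
        # An exhausted sequence means nothing more can be shared.
        if any(p == len(s) for p, s in zip(positions, seqs)):
            return ""
        heads = {s[p] for p, s in zip(positions, seqs)}
        if len(heads) == 1:
            # All sequences agree on the current character: take it.
            return heads.pop() + solve(tuple(p + 1 for p in positions))
        # Otherwise skip the current character of one sequence at a time.
        candidates = (
            solve(positions[:i] + (p + 1,) + positions[i + 1:])
            for i, p in enumerate(positions)
        )
        return max(candidates, key=len)

    return solve((0,) * len(seqs))


def is_subsequence(candidate, sequence):
    """Check that candidate appears in sequence, in order, with gaps allowed."""
    remaining = iter(sequence)
    return all(ch in remaining for ch in candidate)


samples = ["ACCGGTA", "ACGTA", "AGCTA"]
result = longest_common_subsequence(samples)
print(result, all(is_subsequence(result, s) for s in samples))
```

When several subsequences share the maximum length, only one of them is returned.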
6- Given two sequences S and T of equal length, the Hamming distance between S and T, denoted by
H(S,T), is the number of corresponding symbols that differ in S and T. Write a program that takes two
sequences and calculates the Hamming distance between them.
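The definition of H(S,T) translates almost directly into code. The sketch below assumes that sequences of unequal length should be rejected, since the definition only covers equal lengths; the function name and the sample strings are illustrative.

```python
def hamming_distance(s, t):
    """Count the positions at which s and t differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


print(hamming_distance("ACGT", "ACCT"))  # 1
print(hamming_distance("AAAA", "TTTT"))  # 4
```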
7- Given an integer k, we define the frequency array of a string Text as an array of length 4^k, where the
i-th element of the array holds the number of times that the i-th k-mer (in lexicographic order)
appears in Text. Write a program that computes the frequency array of Text.
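A small sketch of the frequency array, assuming the alphabet A, C, G, T and that overlapping occurrences are counted. `itertools.product` over a sorted alphabet yields the k-mers in lexicographic order, so the list position of each k-mer is its index in the array. The function name `frequency_array` is illustrative.

```python
from itertools import product


def frequency_array(text, k, alphabet="ACGT"):
    """Return counts of all len(alphabet)**k k-mers in lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    # Slide a window of width k over the text, one position at a time.
    for start in range(len(text) - k + 1):
        kmer = text[start:start + k]
        if kmer in index:  # skip windows with symbols outside the alphabet
            counts[index[kmer]] += 1
    return counts


text = "AATGATCGATCGTACGCTGA"
print(frequency_array(text, 1))       # [6, 4, 5, 5] for A, C, G, T
print(len(frequency_array(text, 3)))  # 64
```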
A:
I. Write a program that builds a dictionary containing all possible trinucleotides and their number of
occurrences in a DNA sequence.
Input:
sequence = "AATGATCGATCGTACGCTGA"
II. From the dictionary, retrieve the number of occurrences of these examples: ('CGA', 'AAT', 'TCG')
Hint: Assume that there is one reading frame that starts from the first nucleotide
Output:
CGA 1
AAT 1
TCG 2
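Part A can be sketched as below. One assumption needs stating: the listed output (TCG 2) is reproduced when every overlapping window of three bases is counted, starting from the first nucleotide, so that is the reading used here. The function name `trinucleotide_counts` is illustrative.

```python
from itertools import product


def trinucleotide_counts(sequence):
    """Map each of the 64 trinucleotides to its number of occurrences."""
    # Start every possible trinucleotide at zero so absent ones are present too.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=3)}
    # Assumption: overlapping windows, beginning at the first nucleotide.
    for start in range(len(sequence) - 2):
        window = sequence[start:start + 3]
        if window in counts:
            counts[window] += 1
    return counts


sequence = "AATGATCGATCGTACGCTGA"
counts = trinucleotide_counts(sequence)
for trinucleotide in ("CGA", "AAT", "TCG"):
    print(trinucleotide, counts[trinucleotide])
# CGA 1
# AAT 1
# TCG 2
```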
B:
Input:
sequence = "AATGATCGATCGTACGCTGA"
Output:
s is AATGATCGATCGTACGCTGA
CGA 1
AAT 1
TCG 2
s is ATGATCGATCGTACGCTGA
s is TGATCGATCGTACGCTGA
TCG 2
9- Given an RNA string s, we will augment the bonding graph of s by adding base pair edges connecting
all occurrences of 'U' to all occurrences of 'G' in order to represent possible wobble base pairs. We say
that a matching in the bonding graph for s is valid if it is noncrossing (to prevent pseudoknots) and has
the property that a base pair edge in the matching cannot connect symbols s_j and s_k unless
k ≥ j + 4 (to prevent nearby nucleotides from base pairing).
See Figure 1 for an example of a valid matching if we allow wobble base pairs. In this problem, we
wish to count all possible valid matchings in a given bonding graph.
See Figure 2 for all possible valid matchings in a small bonding graph, assuming that we allow wobble
base pairing.
Return: The total number of distinct valid matchings of base pair edges in the bonding graph of s.
Assume that wobble base pairing is allowed.
Input:
AUGCUAGUACGGAGCGAGUCUAGCGAGCGAUGUCGUGAGUACUAUAUAUGCGCAUAAGCCACGU
Output:
284850219977421
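One common way to count noncrossing matchings is a memoized recursion over intervals, sketched below under the stated rules: the first base of an interval is either left unpaired, or paired with a compatible base at least four positions to its right, which splits the interval into two independent parts. The empty matching is counted as valid. The names `PAIRS` and `count_matchings` are illustrative, not part of the assignment.

```python
from functools import lru_cache

# Watson-Crick pairs plus the G-U wobble pair, in both orientations.
PAIRS = {
    ("A", "U"), ("U", "A"),
    ("C", "G"), ("G", "C"),
    ("G", "U"), ("U", "G"),
}


def count_matchings(rna, min_gap=4):
    """Count valid noncrossing matchings, the empty matching included."""

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of valid matchings within rna[i..j], both ends inclusive.
        if j - i < min_gap:
            return 1  # too short for any pair: only the empty matching
        total = count(i + 1, j)  # case 1: base i stays unpaired
        for k in range(i + min_gap, j + 1):
            if (rna[i], rna[k]) in PAIRS:
                # case 2: i pairs with k; inside and outside are independent
                total += count(i + 1, k - 1) * count(k + 1, j)
        return total

    return count(0, len(rna) - 1)


print(count_matchings("AAAAU"))  # 2: the empty matching and the A-U pair
print(count_matchings("GAAAU"))  # 2: the G-U wobble pair is allowed
print(count_matchings("CAAAU"))  # 1: C and U do not pair
```

Calling `count_matchings` on the input string given above lets the result be compared with the expected output; Python integers are unbounded, so the large count needs no special handling.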