UNIVERSITY OF ENGINEERING AND TECHNOLOGY LAHORE, NAROWAL CAMPUS
Assignment #2
Submitted to:
Ma'am Fatima Shahzadi
Submitted by:
Muhammad Ijaz Ahmad
Registration No.:
2023-BME-143
Course Name:
Introduction to Computer Programming for Data Science
Course Code:
CS-103
Department of Biomedical Engineering
Code is:
import numpy as np
import matplotlib.pyplot as plt
import random
# Function to generate a random DNA sequence
def generate_random_dna_sequence(length):
    nucleotides = ['A', 'T', 'C', 'G']
    return ''.join(random.choices(nucleotides, k=length))
# Function to generate a DNA profile in FASTA format for an individual
def generate_dna_profile(name, sequence):
    fasta_format = f">{name}\n{sequence}\n"
    return fasta_format
# Number of individuals and length of DNA sequence
num_individuals = 50
sequence_length = 100
# Generate DNA profiles for different individuals
dna_profiles = {}
for i in range(num_individuals):
    name = f"Individual_{i+1}"
    sequence = generate_random_dna_sequence(sequence_length)
    dna_profiles[name] = sequence
# Write DNA profiles to a FASTA file
fasta_file = "dna_profiles.fasta"
with open(fasta_file, 'w') as file:
    for name, sequence in dna_profiles.items():
        file.write(generate_dna_profile(name, sequence))
print("DNA profiles have been generated and saved in 'dna_profiles.fasta'.")
# Data Loading and Preprocessing
dna_profiles_loaded = {}
with open(fasta_file, 'r') as file:
    lines = file.readlines()
# Each record occupies two lines: a '>' header followed by the sequence
for i in range(0, len(lines), 2):
    name = lines[i].strip()[1:]
    sequence = lines[i+1].strip()
    dna_profiles_loaded[name] = sequence
# Encode DNA sequences into numerical arrays
alphabet = ['A', 'T', 'C', 'G']
encoded_profiles = np.array([[alphabet.index(base) for base in sequence]
                             for sequence in dna_profiles_loaded.values()])
# Generate DNA sample for crime scene simulation
crime_scene_sample = generate_random_dna_sequence(sequence_length)
# Encode the crime scene sample into a numerical array
encoded_crime_scene_sample = np.array([alphabet.index(base) for base in crime_scene_sample])
# Sequence Comparison: count the positions where the sample matches each profile
similarity_scores = [np.sum(encoded_crime_scene_sample == profile) for profile in encoded_profiles]
# Plot the result on a histogram
plt.figure(figsize=(10, 6))
# One bin per integer score from 0 to sequence_length
plt.hist(similarity_scores, bins=np.arange(sequence_length + 2), color='skyblue', edgecolor='black')
plt.xlabel('Similarity Score')
plt.ylabel('Frequency')
plt.title('DNA Profile Similarity to Crime Scene Sample')
plt.show()
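The comparison step in the code yields one score per profile but does not name the closest profile. A minimal sketch of that follow-up, assuming the same A/T/C/G encoding; the helper names `encode` and `best_match` and the small example profiles are illustrative and not part of the original submission:

```python
import numpy as np

ALPHABET = "ATCG"  # same base order as the assignment code


def encode(sequence):
    """Map each base to its index in ALPHABET (hypothetical helper)."""
    return np.array([ALPHABET.index(base) for base in sequence])


def best_match(profiles, sample):
    """Return (name, score) of the profile sharing the most positions with sample.

    Ties are resolved in favour of the first profile in dictionary order.
    """
    names = list(profiles)
    encoded = np.array([encode(profiles[name]) for name in names])
    # Broadcasting compares the sample against every row at once.
    scores = np.sum(encoded == encode(sample), axis=1)
    winner = int(np.argmax(scores))
    return names[winner], int(scores[winner])


# Tiny illustrative data set (made up for this example).
example_profiles = {
    "Individual_1": "ATCGATCG",
    "Individual_2": "ATCGTTTT",
    "Individual_3": "GGGGGGGG",
}
example_sample = "ATCGATCC"

name, score = best_match(example_profiles, example_sample)
print(f"Closest profile: {name} ({score} of {len(example_sample)} positions match)")
```

With the assignment's data, the same call could be made as `best_match(dna_profiles_loaded, crime_scene_sample)`. Since every sequence here is random, the "closest" profile is only a chance match.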
Output is (contents of dna_profiles.fasta; each 100-base sequence is wrapped across two lines here):
>Individual_1
ATACAGGTCCCGTCAGAGAGTTCGCAATGCATCACATGAGAAACGTGGATTCGCATTCTGGCCATAAGATGGGGA
TACGCGAGAGATCACCCCGAATGAA
>Individual_2
CAACATGTAAGTGCAACTGGACCTGAGCGAGAAGTCGGTAGATCTGACAACCCAATCAGTCGTGCCCCAGACTCA
CAAACCCTAGGCCAGGCGGCGGTGA
>Individual_3
AAATACGCGCTGGTATTGCTTGGTATAGGCTTTGCGAGGACTCAAAGTTTTTTCGAGAGTTGCGGAGCATCCGCTT
GGTGCTCAATCGTAATACTTTTCC
>Individual_4
GAGTTAGACCACCGAGGTATCTCTGATACGTGAGAGATCTTAAACCGCGTTCCTGGGGGGTGACAGACTTCAGAG
ACTAGTAAGCAGACGCAGTCCGCGG
>Individual_5
ACCCGATGGAGTCGCGGGGCGGCCCCGGCGCACTGCGGCGGCTCCAAAGAGCCCCGGTCAGGGCAATGATTGACC
ATAGACTGGGTGCGTGGTACGTTCC
>Individual_6
CGGATACACGCCATACCTCAGTAAAAAGTTATGGCGATAAAATAAAGTGTACCCCTTTCCATCTTGTTACTCGCAA
GTTCCTTGAGGGAAAAAATACACT
>Individual_7
TGCCTCGCCCGCTATGCTAATAAGATTTGGCCCCCACAAGCTCATGTCTATCGTGAGGAACATCATCTGTTAGGAC
CGCACGCAAGAGGATATCCAAGAT
>Individual_8
CTCGTGATTTTATTAGTCTTATCCACTCAGGCTTTTGAGTATTTATTTAGGGTACCACGCGCTGCAGAGTTATTCCTG
AATCTAGCACGGTTTCAAGGAG
>Individual_9
CTTGCTCCAATAACATGTTGGCAACTAAATGAACCCGGAAAGCGCTTCTTGGCAGGGAGGGATTAAGGACTAACA
GCCTACATGATCCGAAACGGTTAAG
>Individual_10
GGTCTCGTCAGTTGACCGCCATCAACCCGGGTACTTAACACTTTCCGCGAACACTATGCACTGATCAATTGACCCA
TCTAGAGTGTACCAGATTTGAAAT
>Individual_11
CTGTTTTGTCGAACAAGAACAGTATAGGGCACCAACGAAGGCGACCAAGGGCGGCGGCCCTTCACGTATACACCC
AAGCACCGGAAGTTTAGTTCAGGGA
>Individual_12
AATACTGCCAGTCGGCTGGTTCGCGTTCTATAATTCTAAGACATGATAACCCACGAGAGGTTTTAACGGGCGTGGG
AACATCCAGGTATCGGACCCCTGG
>Individual_13
CCGCTTTCCGACGGAAGATCTAAAGTAAAACCCCTTCGATTCATGATGCTGATCCAAGCTACAAACTGACGTCACA
GCCGCTAGGGGAGGACTAACGGTT
>Individual_14
CATCCGTAACGCGTGGAACCGCAATGGTATTTTAGGGCGTTTAGCTAAATAGCAATTATGGCTGCGTACAATCAGT
TGTGCCGATAGTTCCAATGGCGTG
>Individual_15
TTTCGAGATACGTACCGAAATCGACGCTCTATTGCGTTTCAGTATGTGCCCTGTTCTGGGAAGCTATATCGACTAAA
TGTAGCGACATAAGATGTAACCG
>Individual_16
TTTTAACGGGGGAAGTACTCACCCGGACTTAGGATGCGATACAGGGGGGGATTCATGTTTCTAACCATGAGCGGTC
ACGTGTTGGACTGAGGGTAGGCCC
>Individual_17
GTACTGAGGCAATGGCCGCGTCCCCGTGCTCGACAGTTGAGCAGCAACATGCGTTTGAATCTGTGAAACATCTTGT
TTACGATACTGTTACTTATACCTA
>Individual_18
CGTTGACCTTTTCAGGATCCGTTCACGCTGGGTTTTAACGTGCGCGCTTTATAATGTGGAAGGCGGGGGGGAGTCC
CACACGAGCTATTTACTCCAACCT
>Individual_19
GGAGCGTCACGTCTATAGCTTATGTCCTACTGGGGTCGGCCCATGAAAGAGGAGTATCTATGCGTTCTATGGTAGA
TTCCACGCTAAGACTCGGCCATCA
>Individual_20
GCGTTCCTGTTTCTCCCCTCGACTGGGTAATGGAGCCGACCTGCGAGCTCACGTTATCTTAAAATGAGCTGGTTCCC
AGTACTAAGTCCGGCCGACGTGT
>Individual_21
CCACTGCAAGCAGTTCAAATGCTACCGTGGGAATCGGCACATTTTAGGAGAACTTTTGTGACATTTCGAGACTGTC
AAGCACCCACTGCAAATCAATTAT
>Individual_22
ATAAGTTTACGGAGCAATAGCGTCTACAAAAAATTCAATTGTGCAGGCCCGGTGAACCTCTAGTCGTACATGAACG
CAACGGGTATTGAAGCCGCAAGAT
>Individual_23
TCGATTCGCTATGCCCCATCCACCTCGCAAATAGTCGCGTCCTCGTATACTTACCTTAGACTACACGGAGGTCTAAC
CGTTACGACGTAGGCAATTCGTG
>Individual_24
CGCCCCTCGCTGTGTTTGTCAATAGGTAATTTTTTGAGAACCAACGGCCTTATACTATCTCGGCTATCACACGGTAG
AAGTGGAACTCACACCAGGAGTC
>Individual_25
TGACAGCTCCCCAGAAGAGACACCAGATGTCTATCTAAGTTGTTTAGTCTGTGCTAGTTCTACCAGTAAACATGAT
TGAAAAAACTATCTTATTCTTCTG
>Individual_26
TAATTTGCTGCGCGCGTGCCCCTCTACGTCGGATAAGTAATCAGATGGTACTCCAAGCGTAAATCACTTCTCCATTT
CTACCTTGGGGTCTGATATAGTC
>Individual_27
ATATGAATCAGCTGTTCTGGCTTGTAGTAACGAGGGGCCATATGGAAAGTATCCTGCCAAACGGCAGGTAGAAAT
CACAGCGTCCGAGCTTACCATGGTT
>Individual_28
TAGTTCAGACGGAAGGGGGGAAACTCCAAGGGTCCGCAAGTCCAAAATGAGCACGCATGCCCAACTATACTGCAC
ATAAACTCATGTTTCGCGCGTTCGC
>Individual_29
AACAGACGTTTGCTAAAAGTGCATAAAGTCGGACCGCGCTGATTTAGTACGCAGGCCGGAATGAGACATAACAGG
ACTACAAGACTCTACAACCCGAGAT
>Individual_30
CAGTTTCAATAAGATAAGGCCAGTCTAATGGAGAATGAATCTGCAACTCCCTAAAGAAGTGGTGGCGCACGTCCG
TGGAGGCCAGCGCCCTAGAGATATC
>Individual_31
TTCTCCCCTAGCAGTAAATTCTTCAGACACCAGCTGGTACCTTAGTGAAGTAAAAGTGAACTTTCATTTGGTTACAG
CCCGGCAAGGATATACGGCTGAG
>Individual_32
AACTATATAGGAAGTCTCAGCCACCACAAAAGTTACGGGCAGCGGGGGTGCTCCGTCAAGTCCGAGAAGGCGAAT
AGCGCTGATAATTTATGTCACACTC
>Individual_33
AGGGAGGAAGTCCGAGAAAACAGTAATAAATACCTCGGGCGTAATAGATAAAGTACAAGCAATTGTCGTTAGTCA
ATCGATTGCGTGGTGGAGACTGCCG
>Individual_34
AGCTGGCCGGGCATGCGTTGCGTGGCCAAGCTTAGTACTCATGTAGCCGACGGAGCCTCATCAAACTGAGCGTCA
ATTAGTTGGAGGGTTGTAGTTAATA
>Individual_35
AGGTAGAGCTCACACAGGTATAAAGTGCTCAGTCAAAGGCAGGCCATAATCGCGGACGATTAATACCCATATCCA
TGCGAGTCCGTGGAGAGATCGTACA
>Individual_36
CAAATCTATGGGCCACCTAGCATAGCACACTGAAGAACGCGCATAAAGGTAGATACTAAGCGTTTATGGGATGTT
TTCGGGTTAGCGGCTAACTCATAAA
>Individual_37
GACTTCGAGCAAGACGCACTTAATCGATAATTGCCGCCTGTTTGGGTGCTCGATACATTGGGTCACACGCCTCCAT
GGGGAGTCGTGAGCGAAGGTCTGG
>Individual_38
CCCGATATTGGTTTTAAGCTCCACCCTTCACACGAGGTGCTAGAACCGTAGAGTTCCCTCGTATAGAGCTTTAAGT
GATTGAGATTAGGTGGAGGATTCC
>Individual_39
TTGGCTATCTGGACTTTATCTAAGTCTGAACGCTATCTAGATTGTATTTGCGCGGATTGAATGCAATCTCCGAATAG
ATGGCGCAGTGGGCAAAATACCC
>Individual_40
AGTACCCAAAATGTACGCCGACCACACCAGAAAAGTTTAGACTTGTTATCAGATACTTCGCAGGATTGGTGGGAC
GACACTCGCACGTTGTAATTTCTCT
>Individual_41
ATCTGATCATATTTTTGACTGTGAACGTATATACGTCCAGTCGCTTGGTTTATATTCTCCACGTCGCTTTGGCTCCAC
AGGCCGACTCGTAGTTCGGTGT
>Individual_42
CTTGGTTGTGGTCCGCCAAGGACATCTCACGCTCCAAGAGATGGGACTGGCCAATAGGCGCAGACAAAACTTCTC
CACCTCCAGCTTAAGGCAGGATCTT
>Individual_43
GGTACCAAGTTATCGTGTGTGTTCGCTAGGAAATTAGTTTTTAGTCATCGACAGGATTGCGTCCTACACTATTATGA
AAACAAGTCATAGGGCATAGCAC
>Individual_44
GATCCACCAATACTACCCATAGTATCAGTCCAACGCTCTCCCACAACGCGGGCTTGGGGGCACTTACGAGAGGGG
GTGGAACTAGGGGCTATCCTCGAAA
>Individual_45
TCCGCTGATAAATAACGACGTCAAGAAAAGCTCTGAAATATCCCTAAGTATTAATATGTTTTTGAGCTCTACGGCC
ATCAGTCACCCTACTAGCGTAGTC
>Individual_46
CTTCCCAAAAACAAACCTGTAAAGGCTCTTTCCGCTATAAGCGGATCTTCTTAAGAATTGTTCAGTCGCGCTTCACC
ATCTGCTGTGTATTACCAAAGTC
>Individual_47
TCTATACAATAACTGTGGCGTAGCAATACACTGACAAGCGGCTTTTTATGTAGCGGTCGGGCTTCCCTACTCAGAT
TGCCAGTAAATTACGGTCGTCTGT
>Individual_48
CTTGCATTCGCGCTTGTATGCTCTCTGTACTGATGAGAAAGGTCCCATCCTAGTGGCGCGGTACGTGTTTTCAGCCG
AAAAGGCCGTAACGTGGGGACGA
>Individual_49
TAGAGGCCCAGCAAACGTACTCCCTGTAGGTTACTGACGGACTCGCGTATGGCGATGCTTTGTAGGCATATTCCGT
AATCTTGATTAGAGTTGGTAAATA
>Individual_50
CAGAACATGAAACGGCGGCCAGAGAACGTACGGTCGCTTAGCAGACGCCCAGTTGGATTGTCGTAGAAACAGACG
CGCGATGTCCTTAGCCACAGGTTAC
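The loading step in the code assumes exactly two lines per record (a header, then the sequence). In the listing above each sequence is shown across two lines, and FASTA files in general may wrap sequences, so a reader that joins continuation lines is more forgiving. A minimal sketch; `read_fasta` and the short example records are illustrative names and data, not part of the original submission:

```python
def read_fasta(lines):
    """Parse FASTA text lines into {name: sequence}, joining wrapped sequence lines."""
    profiles = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header starts a new record.
            name = line[1:]
            profiles[name] = ""
        elif name is not None:
            # Any other line continues the current record's sequence.
            profiles[name] += line
    return profiles


# Short made-up records; the first sequence is split over two lines.
example_lines = [
    ">Individual_1\n",
    "ATACAG\n",
    "GTCC\n",
    ">Individual_2\n",
    "CAACATGTAA\n",
]
parsed = read_fasta(example_lines)
print(parsed)
```

Because an open file object iterates over its lines, the function could also be called as `read_fasta(file)` inside a `with open(...)` block.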
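Graph is: histogram titled 'DNA Profile Similarity to Crime Scene Sample' (x-axis: Similarity Score, y-axis: Frequency); the image is not included here.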
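As a rough guide to what the histogram should look like: the profiles and the crime scene sample are both drawn uniformly at random from four bases, so each position matches with probability 1/4, and the score over 100 positions follows a binomial distribution with mean 25 and standard deviation of about 4.3. The sketch below checks this by simulation; the seed, the number of trials, and the variable names are assumptions chosen for illustration:

```python
import numpy as np

sequence_length = 100
match_probability = 1 / 4  # four equally likely bases

# Theoretical mean and standard deviation of a Binomial(n, p) score.
expected_mean = sequence_length * match_probability
expected_std = (sequence_length * match_probability * (1 - match_probability)) ** 0.5

# Simulation: compare one random sample against many random profiles.
rng = np.random.default_rng(seed=0)  # fixed seed only for reproducibility
num_trials = 10_000
sample = rng.integers(0, 4, size=sequence_length)
profiles = rng.integers(0, 4, size=(num_trials, sequence_length))
scores = np.sum(profiles == sample, axis=1)

print(f"Expected mean {expected_mean:.1f}, simulated mean {scores.mean():.2f}")
print(f"Expected std  {expected_std:.2f}, simulated std  {scores.std():.2f}")
```

With only 50 profiles, as in the assignment, the histogram will be noisier, but most scores should still fall roughly between 15 and 35.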