BIOINFORMATICS (FINAL)


BIOINFORMATICS

PAPER CODE: DSE 2

SUBMITTED BY-
NAME: DIP DAS
ROLL NO: BTN/20/711
DEPARTMENT: BIOTECHNOLOGY
ST. XAVIER’S COLLEGE, BURDWAN
ACADEMIC YEAR: 2022-2023

CONTENTS
SL. No.  Topic                                                          Date
1        Introduction to bioinformatics                                 08.08.2022
2        Sequence Information Resource                                  08.08.2022
3        Protein Information Resource                                   08.08.2022
4        Searching a particular protein through NCBI                    17.08.2022
5        Searching a particular protein through Swiss-Prot and TrEMBL   17.08.2022
6        Understanding and using PDB                                    24.08.2022
7        Searching a particular nucleotide through NCBI                 16.09.2022
8        Searching a particular genome through NCBI                     18.10.2022
9        BLAST with single sequence of nucleotide                       21.10.2022
10       BLAST with double sequence of nucleotide                       21.10.2022
11       Multiple Sequence Alignment using Clustal Omega                01.12.2022

INTRODUCTION TO BIOINFORMATICS

Bioinformatics is an interdisciplinary research area at the interface between biology and computer
science. Most bioinformatics work deals with biological data or with the organization of biological
information. As a consequence of the large amount of data produced in the field of molecular biology, most
current bioinformatics projects deal with the structural and functional aspects of genes and proteins.
Computational biology and bioinformatics are multidisciplinary fields, involving research from different
areas of speciality, including statistics, computer science, physics, biochemistry, molecular biology, and
mathematics. The goals of these two fields are as follows:
• Bioinformatics: Typically refers to the field concerned with the collection and storage of biological
information. All matters concerning biological databases are considered bioinformatics. It covers research,
development, or application of computational tools and approaches for expanding the use of
biological, medical, behavioural or health data, including those to acquire, store, organize, archive,
analyse or visualize such data.

• Computational biology: Refers to the development of the algorithms and statistical models necessary to
analyse biological data with the aid of computers. It covers the development and application of data-analytical
and theoretical methods, mathematical modelling and computational simulation techniques to the study
of biological, behavioural and social systems.

AIM OF BIOINFORMATICS:
• The aims of bioinformatics are threefold.

• Firstly, at its simplest, bioinformatics organises data in a way that allows researchers to access existing
information and to submit new entries as they are produced.

• Secondly, while data curation is an essential task, the information stored in these databases is
essentially useless until analysed. The second aim is therefore to develop tools and resources that aid
in the analysis of the data.

• Thirdly, the aim is to use these tools to analyse the data and interpret the results in a biologically
meaningful manner.

CURRENT RESEARCH IN BIOINFORMATICS:


I. Genomics: Genomics is the study of an organism’s genome.
II. Proteomics: It is defined as the study of the proteome. The proteome refers to the entire set of proteins
expressed in a cell.
III. Computer-aided drug designing: Computer-aided drug designing is a specialized discipline that uses
computational methods to simulate drug-receptor interactions.
IV. Biological data mining: It is the discovery of useful knowledge from biological databases. Data mining
employs algorithms and techniques from statistics, machine learning, artificial intelligence, databases, data
warehousing, etc.
V. Microarray informatics: Microarray technology is a powerful tool for monitoring gene expression, or changes
in gene expression, of hundreds or thousands of genes in a single experiment.
VI. Systems biology: It studies a biological system by systematically perturbing it; monitoring the gene, protein,
and informational pathway responses; integrating these data; and ultimately formulating mathematical models
that describe the structure of the system and its response to individual perturbations.
VII. Molecular phylogenetics: It is the study of organisms at the molecular level to gather information about the
phylogenetic relationships between different organisms.

SEQUENCE INFORMATION RESOURCE

Database:
A database is like an e-book (a book on the internet) in which we can store, search and retrieve any type of
data. The data can be DNA sequences, RNA sequences, protein sequences, genome information, proteome
information, transcriptome information, etc. For example, if “ACGTCAAGA” is the gene sequence of a protein “X”,
it can be looked up in a genomic database on the internet. We can then find out how many organisms
discovered so far carry this sequence, or a related sequence, for protein X; what the amino acid
composition and 3D structure of this protein are in different organisms; what the coding regions of
this gene are; on which chromosome it is present; and so on.
Types of databases:
A vast amount of data has been obtained from genome and peptide sequencing of various organisms, more than
could be stored in a single database. Therefore, different types of databases were constructed. At present,
three types of databases are known:

I. Primary databases:

These are the primary storehouses of sequence information. If a researcher works on a new organism, plant
or microbe and obtains sequence information that is new to the world, then he/she should submit the
data to a primary database. Every year a large number of researchers, sequencing labs and industries contribute a
large amount of data to primary data repositories on the internet. All these databases are freely available on the
internet.
Primary databases can be of two types:

a) Nucleic acid databases: These are the primary depots for nucleic acid sequences. Examples include
GenBank (maintained by NCBI), EMBL and DDBJ (DNA Data Bank of Japan).
b) Protein databases: These are the primary depots for protein sequences. Examples include SWISS-PROT,
PIR and PDB.

II. Secondary databases:

These databases combine the data of primary databases, add more relevant data and then re-publish this data on
the internet. TrEMBL, Pfam, PROSITE, CATH are some good examples.

III. Composite databases

In these databases, sequences from different databases are gathered together in a single database. By using
a composite database, the user is freed from the tedious task of gathering information from multiple
sources. For example, the HIV database contains collective information about HIV from different databases. It is
important to note that each composite database has its own format, which is defined by the creators of that
particular database. OWL, MIPSX and NRDB are examples of composite databases.

• Redundant database: When the same information is found in two different places in a database, the database
is called redundant. Why would the same information be present multiple times? The answer is that the same
data is submitted by different researchers around the globe working on the same organism.
• Non-redundant database: Only a single entry for each piece of data is found; the data is not repeated
anywhere. OWL is an example of a non-redundant database. Curated databases generally aim to be
non-redundant: when a particular sequence is submitted twice by different users, the submissions are
merged into a single entry, since the redundancy serves no purpose.
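
The merging of repeated submissions can be illustrated with a minimal Python sketch. The function name, the submitter labels and the example sequences are hypothetical and chosen only for illustration; real databases apply far more elaborate curation rules than exact string matching.

```python
def merge_redundant(entries):
    """Collapse (submitter, sequence) pairs so each distinct sequence appears once.

    Returns a dict mapping each sequence to the list of its submitters,
    in order of first submission.
    """
    merged = {}
    for submitter, sequence in entries:
        # Treat case and surrounding-whitespace variants as the same sequence.
        key = sequence.strip().upper()
        merged.setdefault(key, []).append(submitter)
    return merged


# Hypothetical submissions: two labs deposit the same sequence.
submissions = [
    ("lab_A", "ACGTCAAGA"),
    ("lab_B", "acgtcaaga"),
    ("lab_C", "TTGACCGTA"),
]
non_redundant = merge_redundant(submissions)
print(len(submissions), "entries ->", len(non_redundant), "unique sequences")
```

Here three submitted entries collapse to two records, and the merged record keeps track of both submitters.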

PROTEIN INFORMATION RESOURCE

Introduction:
The accelerating pace of genome sequencing projects has greatly increased the volume and complexity of
available molecular data. To realize the fullest possible value from the data and to gain a better understanding of
the genome, databases and the computational tools for analysing them are required to allow biologically
relevant features in the sequences to be identified and to provide insight on their structure and function. For over
30 years, the Protein Information Resource (PIR) has been providing the scientific community with databases
and tools for the organization and analysis of protein sequence data (1,2). Together with MIPS and JIPID, PIR
has undertaken a major restructuring to meet the challenges presented by the rapid growth of largely
uncharacterized sequence data and the opportunities provided by the nearly universal access of scientists to the
resources available on the WWW. Among the key developments are complete protein family organization for
the PIR-International Protein Sequence Database (PSD) and integrated WWW interfaces for user-friendly
sequence analysis, database searching and information retrieval.

THE PIR-INTERNATIONAL PROTEIN DATABASES

• PIR, MIPS and JIPID constitute the PIR-International consortium that maintains the PIR-International
Protein Sequence Database (PSD), the largest publicly distributed and freely available protein sequence
database. The database has the following distinguishing features.

• It is a comprehensive, annotated, and non-redundant protein sequence database, containing over 142 000
sequences as of September 1999. Included are sequences from the completely sequenced genomes of 16
prokaryotes, six archaebacteria, 17 viruses and phages, >100 eukaryote organelles and Saccharomyces
cerevisiae.

• The collection is well organized with >99% of entries classified by protein family and >57% classified by
protein superfamily.

• PSD annotation includes concurrent cross-references to other sequence, structure, genomic and citation
databases, including the public nucleic acid sequence databases, ENTREZ, MEDLINE, PDB, GDB,
OMIM, FlyBase, MIPS/Yeast, SGD/Yeast, MIPS/Arabidopsis and TIGR. Where these databases are
publicly and freely accessible and provide suitable WWW access, the cross-references presented on the
PIR WWW site are hot-linked so that searchers can consult the most current data.

• The PIR is the only sequence database to provide context cross-references between its own database
entries. These cross-references assist searchers in exploring relationships such as subunit associations in
molecular complexes, enzyme–substrate interactions, activation and regulation cascades, as well as in
browsing entries with shared features and annotations.

• Interim updates are made publicly available on a weekly basis, and full releases have been published
quarterly since 1984. In addition to the PSD, PIR-International distributes or provides WWW access to
other sequence and auxiliary databases (Table 1), briefly described below, and maintains several internal
data collections used for sequence annotation and integrity checks.

• PATCHX (3) is a non-redundant database assembled by MIPS of publicly available protein sequences not
yet in the PIR-International PSD. PIR+PATCHX, a combination of the PSD and PATCHX containing
~300 000 sequences available for similarity searches, is the most complete non-redundant collection of
protein sequences available in the public domain.

• ARCHIVE is a database of protein sequences as originally reported in a publication or submission, the
only such collection of ‘as published’ unmerged sequences.

• NRL_3D (4) sequence-structure database is produced from sequence and annotation in the Protein Data
Bank (PDB) of three-dimensional structures.

• FAMBASE is a collection of representative sequences from each protein family that can be used in a
similarity search to reduce search time and improve sensitivity for identifying distant families.

• PIR-ALN (6) is a curated database of sequence alignments of superfamilies, families and homology
domains, with annotation information derived from PSD and consensus patterns calculated from the
alignments.

• RESID (7) is a database of post-translational modifications with descriptive, chemical, structural and
bibliographic information based on feature information in the PSD.

• ProClass (8) is a protein family database that organizes non-redundant PIR-International PSD and
SWISS-PROT sequences according to PIR superfamilies and PROSITE patterns.

• ProtFam (9) is a curated database of homology clusters with automatically generated multiple sequence
alignments for families, superfamilies and homology domains.

• To support both data management and data mining and assist knowledge discovery, the PIR databases are
being migrated to an object-relational database management system. A three-tier network computing
architecture provides a framework for distributed object computing, and Java-based WWW interfaces
connect with the database server for database query and update tasks.

SEARCHING OF A PARTICULAR PROTEIN THROUGH NCBI

Procedure:

1. The NCBI homepage was opened.
2. The protein name (albumin) was entered in the search box.
3. From the hit list obtained, the sequence of interest (albumin) was selected.
4. The result for the sequence of interest was noted.
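
The same kind of search can also be expressed programmatically. The following is a minimal Python sketch that only builds the request URL for NCBI's E-utilities ESearch service and does not contact the server. The function name and the `retmax` value are assumptions made for illustration and are not part of the practical above.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, db="protein", retmax=20):
    """Return an ESearch URL that looks up `term` in the given Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = build_esearch_url("albumin")
print(url)
```

Opening the printed URL in a browser should return an XML list of matching record identifiers, which corresponds roughly to the hit list seen on the website.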

SEARCHING OF A PARTICULAR PROTEIN THROUGH SWISS-PROT AND TrEMBL

Procedure:

1. The UniProt homepage was opened.
2. The UniProtKB section was selected.
3. The protein name (albumin) was entered in the search box.
4. From the hit list obtained, the sequence of interest (albumin) was selected.
5. The result for the sequence of interest was noted.

FASTA file format:


>sp|P49065|ALBU_RABIT Albumin OS=Oryctolagus cuniculus OX=9986 GN=ALB PE=1 SV=2
MKWVTFISLLFLFSSAYSRGVFRREAHKSEIAHRFNDVGEEHFIGLVLITFSQYLQKCPY
EEHAKLVKEVTDLAKACVADESAANCDKSLHDIFGDKICALPSLRDTYGDVADCCEKKEP
ERNECFLHHKDDKPDLPPFARPEADVLCKAFHDDEKAFFGHYLYEVARRHPYFYAPELLY
YAQKYKAILTECCEAADKGACLTPKLDALEGKSLISAAQERLRCASIQKFGDRAYKAWAL
VRLSQRFPKADFTDISKIVTDLTKVHKECCHGDLLECADDRADLAKYMCEHQETISSHLK
ECCDKPILEKAHCIYGLHNDETPAGLPAVAEEFVEDKDVCKNYEEAKDLFLGKFLYEYSR
RHPDYSVVLLLRLGKAYEATLKKCCATDDPHACYAKVLDEFQPLVDEPKNLVKQNCELYE
QLGDYNFQNALLVRYTKKVPQVSTPTLVEISRSLGKVGSKCCKHPEAERLPCVEDYLSVV
LNRLCVLHEKTPVSEKVTKCCSESLVDRRPCFSALGPDETYVPKEFNAETFTFHADICTL
PETERKIKKQTALVELVKHKPHATNDQLKTVVGEFTALLDKCCSAEDKEACFAVEGPKLV
ESSKATLG
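
The header line of the FASTA record above packs several pieces of information into one line: the database (`sp` for Swiss-Prot, `tr` for TrEMBL), the accession, the entry name, the protein name, and key=value fields such as the organism (OS), taxonomy identifier (OX) and gene name (GN). A minimal Python sketch for splitting such a header is given below. The function name is hypothetical, and the parsing assumes the header follows exactly the layout shown above.

```python
import re


def parse_uniprot_header(header):
    """Split a UniProtKB FASTA header line into its labelled parts."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    ids, _, description = header[1:].partition(" ")
    db, accession, entry_name = ids.split("|")
    # Key=value fields (OS, OX, GN, PE, SV); each value runs until the next key.
    fields = dict(re.findall(r"\b([A-Z]{2})=(.+?)(?=\s[A-Z]{2}=|$)", description))
    # The protein name is everything before the first key=value field.
    protein_name = re.split(r"\s[A-Z]{2}=", description, maxsplit=1)[0]
    return {"db": db, "accession": accession, "entry_name": entry_name,
            "protein_name": protein_name, **fields}


header = ">sp|P49065|ALBU_RABIT Albumin OS=Oryctolagus cuniculus OX=9986 GN=ALB PE=1 SV=2"
record = parse_uniprot_header(header)
print(record["accession"], record["OS"], record["GN"])
```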

UNDERSTANDING AND USING PDB

SEARCHING A PARTICULAR PROTEIN THROUGH PDB:

Procedure:
1. The PDB homepage was opened.
2. The protein name (albumin) was entered in the search box.
3. From the hit list obtained, the entry of interest (bovine serum albumin) was selected.
4. The structure of the protein was observed in various ways.

SEARCHING OF A PARTICULAR NUCLEOTIDE THROUGH NCBI

Procedure:

1. The NCBI homepage was opened.
2. The ‘Nucleotide’ database was chosen.
3. The query was entered in the search box.
4. The ‘Go’ option was clicked.
5. From the hit list obtained, the sequence of interest was selected.

Nucleotide Database:
ORGANISM: Gemmata massiliana

ACCESSION: LR593886

source 1..960
/organism="Gemmata massiliana"
/mol_type="genomic DNA"
/isolate="Soil9"
/isolation_source="soil"
/db_xref="taxon:1210884"
/chromosome="1"

CDS 1..960
/locus_tag="SOIL9_70140"
/note="BLAST_uniprot:hit_1;
ACCESSION=tr|W7W0Z7|W7W0Z7_9ACTO;
ALN/Q_length_ratio=0.969; DESCRIPTION=Chromosomal
replication initiator protein dnaA (Fragment)
OS=Micromonospora sp. M42 GN=MCBG_03933 PE=3 SV=1;
EVALUE=3e-36; Q/S_length_ratio=0.901;
BLAST_uniprot:hit_2; ACCESSION=tr|G8S340|G8S340_ACTS5;
ALN/Q_length_ratio=0.931; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Actinoplanes sp.
(strain ATCC 31044 / CBS 674.73 / SE50/110) GN=dnaA PE=3
SV=1; EVALUE=5e-36; Q/S_length_ratio=0.860;
BLAST_uniprot:hit_3; ACCESSION=tr|R2SGE0|R2SGE0_9ENTE;
ALN/Q_length_ratio=0.981; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Enterococcus asini
ATCC 700915 GN=dnaA PE=3 SV=1; EVALUE=4e-35;
Q/S_length_ratio=0.718;
BLAST_uniprot:hit_4; ACCESSION=tr|H4GIG8|H4GIG8_9LACO;
ALN/Q_length_ratio=1.016; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Lactobacillus
gastricus PS3 GN=dnaA PE=3 SV=1; EVALUE=1e-34;
Q/S_length_ratio=0.728;
BLAST_uniprot:hit_5;
ACCESSION=tr|A0A062XB61|A0A062XB61_9LACO;
ALN/Q_length_ratio=1.041; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Lactobacillus
animalis GN=dnaA PE=3 SV=1; EVALUE=2e-34;
Q/S_length_ratio=0.712;
BLAST_uniprot:hit_7; ACCESSION=tr|T2TC53|T2TC53_PEPDI;
ALN/Q_length_ratio=0.962; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Peptoclostridium
difficile CD9 GN=dnaA PE=3 SV=1; EVALUE=2e-34;
Q/S_length_ratio=0.715;
BLAST_uniprot:hit_8; ACCESSION=tr|K9IAN4|K9IAN4_9LACO;
ALN/Q_length_ratio=0.962; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Pediococcus lolii
NGRI 0510Q GN=dnaA PE=3 SV=1; EVALUE=2e-34;
Q/S_length_ratio=0.715;
BLAST_uniprot:hit_10; ACCESSION=tr|D2EH16|D2EH16_PEDAC;
ALN/Q_length_ratio=0.962; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Pediococcus
acidilactici 7_4 GN=dnaA PE=3 SV=1; EVALUE=2e-34;
Q/S_length_ratio=0.715;
BLAST_uniprot:hit_9; ACCESSION=tr|E0NG95|E0NG95_PEDAC;
ALN/Q_length_ratio=0.962; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Pediococcus
acidilactici DSM 20284 GN=dnaA PE=3 SV=1; EVALUE=2e-34;
Q/S_length_ratio=0.715;
BLAST_uniprot:hit_11; ACCESSION=tr|N1ZMJ3|N1ZMJ3_9LACO;
ALN/Q_length_ratio=1.041; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Lactobacillus
murinus ASF361 GN=dnaA PE=3 SV=1; EVALUE=2e-34;
Q/S_length_ratio=0.712;
BLAST_uniprot:hit_6; ACCESSION=tr|G6IMT3|G6IMT3_PEDAC;
ALN/Q_length_ratio=0.962; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Pediococcus
acidilactici MA18/5M GN=dnaA PE=3 SV=1; EVALUE=2e-34;
Q/S_length_ratio=0.715;
BLAST_uniprot:hit_12; ACCESSION=tr|S0KIZ7|S0KIZ7_9ENTE;
ALN/Q_length_ratio=0.981; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Enterococcus
columbae DSM 7374 = ATCC 51263 GN=dnaA PE=3 SV=1;
EVALUE=3e-34; Q/S_length_ratio=0.712;
BLAST_uniprot:hit_14; ACCESSION=tr|K5F1L6|K5F1L6_9LACO;
ALN/Q_length_ratio=0.959; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Lactobacillus florum
2F GN=dnaA PE=3 SV=1; EVALUE=3e-34;
Q/S_length_ratio=0.723;
BLAST_uniprot:hit_13; ACCESSION=tr|U4TR20|U4TR20_9LACO;
ALN/Q_length_ratio=0.978; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Lactobacillus
shenzhenensis LY-73 GN=dnaA PE=3 SV=1; EVALUE=3e-34;
Q/S_length_ratio=0.697;
BLAST_uniprot:hit_15; ACCESSION=tr|W9EEM9|W9EEM9_9LACO;
ALN/Q_length_ratio=0.959; DESCRIPTION=Chromosomal
replication initiator protein DnaA OS=Lactobacillus florum
8D GN=dnaA PE=3 SV=1; EVALUE=4e-34;
Q/S_length_ratio=0.723;
Pfam_scan:hit_1 (2..203); Pfam:PF00308.13:Bac_DnaA;
Pfam_type:Family;HMM_aln_Length:200; HMM_Length:219;
EVALUE:1.6e-17; BITSCORE: 63.8;
Pfam_scan:hit_2 (229..297); Pfam:PF08299.6:Bac_DnaA_C;
Pfam_type:Domain;HMM_aln_Length:68; HMM_Length:70;
EVALUE:2e-24; BITSCORE: 85.2;
GO_domain:GO:0003677;
GO_domain:GO:0043167;
GO_domain:GO:0009058;
GO_domain:GO:0006259;
GO_domain:GO:0005575"
/codon_start=1
/transl_table=11
/product="chromosomal replication initiator protein :
Chromosomal replication initiator protein dnaA (Fragment)
OS=Micromonospora sp. M42 GN=MCBG_03933 PE=3 SV=1:
Bac_DnaA: Bac_DnaA_C"
/protein_id="VTR90700.1"
/translation="MLPENRVAVRAVRSVYRSVVAGKRPGATPLVLHGPPGTGKSHLS
ASLAQRLSTSPNGVTVRVVSAGDVSRSPEESLTDDELTDCDLLALEDVQQLSEHKTDA
ACDLLDRRTARRRATVVTAHAGPSQLANLPHRLTSRLSAGLVVQLGPLTPASRRAILA
EVAIAKKVRLTDEALDWLSEQVTGGGVRATLGLLQNLAQVASAFPGPLTRADVQQTLA
ETGQPTSAPNDISRIVERVAAAFGVSEKELLGPSRLRSVLQSRQVAMYLARELMGLSL
PRLGAAFGRDHTTVLHACRKVETALTEDTELAKRVRDLRAVLA"
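
The CDS feature in the record above spans positions 1..960. As a quick consistency check, the following Python sketch computes how many amino acids such a region would encode. It assumes, and this is not stated in the record itself, that the final codon is a stop codon that is not translated. The function name is hypothetical.

```python
def expected_protein_length(start, end, includes_stop=True):
    """Number of amino acids encoded by a CDS spanning start..end (1-based, inclusive)."""
    n_nucleotides = end - start + 1
    if n_nucleotides % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    codons = n_nucleotides // 3
    # A terminal stop codon occupies one codon but adds no amino acid.
    return codons - 1 if includes_stop else codons


translation_length = expected_protein_length(1, 960)
print(translation_length)  # 319
```

The 960 nucleotides give 320 codons, and removing the assumed stop codon leaves 319 residues, which agrees with the length of the /translation sequence listed in the record.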

SEARCHING OF A PARTICULAR GENOME THROUGH NCBI

Procedure:

1. The NCBI homepage was opened.

2. The ‘Genome’ category was selected.

3. The organism “Helicobacter pylori” was entered in the query box.

4. The organism details page was opened.

5. The page was scrolled down to the dendrogram.

6. The view was changed through the Layout settings.

7. The zoom settings were adjusted to observe details such as nodes and leaves.

8. Each node was studied for its distance, strain details and the option to get further details.

BLAST WITH SINGLE SEQUENCE OF NUCLEOTIDE

Procedure:
1. The NCBI homepage was opened.

2. The Nucleotide database was chosen.

3. The query was entered in the search box.

4. From the hit list obtained, the sequence of interest was selected.

5. The ‘Text’ (default) view was observed.

6. The BLASTN option was selected.

7. A query box appeared on the next page.

8. The previously saved sequence was pasted into the box.

9. The BLAST option was clicked.

10. After a short time, the result appeared.
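
The submit-and-wait steps above have a programmatic counterpart in NCBI's BLAST URL API. The Python sketch below only assembles the request parameters and does not send anything. The helper names, the choice of the `nt` database, the short query string (taken from the example sequence earlier in this notebook) and the request ID are all assumptions made for illustration.

```python
from urllib.parse import urlencode

# Address of the NCBI BLAST service used by the URL API.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_submission(query, program="blastn", database="nt"):
    """Return the form-encoded body for submitting a search (CMD=Put)."""
    return urlencode({"CMD": "Put", "PROGRAM": program,
                      "DATABASE": database, "QUERY": query})


def build_blast_result_url(rid, format_type="Text"):
    """Return the URL for fetching the results of a finished search (CMD=Get)."""
    params = {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    return f"{BLAST_URL}?{urlencode(params)}"


body = build_blast_submission("ACGTCAAGA")
print(body)
# "EXAMPLERID" is a placeholder; a real request ID is returned by the server
# after the submission has been accepted.
print(build_blast_result_url("EXAMPLERID"))
```

The two helpers mirror steps 8 to 10: the first corresponds to pasting the sequence and clicking BLAST, the second to viewing the result once the search has finished.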



BLAST WITH DOUBLE SEQUENCE OF NUCLEOTIDE

Procedure:
1. The BLAST homepage was opened.

2. The query accession number (nucleotide) for BLAST analysis was selected from NCBI.

3. The query accession number was uploaded in the space provided.

4. The program was allowed to run.

5. The result was obtained.

6. The sequence alignment and the similarity score value for query were checked.
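
To give a feel for what a similarity score between two nucleotide sequences means, the following Python sketch computes the best local alignment score with the Smith-Waterman dynamic programming method. This is not the heuristic that BLAST itself runs; it is a simpler, exhaustive technique swapped in for illustration. The scoring values (match, mismatch, gap) and the test sequences are assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)   # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]              # first column is always zero
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Hypothetical sequences sharing the short region "ACG".
print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6
```

With the scores assumed here, the three shared bases contribute 2 points each, giving a local score of 6; identical sequences score twice their length.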

MULTIPLE SEQUENCE ALIGNMENT USING CLUSTAL OMEGA

Procedure:
1. The Clustal Omega home page was opened.

2. The query sequences for multiple sequence alignment, selected in FASTA format from Swiss-Prot
(UniProtKB), were uploaded in the space provided.

3. The program was allowed to run.

4. The result was obtained.

5. The multiple sequence alignment using Clustal Omega for query was checked.
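
When checking a multiple sequence alignment, fully conserved columns are marked with an asterisk in Clustal-style output. The Python sketch below reproduces only that one marker for an already aligned set of sequences; the symbols for partially conserved columns are deliberately left out. The function name and the three short aligned sequences are hypothetical.

```python
def conservation_line(aligned):
    """Return a string with '*' under columns where all sequences share one residue."""
    lengths = {len(seq) for seq in aligned}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # A column counts as conserved only if it holds one residue and no gap.
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


# Hypothetical aligned sequences; '-' denotes a gap.
alignment = ["MKWVTF-SLL", "MKWVTFISLL", "MRWVTF-SLL"]
for seq in alignment:
    print(seq)
print(conservation_line(alignment))
```

Columns 2 and 7 are left unmarked because one sequence differs in the first case and two sequences carry a gap in the second.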