Labmanual CS 1
For Beginners
LABORATORY MANUAL
Prepared Under
DBT STAR COLLEGE SCHEME
Department of Computer Science
PSGR KRISHNAMMAL COLLEGE FOR WOMEN
College of Excellence
An Autonomous Institution - Affiliated to Bharathiar University
Reaccredited with ‘A’ Grade by NAAC
An ISO 9001:2015 Certified Institution
Peelamedu, Coimbatore – 641 004
Published by
Language : English
Year : 2018
Copyright Warning : No part of this book may be reproduced or transmitted
in any form or by any means electronic or mechanical including photocopying
or recording or by any information storage and retrieval system without
permission in writing from PSGR Krishnammal College for Women, Coimbatore,
Tamil Nadu, India.
Aim:
To explore the site map of NCBI and PUBMED and to study the resources
available on NCBI and PUBMED.
NCBI
The NCBI has developed many useful resources and tools, several of which
are described throughout this book. Of particular relevance to genome mapping is
the Genomes Division of Entrez. Entrez provides integrated access to different types
of data for over 600 organisms, including nucleotide sequences, protein sequences
with structures, PubMed, MEDLINE and genomic mapping information. The NCBI
Human Genome Map Viewer is a new tool that presents a graphical view of the
available human genome sequence data as well as cytogenetic, genetic, physical,
and radiation hybrid maps. The Map Viewer provides human genome sequence for
finished contigs, BAC tiling path of finished and draft sequence, location of genes,
STSs, and SNPs on finished and draft sequences; it is a useful tool for integrating
maps and sequence.
There are many other tools and databases at NCBI that are useful for gene
mapping projects including BLAST, GeneMap’99, LocusLink, OMIM , dbSTS, dbSNP,
dbEST, and UniGene databases. BLAST can be used to search DNA sequences
for the presence of markers and to confirm and refine map localizations. LocusLink
(Pruitt et al., 2000) presents information on official nomenclature, aliases, sequence
accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology,
map locations, and related Web sites. The dbSTS and dbEST databases themselves
play a lesser role in human and mouse gene mapping endeavors as their relevant
information has already been captured by other more detailed resources (LocusLink,
GeneMap’99, UniGene, MGD, and eGenome) but are currently the primary source of
genomic information for other organisms.
PUBMED:
PubMed contains all of MEDLINE, as well as citations provided directly by the
publishers. As such, PubMed contains more recent articles than MEDLINE, as well
as articles that may never appear in MEDLINE because of their subject matter. This
development led NCBI to introduce a new article identifier, called PubMed identifier
(PMID). Articles appearing in MEDLINE will have both a PMID and an MUID. Articles
appearing only in PubMed will have only a PMID.
Result :
The NCBI and PUBMED websites are explored and the resources are studied.
Aim:
To retrieve a nucleotide sequence of interest from a GenBank entry with a
specific accession number.
GenBank
GenBank® is the NIH genetic sequence database, an annotated collection of all
publicly available DNA sequences. GenBank is part of the International Nucleotide
Sequence Database Collaboration, which comprises the DNA Data Bank of Japan
(DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI.
These three organizations exchange data on a daily basis.
A GenBank release occurs every two months and is available from the FTP
site. The release notes for the current version of GenBank provide detailed
information about the release and notifications of upcoming changes to GenBank. Release
notes for previous GenBank releases are also available. GenBank growth statistics
for both the traditional GenBank divisions and the WGS division are available from
each release.
An annotated sample GenBank record for a Saccharomyces cerevisiae gene
demonstrates many of the features of the GenBank flat file format.
Access to GenBank
There are several ways to search and retrieve data from GenBank.
Search GenBank for sequence identifiers and annotations with Entrez
Nucleotide, which is divided into three divisions: CoreNucleotide (the main
collection), dbEST (Expressed Sequence Tags), and dbGSS (Genome Survey Sequences).
Search and align GenBank sequences to a query sequence using BLAST (Basic
Local Alignment Search Tool). BLAST searches CoreNucleotide, dbEST, and
dbGSS independently.
Search, link, and download sequences programmatically using NCBI e-utilities.
The ASN.1 and flat file formats are available at NCBI’s anonymous FTP
server: ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1 and
ftp://ftp.ncbi.nlm.nih.gov/genbank.
Some submitters may claim patent, copyright, or other
intellectual property rights in all or a portion of the data they have submitted. NCBI
is not in a position to assess the validity of such claims and therefore cannot provide
comment or unrestricted permission concerning the use, copying or distribution of
the information contained in GenBank.
Procedure:
1. Open the web browser and type the database address
http://www.ncbi.nlm.nih.gov/genbank/ in the address bar.
2. Type the query (i.e., the strain name for which the nucleotide sequence has to
be retrieved) in the search bar.
3. Select the particular sequence which is required.
4. Select FASTA format in the display settings and click apply.
5. Copy the sequence data and paste it in the notepad.
6. Save the notepad file in FASTA format.
7. Report the result.
Observations:
Result:
Thus the nucleotide sequence (Accession Number: E01306.1) was retrieved
from GenBank Database.
Aim:
To retrieve the nucleotide sequence of interest from the National Center for
Biotechnology Information (NCBI) database.
NCBI Database:
The National Center for Biotechnology Information (NCBI) advances science
and health by providing access to biomedical and genomic information. NCBI
creates public databases, conducts research in computational biology, develops
software tools for analyzing genome data, and disseminates biomedical information
for the better understanding of diseases and molecular processes affecting human
health. Its databases group biomedical literature, small molecules, and sequence
data in terms of biological relationships. They help in retrieving DNA and peptide
sequences, abstracts of scientific articles, and structural coordinates to visualize
the 3D structure of resolved molecules.
Procedure:
1. Open the web browser and type the database address www.ncbi.nlm.nih.gov
in the address bar.
2. Type the query, AF375082 (i.e., the accession number of a gene for which the
nucleotide sequence has to be retrieved) in the search bar.
3. Select the particular sequence which is required.
4. Select FASTA format in the display settings and click apply.
5. Copy the sequence data and paste it in the notepad.
6. Save the notepad file in FASTA format.
7. Report the result.
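Note: The same retrieval can be sketched programmatically with the NCBI E-utilities "efetch" service. The following is a minimal sketch, not the manual's prescribed method; the function names, the choice of the "nuccore" database and the output file name are assumptions made for illustration. Only the first function is needed to see which address is requested; the other two require an internet connection.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, db="nuccore"):
    """Build the efetch address that returns one record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession, db="nuccore"):
    """Download the FASTA record (needs network access)."""
    with urlopen(efetch_fasta_url(accession, db)) as response:
        return response.read().decode("utf-8")


def save_fasta(accession, path):
    """Fetch a record and save it to a text file, like steps 5 and 6."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(fetch_fasta(accession))


if __name__ == "__main__":
    # Accession number taken from step 2 of the procedure.
    print(efetch_fasta_url("AF375082"))
```

Calling save_fasta("AF375082", "AF375082.fasta") would write the record to a file (the file name is hypothetical).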
Observations:
Result:
Thus the nucleotide sequence of AF375082 was retrieved from NCBI Database.
Aim:
To analyze the NCBI web site and find the official gene symbol, its alias name,
chromosome number and its ID.
Procedure:
1. Open the web browser and type the database address
http://www.ncbi.nlm.nih.gov/gene/ in the address bar.
2. Type the query (i.e., the name of the gene for which the details have to
be retrieved) in the search bar.
3. Select the entry for which the official gene symbol, alias name,
chromosome number and ID are required.
4. Select FASTA format in the display settings and click apply.
5. Download the file in FASTA format.
6. Report the result.
Fig.13. Entries about a gene (BDNF – Brain Derived Neurotrophic Factor) available
in “Gene” data bank
Output:
Kingdom: Eukaryota
Subgroup: Mammals
Sequence data: genome assemblies: 43
Haploid chromosomes: 48
Result:
The official gene symbol, alias name, chromosome number and ID have been
retrieved from the NCBI.
Aim:
To analyse and retrieve the protein sequence of a protein from the Protein Data
Bank (PDB) database.
Procedure:
1. Open the web browser and type the database address www.pdb.org in the
address bar.
2. Type the query (i.e., the protein name for which the amino acid sequence has
to be retrieved) in the search bar of the home page.
3. Select the particular sequence which is required.
4. Download the file in FASTA format from the option “Download Files”.
5. Report the result.
Observation:
Result:
Thus the required protein sequence is retrieved from the PDB database.
Aim:
To retrieve the structure of a protein from Protein Data Bank (PDB) database.
Procedure:
1. Open the web browser and type the database address www.pdb.org in the
address bar.
2. Type the query (the protein name for which structure has to be retrieved) in
the search bar of the home page.
3. Select the protein of interest.
4. Download the file in PDB format from the option “Download Files”.
5. Report the result.
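Note: The download step can also be sketched in code. The following is a minimal sketch which assumes that the RCSB file server address https://files.rcsb.org/download/ serves entries as "<ID>.pdb"; the function names and the target folder are illustrative assumptions. The entry 2LOK is the one shown in the Observation of this exercise.

```python
import os
from urllib.request import urlretrieve


def pdb_file_url(pdb_id):
    """Return the download address of a PDB-format file for a 4-character ID."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError("a classic PDB ID has exactly 4 letters or digits")
    return f"https://files.rcsb.org/download/{pdb_id}.pdb"


def download_pdb(pdb_id, folder="."):
    """Save the PDB file into `folder` (needs network access)."""
    target = os.path.join(folder, pdb_id.strip().upper() + ".pdb")
    urlretrieve(pdb_file_url(pdb_id), target)
    return target


if __name__ == "__main__":
    print(pdb_file_url("2lok"))
```

The saved file can then be opened in a molecular viewer such as RasMol.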
Observation:
>2LOK:A|PDBID|CHAIN|SEQUENCE
MSPIPLPVTDTDDAWRARIAAHRADKDEFLAT
HDQSPIPPADRGAFDGLRYFDIDASFRVAARYQPARDPEAVELETTRGPPAEYTRAAVLG
FDLGDSHHTLTAFRVEGESSLFVPFTDETTDDGRTYEHGRYLDVDPAGADGGDEVAL
DFNLAYNPFCAYGGSFSCALPPADNHVPAAITAGERVDADLEHHHHHH
Result:
The required primary protein structure is retrieved from the Protein Database
in PDB format.
Aim:
To compute the physical and chemical parameters of the given protein.
ProtParam:
ProtParam (Protein Parameters) is a tool which allows the computation of
various physical and chemical parameters for a given protein stored in Swiss-Prot or
TrEMBL or for a user entered sequence. The computed parameters include the
molecular weight, theoretical pI, amino acid composition, atomic composition,
extinction coefficient, estimated half-life, instability index, aliphatic index and
grand average of hydropathicity.
Procedure:
1. Open the web browser and type the database address www.expasy.org/tools
in the address bar.
2. Select ProtParam from the list of displayed tools.
3. Paste the query sequence in the box given.
4. Click the option “Compute Parameters”.
5. Report the result.
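Note: Three of the parameters listed above can be reproduced by hand, which helps in checking the ProtParam output. The following is a minimal sketch, not ProtParam's own code: it assumes the Kyte-Doolittle hydropathy scale for the grand average of hydropathicity (GRAVY) and the usual aliphatic index formula, X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)), where X is the mole percent of each residue. The function names are illustrative.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def clean(sequence):
    """Remove whitespace, upper-case, and reject non-standard residues."""
    seq = "".join(sequence.split()).upper()
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if not seq or unknown:
        raise ValueError(f"empty sequence or unknown residues: {sorted(unknown)}")
    return seq


def composition(sequence):
    """Amino acid composition as mole percent per residue type."""
    seq = clean(sequence)
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in sorted(counts.items())}


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    seq = clean(sequence)
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(sequence):
    """Relative volume occupied by aliphatic side chains (Ala, Val, Ile, Leu)."""
    pct = composition(sequence)
    return (pct.get("A", 0.0)
            + 2.9 * pct.get("V", 0.0)
            + 3.9 * (pct.get("I", 0.0) + pct.get("L", 0.0)))
```

Molecular weight, pI and the instability index need further tables of constants and are left to the tool.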
Observations:
Fig.18. Amino acid content and other basic information of a given protein in
ProtParam
Fig.19. Atomic content and a few other physicochemical features of a given protein in
ProtParam
Theoretical pI 8.38
Result:
Thus, protein parameters were computed using ProtParam tool.
Aim:
To retrieve the secondary structure of a protein from Protein Data Bank (PDB)
database.
Procedure:
1. Open the web browser and type the database address www.pdb.org in the
address bar.
2. Type the query (i.e., the protein name for which structure has to be retrieved)
in the search bar of the home page.
3. Select the protein of interest.
4. Click on the required part of the protein
5. Report the result.
Observation:
Result:
The required secondary protein structure is retrieved from the Protein Data
Bank in PDB format.
Aim:
To view the tertiary structure and analyze the protein using RasMol.
RasMol:
RasMol is a computer program written for molecular graphics visualization,
intended and used primarily for the depiction and exploration of biological
macromolecule structures, such as those found in the Protein Data Bank. It was
originally developed by Roger Sayle in the early 1990s.
Procedure:
1. Open the RasMol application (an offline tool).
2. Load the PDB file of the desired protein.
3. The structure will be displayed.
4. View the different forms of the structure using the display option.
5. Export the image in GIF format using the export option.
Observation:
Result:
Thus, the tertiary structure of the protein is viewed in RasMol.
Aim:
To perform pairwise and multiple sequence alignment using ClustalW for the
given sequences.
Procedure:
1. Open the web browser and type www.ebi.ac.uk/Tools/msa/clustalw.
2. Upload the sequences from the Notepad or paste the sequences in FASTA
format.
3. Upload two sequences for pairwise alignment or more than two sequences for
multiple sequence alignment. After uploading, choose the “Execute Multiple
Alignment” option in the alignment icon.
4. The sequence alignment results will appear within a few seconds after
execution.
5. Report the result.
Observation:
Result:
Thus the given sequences are aligned using ClustalW.
Aim:
To perform pairwise and multiple sequence alignment using the BLAST tool.
Sequence Similarity:
Database Similarity Searches have become a mainstay of Bioinformatics.
Sequence database searches can also be remarkably useful for finding the function
of genes whose sequences have been determined in the laboratory. The sequence of
the gene of interest is compared to every sequence in a sequence database, and the
similar ones are identified. Alignments with the best-matching sequences are shown
and scored. If a query sequence can be readily aligned to a database sequence of
known function, structure, or biochemical activity, the query sequence is predicted
to have the same function, structure, or biochemical activity.
Pairwise sequence alignment methods are concerned with finding the best-
matching piecewise local or global alignments of protein (amino acid) or DNA (nucleic
acid) sequences. Multiple alignment is an extension of pairwise alignment to
incorporate more than two sequences into an alignment. Multiple alignment methods try
to align all of the sequences in a specified set.
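Note: The idea of a best-matching global pairwise alignment can be illustrated with the Needleman-Wunsch dynamic programming method. The following is a minimal sketch; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, and the web tools in this manual use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: a prefix aligned against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

A local alignment (as used by BLAST) differs mainly in that cell scores are not allowed to fall below zero and the traceback starts from the highest-scoring cell.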
BLAST
BLAST is an acronym for Basic Local Alignment Search Tool. BLAST is a
sophisticated software package that has become one of the most widely used tools
in the field of Bioinformatics. The following BLAST programs are chosen
according to the aim of the user.
Blast programs:
BLASTp – Compares an amino acid query sequence against a protein sequence
database.
BLASTn – Compares a nucleotide query sequence against a nucleotide sequence
database.
BLASTx – Compares a nucleotide query sequence translated in all reading
frames against a protein sequence database.
tBLASTn – Compares a protein query sequence against a nucleotide sequence
database dynamically translated in all reading frames.
tBLASTx – Compares the six frame translation of a nucleotide query sequence
against the 6 frame translation of a nucleotide sequence database.
Megablast – For highly similar sequences.
Modes of BLAST:
1. PHI BLAST
2. PSI BLAST
1. PHI BLAST
Pattern-Hit Initiated BLAST (PHI-BLAST) searches protein sequences using a
combination of pattern matching and local alignment to reduce the probability of
false positives. The pattern is specified as a regular expression. Given a query
sequence and a pattern that occurs within it, the program finds all other sequences
in the database that contain the pattern and are similar to the query in the
neighbourhood of the pattern.
2. PSI BLAST
Position-Specific Iterated BLAST (PSI-BLAST) refers to a feature of BLAST 2.0 in
which a profile, or position-specific scoring matrix (PSSM), is automatically built
from a multiple alignment of the highest scoring hits in an initial BLAST search.
The PSSM is generated by calculating position-specific scores for each position in
the alignment. Highly conserved positions receive high scores and weakly conserved
positions receive scores near zero.
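Note: How position-specific scores arise can be illustrated with a small calculation. The following is a deliberately simplified sketch and not the PSI-BLAST algorithm itself: it assumes a uniform background frequency of 1/20 for every amino acid, a fixed pseudocount of 1, and no sequence weighting, whereas PSI-BLAST uses observed background frequencies, weighted sequences and data-dependent pseudocounts. The function name and the toy alignment are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simple_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log2-odds score} dictionary per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for col in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(frequency / background)
        pssm.append(column)
    return pssm


if __name__ == "__main__":
    toy_alignment = ["MKV", "MRV", "MQV", "MEV"]  # hypothetical hits
    for position, column in enumerate(simple_pssm(toy_alignment), start=1):
        best = max(column, key=column.get)
        print(position, best, round(column[best], 2))
```

In the toy alignment the first and third columns are fully conserved and score highest, while the variable second column scores closer to zero.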
Procedure:
1. Open the web browser and type http://blast.ncbi.nlm.nih.gov/Blast.cgi
2. Click either the nucleotide BLAST or the protein BLAST icon according to the
requirement.
3. Select the “Align two or more sequences” check box for multiple sequence
alignment, or deselect it for pairwise alignment.
4. Upload or paste a query sequence (in FASTA format) in the query box and
execute BLAST for pairwise alignment. This identifies the most similar
sequences in the databank.
5. Upload or paste a query sequence (in FASTA format) in the query box,
upload more than one sequence (in FASTA format) in the subject box, and
then execute BLAST for multiple sequence alignment. This identifies
the similarity/dissimilarity among the sequences.
6. Report the result.
Observation :
Fig. 26. Pair-wise sequence alignment using BLAST identified similar sequences
from databanks
Fig. 27. Multiple sequence alignment using BLAST identified similarity between
query sequence and other sequences
Result:
Thus the given sequences are aligned using BLAST.
Aim:
To align two different sequences using Fold and Function Assignment System
(FFAS).
Procedure:
1. Open the web browser and type www.ffas.ycrf.edu in the address bar.
2. Open the “Pair-wise Alignment” link in the home page of FFAS.
3. Paste the queries in FASTA format in two different dialogue boxes.
4. Report the result.
Observation:
Fig.29. Alignment of two retrieved sequences using PAM matrix in FFAS tool
Sequences aligned:
% Identity : 100%
% Gap : 0%
Expected Value : -0.952E+02
Result:
Thus the two sequences are aligned pair-wise using Fold and Function
Assignment System.
Aim:
To align two sequences and find the BLOSUM scoring matrix.
Procedure:
1. Open the browser and type www.ebi.ac.uk/Tools/msa/clustalw2.
2. Upload the sequences from the Notepad or paste the sequences in FASTA
format.
3. Choose the “Full Alignment” option in the alignment icon.
4. Choose BLOSUM Matrix from the matrix option.
5. Multiple sequence alignment is shown.
6. Report the result.
Observation:
Result:
Thus the given sequences are aligned using ClustalW with the BLOSUM
matrix.
Aim:
To search for sequences similar to a given query using the Basic Local Alignment
Search Tool (BLAST).
BLAST:
BLAST finds regions of local similarity between sequences. The program
compares nucleotide or protein sequences to sequence databases and calculates the
statistical significance of matches. BLAST can be used to infer functional and
evolutionary relationships between sequences and helps to identify the members of
gene families.
Procedure:
1. Open the web browser and type the database address
www.ncbi.nlm.nih.gov/blast/ in the address bar.
2. Upload or paste the sequence for which similar sequences have to be
retrieved in the dialogue box, in FASTA format.
3. Click the “Begin Search” option so that the page containing the organism’s ID
and other details is displayed.
4. Click “View Reports” and the sequences similar to the given sequence of
interest are displayed.
5. Report the sequences which have maximum identity and similarity.
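Note: The % identity and % gap values reported for a hit can be checked by hand from the two aligned rows. The following is a minimal sketch which assumes that both values are expressed relative to the alignment length (columns including gaps), with "-" as the gap character; the function name is illustrative.

```python
def alignment_stats(top, bottom, gap_char="-"):
    """Return percent identity and percent gaps for two aligned rows."""
    if not top or len(top) != len(bottom):
        raise ValueError("aligned rows must be non-empty and of equal length")

    # A column is identical when both rows show the same residue (not a gap).
    identical = sum(1 for x, y in zip(top, bottom)
                    if x == y and x != gap_char)
    # A column is a gap column when either row shows the gap character.
    gaps = sum(1 for x, y in zip(top, bottom)
               if x == gap_char or y == gap_char)

    length = len(top)
    return {
        "length": length,
        "identity": 100.0 * identical / length,
        "gaps": 100.0 * gaps / length,
    }


if __name__ == "__main__":
    print(alignment_stats("ACGT", "A-GT"))
```

For the rows ACGT and A-GT, three of the four columns are identical and one contains a gap.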
Observation:
Query ID : lcl|3351
Organism : Bacillus subtilis
Nucleotide : 16S ribosomal RNA
Query length : 300 bp
Result:
The similar sequences were retrieved and the % gap, % identity and the expected
value for the sequences were noted.
Aim:
To convert the given gene sequence into its corresponding amino acid sequence.
Procedure:
1. Open the web browser and type the database address www.expasy.org/tools
in the address bar.
2. Select TRANSLATE from the list of displayed tools.
3. Paste the query sequence in the given box.
4. Click TRANSLATE option.
5. Report the result.
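Note: The conversion performed in this exercise can be sketched with the standard genetic code. The following is a minimal sketch under stated assumptions: it uses only the standard codon table, writes stop codons as "*" and unreadable codons as "X", and the function names and frame labels are illustrative rather than those of the Translate tool.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def normalise(sequence):
    """Remove whitespace, upper-case, and read RNA (U) as DNA (T)."""
    return "".join(sequence.split()).upper().replace("U", "T")


def reverse_complement(dna):
    """Return the reverse complement strand of a DNA sequence."""
    return normalise(dna).translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame=0):
    """Translate one reading frame (0, 1 or 2); '*' marks a stop codon."""
    dna = normalise(dna)
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Translate the three forward and three reverse reading frames."""
    forward, reverse = normalise(dna), reverse_complement(dna)
    frames = {f"+{f + 1}": translate(forward, f) for f in range(3)}
    frames.update({f"-{f + 1}": translate(reverse, f) for f in range(3)})
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```

Reporting all six frames mirrors the way translated searches such as BLASTx and tBLASTx consider every reading frame of a nucleotide sequence.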
Observation:
Result:
Thus the given gene sequence is converted into its corresponding protein
sequence.
EXPERT REVIEW
Merits
1. This book is very helpful for beginners as they can gain
knowledge in basic tools and databases of bioinformatics
like BLAST, RASMOL, ClustalW, NCBI etc.
2. Students can gain knowledge in Pair-wise sequence
and Multiple Sequence alignment by the above mentioned tools.
3. The content was neatly planned and well organized,
which helps the student to know the basics of bioinformatics
and enrich their knowledge in Structure Analysis of Protein
and Sequence alignment using Bioinformatics tools.
Dr S. Meenatchi Sundaram
Associate Professor & Director (Research),
Department of Microbiology,
Nehru Arts and Science,
Coimbatore - 641 105