Lecture2-DataMining for Bioinformatics

Data Mining for Bioinformatics
Dr. Y. V. Lokeswari
Associate Professor
SSN College of Engineering
Data Mining in Bioinformatics
• Data mining in bioinformatics means extracting valuable information from large volumes of
biological data that cannot be interpreted by inspection. It is a process that leads to knowledge discovery.
• It applies a range of techniques and algorithms to gain knowledge from biological sequence,
structure, and microarray data.
• Biomedical Data Analysis
• Major nucleotide sequence databases, protein sequence databases, and gene expression databases
• A DNA sequence is a string over four nucleotide bases, adenine (A), cytosine (C), guanine (G) and
thymine (T), and carries the genetic information of the organism.
• A protein sequence is a chain of amino acids drawn from 20 standard types, translated from the
coding region of a DNA sequence.
• Gene expression data measure the expression level of a particular gene (upregulated,
downregulated, or not expressed) under specific conditions in a cell.
Data mining = extracting valuable information from large amounts of raw biological
data (sequences, structures and microarrays) -----> leads to knowledge discovery.
Uses different techniques and algorithms.
DNA = a sequence over the alphabet A, G, C, T.
Coding regions of DNA specify the amino acids of a protein.
Protein sequence = a chain built from the 20 amino acids.
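The relationship between a DNA coding region and its protein sequence can be sketched in a few lines of Python. This is a minimal illustration, not a full gene model: it assumes the standard genetic code, reads the sequence in a single frame from its first base, and stops at the first stop codon. The names `is_valid_dna` and `translate` are hypothetical helpers introduced here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order at each
# position ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_valid_dna(sequence):
    """Return True if the sequence is non-empty and uses only A, C, G, T."""
    return bool(sequence) and set(sequence.upper()) <= set(BASES)


def translate(coding_sequence):
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Assumption: the reading frame starts at the first base. Translation
    stops at the first stop codon; trailing bases that do not fill a
    complete codon are ignored.
    """
    sequence = coding_sequence.upper()
    if not is_valid_dna(sequence):
        raise ValueError("sequence must contain only A, C, G and T")
    protein = []
    for start in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE[sequence[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))  # MAK
```

Here ATG, GCC and AAG code for methionine (M), alanine (A) and lysine (K), and TAA is a stop codon.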
Data Mining in Bioinformatics
• The three major DNA sequence databases are:
• EMBL (http://www.ebi.ac.uk/embl/index.html): maintained by the European Bioinformatics Institute (EBI), an
outstation of the European Molecular Biology Laboratory (EMBL).
• GenBank (http://www.ncbi.nlm.nih.gov/Genbank/): maintained by the
National Center for Biotechnology Information (NCBI).
• DDBJ (http://www.ddbj.nig.ac.jp/Welcome-e.html): the DNA Data Bank of Japan, at the National
Institute of Genetics (NIG) in Japan.
• The three databases have collaborated to form the International Nucleotide Sequence
Database Collaboration (http://www.ncbi.nlm.nih.gov/projects/collab/).
• The three major databases for protein sequence are:
• Swiss-Prot (http://www.ebi.ac.uk/swissprot/index.html), from the Swiss Institute of Bioinformatics (SIB).
• TrEMBL (http://www.ebi.ac.uk/trembl/index.html). The TrEMBL database, maintained by EBI,
contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence
Database.
• PIR (http://pir.georgetown.edu/pirwww/). The Protein Information Resource (PIR), located at
Georgetown University Medical Center, is an integrated public bioinformatics resource that supports
genomic and proteomic research and scientific studies.
Data Mining in Bioinformatics
• The Microarray Gene Expression Data (MGED) Society (http://www.mged.org/index.html) is an
international organization of biologists, computer scientists, and data analysts that aims to facilitate
the sharing of microarray data generated by functional genomics and proteomics experiments.
• ArrayExpress at the EBI (http://www.ebi.ac.uk/arrayexpress/index.html) is a public repository
for microarray data.
• The Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) at NCBI is a gene expression
and hybridization array data repository.
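As a rough illustration of how expression measurements like those held in these repositories relate to the upregulated / downregulated / not expressed labels, the sketch below compares one gene's intensity between a treated and a control sample. Everything specific here is an assumption made for the example: the function name `classify_expression`, the two-fold cutoff (log2 ratio of 1.0), the detection floor of 10 intensity units, and the pseudocount of 1. Real analyses use normalization and statistical testing that this sketch omits.

```python
from math import log2


def classify_expression(treated, control, fold_threshold=1.0, detection_floor=10.0):
    """Label a gene by comparing its intensity in two conditions.

    All thresholds are illustrative assumptions, not values from the lecture:
    - detection_floor: below this in both samples, call the gene not expressed
    - fold_threshold: minimum absolute log2 ratio to call a change
    """
    if treated < detection_floor and control < detection_floor:
        return "not expressed"
    # A pseudocount of 1 avoids division by zero for undetected signals.
    log_ratio = log2((treated + 1.0) / (control + 1.0))
    if log_ratio >= fold_threshold:
        return "upregulated"
    if log_ratio <= -fold_threshold:
        return "downregulated"
    return "unchanged"


for treated, control in [(400, 100), (25, 100), (110, 100), (5, 3)]:
    print(treated, control, classify_expression(treated, control))
```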
Data Mining in Bioinformatics
• Software Tools for Bioinformatics Research
• The software tools that facilitate research in bioinformatics can be broadly categorized into four
classes:
• (1) data retrieval tools, (2) sequence comparison and alignment tools, (3) pattern discovery tools,
and (4) visualization tools
• A major tool for data retrieval is Entrez. Others are DBGET/LinkDB and SRS (Sequence Retrieval System).
• Entrez is a data retrieval system developed by NCBI that provides integrated access to a
wide range of data domains, including literature, nucleotide and protein sequences, complete
genomes, 3D structures, and more.
• One can use Entrez to:
• Identify a representative, well-annotated mRNA sequence record from the millions of sequences
in the Entrez Nucleotide data domain.
• Retrieve associated literature and protein records.
• Identify conserved domains within the protein.
• Identify known mutations within the gene or protein.
• Find a resolved three-dimensional structure for the protein, or, in its absence, identify structures
with homologous sequence.
• View the genomic context of the gene and download the sequence region.
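Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds an ESearch request URL and does not send it. The helper name `build_esearch_url`, the example gene, and the choice of JSON output are assumptions made for illustration; `nuccore` is the E-utilities name for the Nucleotide database.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(database, term, max_results=5):
    """Build an ESearch URL that would return record IDs matching `term`.

    `database` is an Entrez database name such as "nuccore" (nucleotide),
    "protein" or "pubmed". The request itself is not sent here.
    """
    params = {
        "db": database,
        "term": term,
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Hypothetical query: nucleotide records for the human BRCA1 gene.
url = build_esearch_url("nuccore", "BRCA1[Gene Name] AND Homo sapiens[Organism]")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the IDs it returns can then be passed to other E-utilities calls to retrieve the records themselves.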
Data Mining in Bioinformatics
• Sequence comparison and alignment tools include:
• BLAST (Basic Local Alignment Search Tool, available at http://www.ncbi.nlm.nih.gov/BLAST/)
• BLAST is used for comparing gene and protein sequences against others in public databases.
• FASTA (FAST-All, available at http://www.ebi.ac.uk/fasta33/)
• FASTA can be used for a fast protein comparison or a fast nucleotide comparison.
• For multiple sequence alignment, the available tools are ClustalW and Clustal Omega.
• Refer to https://www.youtube.com/watch?v=LokO-iFJdqc
• ClustalW can be used to align DNA or protein sequences in order to elucidate their relationships
as well as their evolutionary origin.
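To make the idea of local alignment concrete, here is a minimal sketch of the Smith-Waterman dynamic programming algorithm, which computes the best local alignment score between two sequences exactly. This is not how BLAST or FASTA are implemented: both use faster heuristics to approximate this kind of search over large databases. The scoring values (match +2, mismatch -1, gap -2) and the function name `local_alignment_score` are assumptions chosen for the example.

```python
def local_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score.

    Only the previous and current rows of the scoring matrix are kept,
    so memory use is proportional to len(seq_b).
    """
    previous = [0] * (len(seq_b) + 1)
    best = 0
    for char_a in seq_a:
        current = [0]
        for j, char_b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            # A cell never drops below 0: a local alignment may restart anywhere.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# The shared region ACGT gives 4 matches x 2 = 8.
print(local_alignment_score("TTACGTTT", "GGACGTGG"))  # 8
```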
• Pattern discovery tools are used to search for patterns or features in the data.
• An important pattern discovery technique is cluster analysis.
• It finds groupings in a given dataset such that objects in the same group are similar to each
other while objects in different groups are dissimilar.
• Cluster analysis has been used extensively in gene expression data analysis (see
http://rana.lbl.gov/EisenSoftware.htm).
• Two useful integrated tools for pattern discovery are:
• Expression Profiler (http://ep.ebi.ac.uk/EP/)
• GeneQuiz (available at http://jura.ebi.ac.uk:8765/ext-genequiz/)
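Cluster analysis of expression data can be sketched with a small agglomerative (bottom-up) clustering routine. The version below is an illustrative choice, not the method of any specific tool named above: it assumes average linkage and a distance of 1 minus the Pearson correlation between expression profiles. The gene names and values are made-up toy data, and `pearson_distance` and `average_linkage` are hypothetical helper names.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - Pearson correlation of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 1.0  # a flat profile is treated as uncorrelated
    return 1.0 - cov / sqrt(var_x * var_y)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def cluster_distance(c1, c2):
        # Average pairwise distance between members of the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(pearson_distance(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy expression profiles across four conditions (made-up values).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.2, 1.8, 1.1],
}
print(average_linkage(profiles, 2))
```

With this toy data the rising profiles (geneA, geneB) end up in one group and the falling profiles (geneC, geneD) in the other, matching the definition of clustering given above: similar objects together, dissimilar objects apart.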
Data Mining in Bioinformatics
• Visualization tools allow an interactive, graphical display of genomic data.
• Most major genome analysis packages, such as Expression Profiler and GeneQuiz, have
a visualization tool integrated in them.
• Visualization tools available for bioinformatics data are:
• TreeView (available at http://rana.lbl.gov/EisenSoftware.htm)
• BioViews
• Genes_Graph
• Protein Explorer (available at http://www.proteinexplorer.org)
