Big Data Challenges in Bioinformatics

The document discusses big data challenges in bioinformatics. Rapid growth in data from scientific instruments such as the Large Hadron Collider and the Large Synoptic Survey Telescope is driving a data deluge. Genome sequencing costs have plummeted, but computing challenges remain: analyzing large datasets can take weeks or months. Harnessing the volume, variety, and velocity of big data requires deriving value through data analytics and machine learning. Solving these computational challenges will involve distributed processing across many servers and devices, as well as improved data management and infrastructure.


Big Data Challenges in Bioinformatics

BARCELONA SUPERCOMPUTING CENTER
COMPUTER SCIENCE DEPARTMENT
Autonomic Systems and eBusiness Platforms

Jordi Torres
Jordi.Torres@bsc.es

Talk outline
We talk about Petabytes?

Deluge of Data

Data is now considered the Fourth Paradigm in Science
(the first three paradigms were experimental, theoretical and computational science).

This shift is being driven by the rapid growth in data from improvements in scientific instruments.

Scientific instruments

Physics: the Large Hadron Collider produced around 15 petabytes of data in 2012.

Astronomy: the Large Synoptic Survey Telescope is anticipated to produce around 10 petabytes per year.

Example: In Genome Research?


[Chart: Cost of sequencing a human-sized genome, September 2001 to January 2013, on a logarithmic scale running from $1,000 up to $100,000,000.]
Source: National Human Genome Research Institute (NHGRI), http://www.genome.gov/sequencingcosts/

Data Deluge: driven by changes in how big data is generated


Example: Biomedicine

Image source: Big Data in biomedicine, Drug Discovery Today. Fabricio F. Costa (2013)

Important open issues

Transfer of data from one location to another (*)
shipping external hard disks
processing the data while it is being transferred
Future? Data won't be moved!

(*) Out of scope of this presentation


Source: http://footage.shutterstock.com/clip-4721783-stock-footage-animation-presents-datatransfer-between-a-computer-and-a-cloud-a-concept-of-cloud-computing.html

Important open issues


Security and privacy of the data from individuals (*)
The same problems that appear in other areas
Use advanced encryption algorithms


(Out of scope of this presentation)


Source: http://www.tbase.com/corporate/privacy-and-security

Important open issues

Increased need to store data (*)


Cloud-based computing solutions have emerged

Source: http://www.custodia-documental.com/wp-content/uploads/Cloud-Big-Data.jpg

Important open issues

Increased need to store data (*)

Cloud-based computing solutions have emerged
The most common Cloud Computing inhibitors should be tackled:
Security, Privacy, Lack of Standards, Data Integrity, Regulatory, Data Recovery, Control, Vendor Maturity, ...

The most critical open issue

DERIVING VALUE VIA HARNESSING
VOLUME,
VARIETY AND
VELOCITY (*)

(*) Big Data definition?

The information by itself is not actionable knowledge.

Source: http://www.theatlantic.com/health/archive/2012/05/big-data-can-save-health-care-0151-but-at-what-cost-to-privacy/257621/
Source: cetemma - matar

What is the usefulness of Big Data?

Performs predictions of outcomes and behaviors.

[Diagram: Data + Volume, Information, Knowledge, Value]

Approach: Machine Learning works in the sense that these methods detect subtle structure in data relatively easily, without having to make strong assumptions about the parameters of distributions.
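As a quick illustration (not from the talk), the sketch below fits a nonparametric model to synthetic data with a hidden nonlinear law; scikit-learn, the data and the model choice are assumptions made only for this example.

    # Illustrative sketch (assumes scikit-learn is available): a nonparametric
    # learner recovers nonlinear structure with no distributional assumptions.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(2000, 1))              # synthetic inputs
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)   # hidden nonlinear law + noise

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[:1500], y[:1500])                        # train on part of the data

    print("held-out R^2:", model.score(X[1500:], y[1500:]))  # structure recovered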

Data Analytics: To extract knowledge

Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets, to reveal relationships and dependencies, and to perform predictions of outcomes and behaviors. (*)

(*) Wikipedia

(Out of scope of this presentation) :-(

Talk outline
We talk about Petabytes?

Challenges related to us?

The big data problem: in the end, it is a Computing Challenge.

Computing Challenge

Researchers need to crunch a large amount of data very quickly (and easily) using high-performance computers.

Example: A de novo assembly algorithm for DNA data finds reads whose sequences overlap and records those overlaps in a huge diagram called an assembly graph. For a large genome, this graph can occupy many terabytes of RAM, and completing the genome sequence can require weeks or months of computation on a world-class supercomputer.
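To make the idea of an assembly graph concrete, here is a toy Python sketch (added for illustration, not part of the original slides): it records suffix-prefix overlaps between a few short reads in a dictionary-based graph. Real assemblers use indexed data structures, tolerate sequencing errors and handle reverse complements, which is exactly why memory and compute costs explode at genome scale.

    # Toy overlap-graph construction for a handful of reads (illustrative only).
    def overlap(a, b, min_len=3):
        """Length of the longest suffix of a that matches a prefix of b."""
        start = 0
        while True:
            start = a.find(b[:min_len], start)   # candidate suffix start in a
            if start == -1:
                return 0
            if b.startswith(a[start:]):          # suffix of a == prefix of b
                return len(a) - start
            start += 1

    reads = ["TTACGGA", "ACGGATC", "GATCCTT"]
    graph = {}                                   # read -> {read: overlap length}
    for a in reads:
        for b in reads:
            if a != b:
                olen = overlap(a, b)
                if olen:
                    graph.setdefault(a, {})[b] = olen

    print(graph)  # TTACGGA -> ACGGATC (overlap 5), ACGGATC -> GATCCTT (overlap 4)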

What does life science research do at BSC?

[Pipeline schema (figure): 0. Receive the data - Raw Genome Data: 120 GB; MareNostrum: 25-40 h / 50-100 CPUs; AlMx / CLL cluster; Common Storage: 50-60 h / 1-10 CPUs.]

Also the data deluge appears in genomics

The DNA data deluge comes from thousands of sources:
more than 2,000 sequencing instruments around the world,
producing more than 15 petabytes of genetic data per year.

And soon, tens of thousands of sources!

Image source: https://share.sandia.gov/news/resources/news_releases/images/2009/biofuel_genes.jpg

The total computing burden is growing

DNA sequencing is on the path to becoming an everyday tool in medicine.

Computing, not sequencing, is now the slower and more costly aspect of genomics research.

How can we help at BSC?

Something must be done now, or else we'll need to put vital research on hold while the necessary computational techniques catch up, or are invented.
What is clear is that it will involve both better algorithms and a renewed focus on big data approaches to managing and processing data.
How?
By doing outstanding research to speed up this process.

What is the time required to retrieve information?

1 Petabyte = 1,000 x (1 Terabyte)

Assume a scan rate of 100 MB/sec:
scanning 1 Terabyte: more than 5 hours
scanning 1 Petabyte: more than 5,000 hours

Solution?

Massive parallelism, not only in computation but also in storage.
Assume 10,000 disks: scanning 1 TB takes 1 second.

Source: http://www.google.com/about/datacenters/gallery/images/_2000/IDI_018.jpg
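The same back-of-the-envelope argument, written out as a small Python sketch (the throughput figure is the nominal one assumed above; effective rates, and therefore the exact hour counts, are lower in practice):

    # Rough scan-time arithmetic (nominal throughput; real effective rates are
    # lower once seeks, filesystem and processing overheads are added).
    MB = 10**6
    TB = 10**12
    PB = 10**15

    throughput = 100 * MB          # bytes per second, per disk

    def scan_hours(size_bytes, disks=1):
        return size_bytes / (throughput * disks) / 3600

    print("1 TB, 1 disk:       %.1f h" % scan_hours(TB))   # a few hours
    print("1 PB, 1 disk:       %.0f h" % scan_hours(PB))   # thousands of hours
    print("1 TB, 10,000 disks: %.1f s" % (scan_hours(TB, 10_000) * 3600))  # ~1 second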

Talk outline
We talk about Petabytes?

Research in Big Data at BSC

To support this massive data parallelism & distribution it is necessary to redefine and improve:

Data Processing across hundreds of thousands of servers
Data Management across hundreds of thousands of data devices
Dealing with a new System Data Infrastructure

How?

How do companies like Google read and process data from 10,000 disks in parallel?

Source: http://www.google.com/about/datacenters/gallery/images/_2000/IDI_018.jpg

GOOGLE: New programming model

To meet the challenges: MapReduce

A programming model introduced by Google in the early 2000s to support distributed computing (with special emphasis on fault tolerance).

It spawned an ecosystem of big data processing tools: open source, distributed, and running on commodity hardware.

MapReduce: some details

The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes.

Two phases:
Map phase
Reduce phase

[Diagram: Input Data -> Mappers -> Reducers -> Output Data]
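A minimal pure-Python sketch of the two phases, using the customary word-count example (illustrative only, not Google's implementation):

    from collections import defaultdict

    # Map phase: each mapper turns a chunk of input into (key, value) pairs.
    def map_phase(chunk):
        for word in chunk.split():
            yield (word.lower(), 1)

    # Shuffle: group intermediate pairs by key (done by the framework).
    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    # Reduce phase: each reducer aggregates all values for one key.
    def reduce_phase(key, values):
        return (key, sum(values))

    chunks = ["big data in bioinformatics", "big data needs big computing"]
    intermediate = [pair for c in chunks for pair in map_phase(c)]      # mappers
    result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
    print(result)   # {'big': 3, 'data': 2, ...}

The framework's real contribution is everything around these two functions: partitioning the input, shuffling intermediate pairs across nodes, and transparently restarting failed tasks.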

Limitations of MapReduce as a programming model?

MapReduce is great, but not everyone is a MapReduce expert.
"I am a Python expert, but..."

There is a class of algorithms that cannot be efficiently implemented with the MapReduce programming model.

Different programming models deal with different challenges.
Example: pyCOMPSs from BSC (a minimal sketch follows below)

[Diagram: Input Data -> OmpSs/COMPSs -> Output Data]
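A minimal sketch of the pyCOMPSs task-based style, assuming a COMPSs runtime is installed; the GC-counting task and the chunk list are hypothetical, made up only to show the pattern (see the COMPSs documentation for the real setup):

    # Hypothetical pyCOMPSs example: ordinary Python functions become tasks
    # that the COMPSs runtime schedules in parallel across available nodes.
    from pycompss.api.task import task
    from pycompss.api.api import compss_wait_on

    @task(returns=int)
    def count_gc(chunk):
        # Runs remotely as an asynchronous task.
        return sum(1 for base in chunk if base in "GC")

    def total_gc(chunks):
        partial = [count_gc(c) for c in chunks]   # tasks are spawned, not run here
        partial = compss_wait_on(partial)         # synchronize: fetch actual results
        return sum(partial)

The runtime builds a task-dependency graph and schedules tasks across the cluster, so the Python code keeps its sequential appearance.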

Big Data resource management

Big Data characteristics -> requirements on the data store:
Volume -> Scalability
Variety -> Schema-less
Velocity -> Relaxed consistency & capacity to digest

Relational databases are not suitable for Big Data problems:
Lack of horizontal scalability
Complex structures to express complex relationships
Hard consistency

Non-relational databases (NoSQL) are the alternative data store:
Relaxing consistency -> eventual consistency (see the sketch below)
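For example, NoSQL stores such as Apache Cassandra let each query trade consistency for latency and availability. The sketch below uses the DataStax Python driver; the keyspace and table names are hypothetical:

    # Illustrative sketch: relaxed (tunable) consistency in Cassandra.
    # The keyspace "genomics" and table "variants" are assumptions for the example.
    from cassandra.cluster import Cluster
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("genomics")

    insert = SimpleStatement(
        "INSERT INTO variants (sample_id, pos, ref, alt) VALUES (%s, %s, %s, %s)",
        consistency_level=ConsistencyLevel.ONE)   # ack from a single replica is enough

    session.execute(insert, ("SAMPLE_001", 1234567, "A", "G"))

Raising the level (e.g. to ConsistencyLevel.QUORUM) buys stronger consistency at the cost of latency, which is the trade-off the slide refers to.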

General view of NoSQL storage (and replication)

[Diagram: data blocks 1-5 partitioned and replicated across several nodes, each node storing a subset of the blocks.]

Big Data resource management: open issues in NoSQL

Query performance depends heavily on the data model
Designed to support many concurrent short queries
Solutions: automatic configuration, query plans and query-driven data organization (a generic sketch follows below)
BSC - Aeneas: https://github.com/cugni/aeneas

[Diagram: Client, query-driven query plan, data models Model_A / Model_B / Model_C]
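The usual consequence is that tables are designed per query rather than normalized. A generic sketch (not Aeneas itself; keyspace and table names are hypothetical) stores the same data twice, each copy partitioned for the query it must answer:

    # Illustrative only: the same variant data kept in two tables, each keyed
    # for one access pattern (hypothetical names, DataStax Python driver).
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("genomics")

    session.execute("""
        CREATE TABLE IF NOT EXISTS variants_by_sample (
            sample_id text, pos bigint, ref text, alt text,
            PRIMARY KEY (sample_id, pos))   -- answers: all variants of one sample
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS samples_by_position (
            pos bigint, sample_id text, ref text, alt text,
            PRIMARY KEY (pos, sample_id))   -- answers: which samples vary at this position
    """)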

New System Data Infrastructure required

Example: Current computer systems available at genomics research institutions are commonly designed to run general computational jobs where, traditionally, the limiting resource is the CPU. Also, we find a large common storage space shared by all nodes.

Example: Computers in use for bioinformatics jobs

Typical mix of such computer systems and common bioinformatics applications: bottleneck & underutilization problems.
[Figure: an EDW appliance is a loosely coupled, shared-nothing, MPP architecture.]

First approach:
Big Data Rack Architecture: Shared Nothing

Storage Technology: Non-Volatile Memory evolution

Evolution of Flash Adoption:
FLASH + DISK
FLASH AS DISK
FLASH AS MEMORY

SNIA NVM Summit, April 28, 2013

(*) HDD is about 100x cheaper than RAM, but about 1,000 times slower.

Example: Computers in use for bioinformatics jobs

Jobs are responsible for managing the input data: partitioning, organisation, and merging of intermediate results.
Large parts of the code are not functional logic but housekeeping tasks.

Solution: active storage strategies that leverage high-performance in-memory key/value databases to accelerate data-intensive tasks (a sketch follows below).

[Diagram: compute-dense Compute Fabric <-> Active Storage Fabric <-> Archival Storage (disk/tape)]
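A minimal sketch of the idea, using Redis as a stand-in in-memory key/value store (Redis and the key names are assumptions for illustration, not the BSC system): intermediate per-chunk results stay in memory close to the compute tasks instead of being written to the shared filesystem.

    # Illustrative sketch: push intermediate per-chunk results into an
    # in-memory key/value store instead of writing temporary files.
    import redis

    kv = redis.Redis(host="localhost", port=6379)

    def process_chunk(chunk_id, reads):
        gc = sum(base in "GC" for read in reads for base in read)
        kv.set(f"gc_count:{chunk_id}", gc)      # intermediate result stays in memory

    def merge(chunk_ids):
        return sum(int(kv.get(f"gc_count:{cid}")) for cid in chunk_ids)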

Important: Remote nodes have gotten closer

Interconnects have become much faster:
an InfiniBand latency of 2,000 ns is only about 20x slower than RAM and about 100x faster than SSD.

Source: http://www.slideshare.net/blopeur/hecatonchire-kvm-forum2012benoithudzia
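A quick check of those ratios, with order-of-magnitude latencies assumed for RAM and SSD (exact figures vary by hardware):

    # Assumed order-of-magnitude access latencies, in nanoseconds.
    ram_ns = 100          # local DRAM access
    ib_ns = 2_000         # InfiniBand round trip (figure from the slide)
    ssd_ns = 200_000      # SSD read

    print("IB vs RAM: %dx slower" % (ib_ns // ram_ns))    # ~20x
    print("IB vs SSD: %dx faster" % (ssd_ns // ib_ns))    # ~100x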

Conclusion: Paradigm shift

Old: Compute-centric Model
New: Data-centric Model - massive parallelism (manycore, FPGA) and persistent memory (Flash, Phase Change)

Source: Heiko Joerg, http://www.slideshare.net/schihei/petascale-analytics-the-world-of-big-data-requires-big-analytics

Conclusions: How can we help?

How can IT researchers help scientists like you cope with the onslaught of data?
This is a crucial question and there is no definitive answer yet.
What is clear is that it will involve both better algorithms and a renewed focus on big data approaches such as data infrastructure, data management and data processing.

Questions & Answers

Over to you, what do you think?
Thank you for your attention! - Jordi

Thank you to


More information

Updated information will be posted at www.JordiTorres.eu
