Big Data Challenges in Bioinformatics

The document discusses big data challenges in bioinformatics. Rapid growth in data from scientific instruments such as the Large Hadron Collider and the Large Synoptic Survey Telescope is driving a data deluge. Genome sequencing costs have plummeted, but computing challenges remain: analyzing large datasets can take weeks or months. Harnessing the volume, variety, and velocity of big data requires deriving value through data analytics and machine learning. Solving these computational challenges will involve distributed processing across many servers and devices, as well as improved data management and infrastructure.


Big Data Challenges in Bioinformatics

BARCELONA SUPERCOMPUTING CENTER
COMPUTER SCIENCE DEPARTMENT
Autonomic Systems and eBusiness Platforms

Jordi Torres
Jordi.Torres@bsc.es

Talk outline
We talk about Petabytes?

Deluge of Data

Data is now considered the Fourth Paradigm in Science
(the first three paradigms were experimental, theoretical and computational science).

This shift is being driven by the rapid growth in data from improvements in scientific instruments.

Scientific instruments

Physics: the Large Hadron Collider produced around 15 petabytes of data in 2012.

Astronomy: the Large Synoptic Survey Telescope is anticipated to produce around 10 petabytes per year.

Example: In Genome Research?


[Chart: Cost of sequencing a human-sized genome, September 2001 to January 2013, on a logarithmic scale running from $1,000 up to $100,000,000.]
Source: National Human Genome Research Institute (NHGRI), http://www.genome.gov/sequencingcosts/

Data Deluge: driven by changes in how big data is generated


Example: Biomedicine

Image source: Big Data in biomedicine, Drug Discovery Today. Fabricio F. Costa (2013)

Important open issues

Transfer of data from one location to another (*)
shipping external hard disks
processing the data while it is being transferred
Future? Data won't be moved!

(*) Out of scope of this presentation


Source: http://footage.shutterstock.com/clip-4721783-stock-footage-animation-presents-datatransfer-between-a-computer-and-a-cloud-a-concept-of-cloud-computing.html

Important open issues


Security and privacy of the data from individuals (*)
The same problems that appear in other areas
Use advanced encryption algorithms


(Out of scope of this presentation)


Source: http://www.tbase.com/corporate/privacy-and-security

Important open issues

Increased need to store data (*)


Cloud-based computing solutions have emerged

Source: http://www.custodia-documental.com/wp-content/uploads/Cloud-Big-Data.jpg

Important open issues

Increased need to store data (*)

Cloud-based computing solutions have emerged
The most common Cloud Computing inhibitors should be tackled:
Security, Privacy, Lack of Standards, Data Integrity, Regulatory, Data Recovery, Control, Vendor Maturity, ...

The most critical open issue

DERIVING VALUE VIA HARNESSING
VOLUME,
VARIETY AND
VELOCITY (*)

(*) Big Data definition?

The information by itself is not actionable knowledge.

Source: http://www.theatlantic.com/health/archive/2012/05/big-data-can-save-health-care-0151-but-at-what-cost-to-privacy/257621/
Source: cetemma - matar

What is the usefulness of Big Data?

Performs predictions of outcomes and behaviors.

[Diagram: Data + Volume, Information, Knowledge, Value]

Approach: Machine Learning works in the sense that these methods detect subtle structure in data relatively easily, without having to make strong assumptions about the parameters of distributions.
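As a quick illustration (not from the talk), the sketch below fits a nonparametric model to synthetic data with a hidden nonlinear law; scikit-learn, the data and the model choice are assumptions made only for this example.

    # Illustrative sketch (assumes scikit-learn is available): a nonparametric
    # learner recovers nonlinear structure with no distributional assumptions.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(2000, 1))              # synthetic inputs
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)   # hidden nonlinear law + noise

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[:1500], y[:1500])                        # train on part of the data

    print("held-out R^2:", model.score(X[1500:], y[1500:]))  # structure recovered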

Data Analytics: To extract knowledge

Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets, to reveal relationships and dependencies, and to perform predictions of outcomes and behaviors. (*)

(*) Wikipedia

(Out of scope of this presentation) :-(

Talk outline
We talk about Petabytes?

Challenges related to us?

The big data problem: in the end, it is a Computing Challenge.

Computing Challenge

Researchers need to crunch a large amount of data very quickly (and easily) using high-performance computers.

Example: A de novo assembly algorithm for DNA data finds reads whose sequences overlap and records those overlaps in a huge diagram called an assembly graph. For a large genome, this graph can occupy many terabytes of RAM, and completing the genome sequence can require weeks or months of computation on a world-class supercomputer.
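To make the idea of an assembly graph concrete, here is a toy Python sketch (added for illustration, not part of the original slides): it records suffix-prefix overlaps between a few short reads in a dictionary-based graph. Real assemblers use indexed data structures, tolerate sequencing errors and handle reverse complements, which is exactly why memory and compute costs explode at genome scale.

    # Toy overlap-graph construction for a handful of reads (illustrative only).
    def overlap(a, b, min_len=3):
        """Length of the longest suffix of a that matches a prefix of b."""
        start = 0
        while True:
            start = a.find(b[:min_len], start)   # candidate suffix start in a
            if start == -1:
                return 0
            if b.startswith(a[start:]):          # suffix of a == prefix of b
                return len(a) - start
            start += 1

    reads = ["TTACGGA", "ACGGATC", "GATCCTT"]
    graph = {}                                   # read -> {read: overlap length}
    for a in reads:
        for b in reads:
            if a != b:
                olen = overlap(a, b)
                if olen:
                    graph.setdefault(a, {})[b] = olen

    print(graph)  # TTACGGA -> ACGGATC (overlap 5), ACGGATC -> GATCCTT (overlap 4)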

What does life science research do at BSC?

[Pipeline schema (figure): 0. Receive the data - Raw Genome Data: 120 GB; MareNostrum: 25-40 h / 50-100 CPUs; AlMx / CLL cluster; Common Storage: 50-60 h / 1-10 CPUs.]

Also the data deluge appears in genomics

The DNA data deluge comes from thousands of sources:
more than 2,000 sequencing instruments around the world,
producing more than 15 petabytes of genetic data per year.

And soon, tens of thousands of sources!

Image source: https://share.sandia.gov/news/resources/news_releases/images/2009/biofuel_genes.jpg

The total computing burden is growing

DNA sequencing is on the path to becoming an everyday tool in medicine.

Computing, not sequencing, is now the slower and more costly aspect of genomics research.

How can we help at BSC?

Something must be done now, or else we'll need to put vital research on hold while the necessary computational techniques catch up, or are invented.
What is clear is that it will involve both better algorithms and a renewed focus on big data approaches to managing and processing data.
How?
By doing outstanding research to speed up this process.

What is the time required to retrieve information?

1 Petabyte = 1,000 x (1 Terabyte)

Assume a scan rate of 100 MB/sec:
scanning 1 Terabyte: more than 5 hours
scanning 1 Petabyte: more than 5,000 hours

Solution?

Massive parallelism, not only in computation but also in storage.
Assume 10,000 disks: scanning 1 TB takes 1 second.

Source: http://www.google.com/about/datacenters/gallery/images/_2000/IDI_018.jpg
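The same back-of-the-envelope argument, written out as a small Python sketch (the throughput figure is the nominal one assumed above; effective rates, and therefore the exact hour counts, are lower in practice):

    # Rough scan-time arithmetic (nominal throughput; real effective rates are
    # lower once seeks, filesystem and processing overheads are added).
    MB = 10**6
    TB = 10**12
    PB = 10**15

    throughput = 100 * MB          # bytes per second, per disk

    def scan_hours(size_bytes, disks=1):
        return size_bytes / (throughput * disks) / 3600

    print("1 TB, 1 disk:       %.1f h" % scan_hours(TB))   # a few hours
    print("1 PB, 1 disk:       %.0f h" % scan_hours(PB))   # thousands of hours
    print("1 TB, 10,000 disks: %.1f s" % (scan_hours(TB, 10_000) * 3600))  # ~1 second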

Talk outline
We talk about Petabytes?

Research in Big Data at BSC

To support this massive data parallelism & distribution it is necessary to redefine and improve:

Data Processing across hundreds of thousands of servers
Data Management across hundreds of thousands of data devices
Dealing with a new System Data Infrastructure

How?

How do companies like Google read and process data from 10,000 disks in parallel?

Source: http://www.google.com/about/datacenters/gallery/images/_2000/IDI_018.jpg

GOOGLE: New programming model

To meet the challenges: MapReduce

A programming model introduced by Google in the early 2000s to support distributed computing (with special emphasis on fault tolerance).

It spawned an ecosystem of big data processing tools: open source, distributed, and running on commodity hardware.

MapReduce: some details

The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes.

Two phases:
Map phase
Reduce phase

[Diagram: Input Data -> Mappers -> Reducers -> Output Data]
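A minimal pure-Python sketch of the two phases, using the customary word-count example (illustrative only, not Google's implementation):

    from collections import defaultdict

    # Map phase: each mapper turns a chunk of input into (key, value) pairs.
    def map_phase(chunk):
        for word in chunk.split():
            yield (word.lower(), 1)

    # Shuffle: group intermediate pairs by key (done by the framework).
    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    # Reduce phase: each reducer aggregates all values for one key.
    def reduce_phase(key, values):
        return (key, sum(values))

    chunks = ["big data in bioinformatics", "big data needs big computing"]
    intermediate = [pair for c in chunks for pair in map_phase(c)]      # mappers
    result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
    print(result)   # {'big': 3, 'data': 2, ...}

The framework's real contribution is everything around these two functions: partitioning the input, shuffling intermediate pairs across nodes, and transparently restarting failed tasks.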

Limitations of MapReduce as a programming model?

MapReduce is great, but not everyone is a MapReduce expert.
"I am a Python expert, but..."

There is a class of algorithms that cannot be efficiently implemented with the MapReduce programming model.

Different programming models deal with different challenges.
Example: pyCOMPSs from BSC (a minimal sketch follows below)

[Diagram: Input Data -> OmpSs/COMPSs -> Output Data]
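A minimal sketch of the pyCOMPSs task-based style, assuming a COMPSs runtime is installed; the GC-counting task and the chunk list are hypothetical, made up only to show the pattern (see the COMPSs documentation for the real setup):

    # Hypothetical pyCOMPSs example: ordinary Python functions become tasks
    # that the COMPSs runtime schedules in parallel across available nodes.
    from pycompss.api.task import task
    from pycompss.api.api import compss_wait_on

    @task(returns=int)
    def count_gc(chunk):
        # Runs remotely as an asynchronous task.
        return sum(1 for base in chunk if base in "GC")

    def total_gc(chunks):
        partial = [count_gc(c) for c in chunks]   # tasks are spawned, not run here
        partial = compss_wait_on(partial)         # synchronize: fetch actual results
        return sum(partial)

The runtime builds a task-dependency graph and schedules tasks across the cluster, so the Python code keeps its sequential appearance.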

Big Data resource management

Big Data characteristics -> requirements on the data store:
Volume -> Scalability
Variety -> Schema-less
Velocity -> Relaxed consistency & capacity to digest

Relational databases are not suitable for Big Data problems:
Lack of horizontal scalability
Complex structures to express complex relationships
Hard consistency

Non-relational databases (NoSQL) are the alternative data store:
Relaxing consistency -> eventual consistency (see the sketch below)
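For example, NoSQL stores such as Apache Cassandra let each query trade consistency for latency and availability. The sketch below uses the DataStax Python driver; the keyspace and table names are hypothetical:

    # Illustrative sketch: relaxed (tunable) consistency in Cassandra.
    # The keyspace "genomics" and table "variants" are assumptions for the example.
    from cassandra.cluster import Cluster
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("genomics")

    insert = SimpleStatement(
        "INSERT INTO variants (sample_id, pos, ref, alt) VALUES (%s, %s, %s, %s)",
        consistency_level=ConsistencyLevel.ONE)   # ack from a single replica is enough

    session.execute(insert, ("SAMPLE_001", 1234567, "A", "G"))

Raising the level (e.g. to ConsistencyLevel.QUORUM) buys stronger consistency at the cost of latency, which is the trade-off the slide refers to.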

General view of NoSQL storage (and replication)

[Diagram: data blocks 1-5 partitioned and replicated across several nodes, each node storing a subset of the blocks.]

Big Data resource management: open issues in NoSQL

Query performance depends heavily on the data model
Designed to support many concurrent short queries
Solutions: automatic configuration, query plans and query-driven data organization (a generic sketch follows below)
BSC - Aeneas: https://github.com/cugni/aeneas

[Diagram: Client, query-driven query plan, data models Model_A / Model_B / Model_C]
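The usual consequence is that tables are designed per query rather than normalized. A generic sketch (not Aeneas itself; keyspace and table names are hypothetical) stores the same data twice, each copy partitioned for the query it must answer:

    # Illustrative only: the same variant data kept in two tables, each keyed
    # for one access pattern (hypothetical names, DataStax Python driver).
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("genomics")

    session.execute("""
        CREATE TABLE IF NOT EXISTS variants_by_sample (
            sample_id text, pos bigint, ref text, alt text,
            PRIMARY KEY (sample_id, pos))   -- answers: all variants of one sample
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS samples_by_position (
            pos bigint, sample_id text, ref text, alt text,
            PRIMARY KEY (pos, sample_id))   -- answers: which samples vary at this position
    """)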

New System Data Infrastructure required

Example: Current computer systems available at genomics research institutions are commonly designed to run general computational jobs where, traditionally, the limiting resource is the CPU. Also, we find a large common storage space shared by all nodes.

Example: Computers in use for bioinformatics jobs

Typical mix of such computer systems and common bioinformatics applications: bottleneck & underutilization problems.
[Figure: an EDW appliance is a loosely coupled, shared-nothing, MPP architecture.]

First approach:
Big Data Rack Architecture: Shared Nothing

Storage Technology: Non-Volatile Memory evolution

Evolution of Flash Adoption:
FLASH + DISK
FLASH AS DISK
FLASH AS MEMORY

SNIA NVM Summit, April 28, 2013

(*) HDD is about 100x cheaper than RAM, but about 1,000 times slower.

Example: Computers in use for bioinformatics jobs

Jobs are responsible for managing the input data: partitioning, organisation, and merging of intermediate results.
Large parts of the code are not functional logic but housekeeping tasks.

Solution: active storage strategies that leverage high-performance in-memory key/value databases to accelerate data-intensive tasks (a sketch follows below).

[Diagram: compute-dense Compute Fabric <-> Active Storage Fabric <-> Archival Storage (disk/tape)]
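A minimal sketch of the idea, using Redis as a stand-in in-memory key/value store (Redis and the key names are assumptions for illustration, not the BSC system): intermediate per-chunk results stay in memory close to the compute tasks instead of being written to the shared filesystem.

    # Illustrative sketch: push intermediate per-chunk results into an
    # in-memory key/value store instead of writing temporary files.
    import redis

    kv = redis.Redis(host="localhost", port=6379)

    def process_chunk(chunk_id, reads):
        gc = sum(base in "GC" for read in reads for base in read)
        kv.set(f"gc_count:{chunk_id}", gc)      # intermediate result stays in memory

    def merge(chunk_ids):
        return sum(int(kv.get(f"gc_count:{cid}")) for cid in chunk_ids)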

Important: Remote nodes have gotten closer

Interconnects have become much faster:
an InfiniBand latency of 2,000 ns is only about 20x slower than RAM and about 100x faster than SSD.

Source: http://www.slideshare.net/blopeur/hecatonchire-kvm-forum2012benoithudzia
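A quick check of those ratios, with order-of-magnitude latencies assumed for RAM and SSD (exact figures vary by hardware):

    # Assumed order-of-magnitude access latencies, in nanoseconds.
    ram_ns = 100          # local DRAM access
    ib_ns = 2_000         # InfiniBand round trip (figure from the slide)
    ssd_ns = 200_000      # SSD read

    print("IB vs RAM: %dx slower" % (ib_ns // ram_ns))    # ~20x
    print("IB vs SSD: %dx faster" % (ssd_ns // ib_ns))    # ~100x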

Conclusion: Paradigm shift

Old: Compute-centric Model
New: Data-centric Model - massive parallelism (manycore, FPGA) and persistent memory (Flash, Phase Change)

Source: Heiko Joerg, http://www.slideshare.net/schihei/petascale-analytics-the-world-of-big-data-requires-big-analytics

Conclusions: How can we help?

How can IT researchers help scientists like you cope with the onslaught of data?
This is a crucial question and there is no definitive answer yet.
What is clear is that it will involve both better algorithms and a renewed focus on big data approaches such as data infrastructure, data management and data processing.

Questions & Answers

Over to you, what do you think?
Thank you for your attention! - Jordi

Thank you to


More information

Updated information will be posted at www.JordiTorres.eu
