Big Data Challenges in Bioinformatics
Talk outline
We talk about Petabytes?
Deluge of Data
Scientific instruments
Physics
Astronomy
[Figure: chart with a logarithmic dollar axis ($1,000 to $10,000,000) and a time axis (Sep-01 to Jan-13)]
Image source: Big Data in biomedicine, Drug Discovery Today. Fabricio F. Costa (2013)
won't be moved!
Source: http://www.tbase.com/corporate/privacy-andsecurity
Source: http://www.custodia-documental.com/wp-content/uploads/Cloud-Big-Data.jpg
Security
Privacy
Lack of Standards
Data Integrity
Regulatory
Data Recovery
Control
Vendor Maturity
...
Source: http://www.theatlantic.com/health/archive/2012/05/big-data-can-save-health-care-0151-but-at-what-cost-to-privacy/257621/
The information is non-actionable knowledge
(*) Big Data definition?
[Diagram labels: Data, Information, Knowledge; Volume, Value]
Computing Challenge
Researchers need to crunch a large amount of data very
quickly (and easily) using high-performance computers
MareNostrum
25-40 h / 50-100 CPUs
Common
Storage
50-60 h / 1-10 CPUs
1 Petabyte = 1000 x (1 Terabyte)
Assume 100 MB/sec:
scanning 1 Terabyte: about 10,000 seconds (roughly 2.8 hours)
scanning 1 Petabyte: about 10,000,000 seconds (roughly 116 days)
Solution?
massive parallelism
not only in computation but also in storage
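To make the case for parallel storage concrete, the sketch below estimates scan time when the data is split evenly across several independent devices. The function name `scan_seconds`, the decimal units, and the figure of 1000 devices are illustrative assumptions, not values from the talk; only the 100 MB/sec rate comes from the slide.

```python
def scan_seconds(size_bytes: float, throughput_bytes_per_sec: float, devices: int = 1) -> float:
    """Time to read size_bytes when the data is split evenly across
    `devices` independent devices, each sustaining the given throughput."""
    return size_bytes / (throughput_bytes_per_sec * devices)

# Decimal units, matching 1 Petabyte = 1000 x 1 Terabyte.
MB = 10**6
TB = 10**12
PB = 10**15
RATE = 100 * MB  # the 100 MB/sec assumed on the slide

one_tb = scan_seconds(TB, RATE)
one_pb = scan_seconds(PB, RATE)
# Hypothetical setup: the petabyte is spread over 1000 devices read in parallel.
one_pb_parallel = scan_seconds(PB, RATE, devices=1000)

print(f"1 TB, 1 device:     {one_tb / 3600:.1f} hours")
print(f"1 PB, 1 device:     {one_pb / 86400:.1f} days")
print(f"1 PB, 1000 devices: {one_pb_parallel / 3600:.1f} hours")
```

Under this idealized model (no coordination cost, perfectly even split), a petabyte on 1000 devices scans in the same time as a terabyte on one.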
How?
Two phases
Map phase: Input Data -> Mappers
Reduce phase: Reducers -> Output Data
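The two phases can be sketched in a few lines of plain Python. This is a minimal, single-process illustration, assuming a toy k-mer counting task; the function names (`map_kmers`, `shuffle`, `reduce_counts`) and the input reads are hypothetical and do not come from any particular framework. In a real deployment the mappers and reducers would run in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain

def map_kmers(read: str, k: int = 3):
    """Map phase: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce phase: combine all values emitted for one key."""
    return key, sum(values)

reads = ["ACGTAC", "GTACGT"]  # toy input data (illustrative only)
mapped = chain.from_iterable(map_kmers(r) for r in reads)
counts = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each mapper only sees its own read and each reducer only sees one key, both phases can be spread over many machines without shared state.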
Input Data
Output Data
OmpSs/COMPSs
Volume
Scalability
Variety
Schema-less
Velocity
Model_A
Model_B
Model_C
Client
Query Driven
Query Plan
First approach:
Big Data Rack Architecture:
Shared Nothing
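One common way to realize a shared-nothing layout is to give every node its own private store and route each key to exactly one owner. The sketch below assumes hash-based routing; the `Node` class, the `owner` function, the node count, and the sample identifiers are all hypothetical and only meant to illustrate the idea.

```python
import hashlib

class Node:
    """One shared-nothing node: private storage, no shared disk or memory."""
    def __init__(self, name: str):
        self.name = name
        self.store = {}

def owner(key: str, nodes):
    """Pick the node that owns a key using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

nodes = [Node(f"node{i}") for i in range(4)]  # assumed rack of 4 nodes
samples = ["sample-001", "sample-002", "sample-003", "sample-004", "sample-005"]
for sample_id in samples:
    owner(sample_id, nodes).store[sample_id] = f"reads for {sample_id}"

total = sum(len(n.store) for n in nodes)
print(total, [len(n.store) for n in nodes])
```

Each key lives on exactly one node, so a lookup touches only that node and the others stay free for other work.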
FLASH AS DISK
FLASH AS MEMORY
(*) HDD is 100 times cheaper than RAM, but 1000 times slower
Solutions: Active storage strategies for leveraging high-performance in-memory key/value databases to accelerate data-intensive tasks
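The idea of active storage can be illustrated with a toy in-memory key/value store that runs a function next to the data instead of shipping the data to the caller. This is a minimal sketch under stated assumptions: the `ActiveStore` class, its `apply` method, and the `gc_content` task are hypothetical and do not represent the API of any real database.

```python
class ActiveStore:
    """Toy in-memory key/value store that can run a function next to the data."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Passive access: the whole value travels to the caller.
        return self._data[key]

    def apply(self, key, func):
        # Active access: func runs where the value lives; only the result travels.
        return func(self._data[key])

def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)

store = ActiveStore()
store.put("read-1", "GGCCAATT")
result = store.apply("read-1", gc_content)
print(result)
```

With large values, returning a single number instead of the full sequence is what reduces data movement between storage and compute.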
Compute Dense
Compute Fabric
Archival Storage
Disk/Tape
Source: http://www.slideshare.net/blopeur/hecatonchire-kvm-forum2012benoithudzia
New
Compute-centric Model
Data-centric Model
Manycore
FPGA
Massive Parallelism
Persistent Memory
Flash
Phase Change
Over to you,
what do you think?
Thank you for your attention! - Jordi
Thank you to
More information