
Storage Solutions For Bioinformatics: Li Yan


Storage Solutions for Bioinformatics

Li Yan
Director of FlexLab, Bioinformatics core technology laboratory
liyan3@genomics.cn
http://www.genomics.cn/FlexLab/index.html

Science and Technology Division, BGI-Shenzhen


OUTLINE

• Background
• Hardware Infrastructure of Data Storage
• Data Management
• Data Storage Architecture In BGI
• Distributed Computing on Storage Server
Background:
Fast Growing Big Data
• From small genomes to large, complex genomes
  • E. coli genome: 4.9 Mb
  • Caenorhabditis elegans genome: 100 Mb
  • Human genome: 3 Gb
  • Wheat genome: 16 Gb
  • Salamander genome: 45 Gb

• From one sample to populations
  • Human genome: 3 billion DNA subunits (A, T, C, G)
  • 80~100X sequencing: 600 GB of raw data for an individual study
  • 1000 Genomes Project: 600 TB of raw data for a population study

• From first-generation sequencing to second-generation sequencing
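The raw-data figures above can be reproduced with simple arithmetic. The sketch below assumes roughly 2 bytes per sequenced base in uncompressed form (one byte for the base call and one for its quality score); that factor and the 1,000-sample population size are assumptions for illustration, not figures stated on the slide.

```python
# Back-of-the-envelope estimate of raw sequencing data volume.
# Assumption (not from the slides): ~2 bytes per sequenced base in
# uncompressed form (1 byte for the base call, 1 byte for its quality).

GENOME_BASES = 3_000_000_000      # human genome: ~3 billion bases
BYTES_PER_BASE = 2                # assumed: base + quality score


def raw_data_bytes(genome_bases, coverage, bytes_per_base=BYTES_PER_BASE):
    """Approximate raw data size for one sample at a given coverage."""
    return genome_bases * coverage * bytes_per_base


def to_gb(n_bytes):
    return n_bytes / 1e9


def to_tb(n_bytes):
    return n_bytes / 1e12


one_sample = raw_data_bytes(GENOME_BASES, coverage=100)
population = one_sample * 1000    # assumed: 1,000 samples at the same depth

print(f"one sample at 100X: {to_gb(one_sample):.0f} GB")
print(f"1,000 samples:      {to_tb(population):.0f} TB")
```

Under these assumptions a single 100X human sample comes to about 600 GB and a thousand such samples to about 600 TB, which matches the order of magnitude quoted above.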


Long-Term Data Storage Needs
• Properly secure the data
  • Plan for data redundancy, which generally means mirroring the data in two or more copies
• Available (24x7x365) for all kinds of uses
  • Readily accessible and in the right format
• Fast data transfer for collaborations
  • Fast network transfer server (Aspera) instead of mailing a hard drive
• Scalable, easy to scale up
  • Choosing reliable file systems
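The redundancy point, keeping two or more copies, is only useful if each copy is verified. The following is a minimal sketch of mirroring a file to several locations and checking each copy against a SHA-256 checksum; the function names `sha256_of` and `mirror` and the directory names are hypothetical, and a production system would normally rely on the storage layer rather than a script like this.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files need little memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(source, replica_dirs):
    """Copy `source` into every replica directory and verify each copy."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for directory in replica_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)          # copies data and metadata
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies


# Demo in a throwaway directory (hypothetical file and directory names).
workdir = Path(tempfile.mkdtemp())
sample = workdir / "sample.fq"
sample.write_text("@read1\nACGT\n+\nIIII\n")
replicas = mirror(sample, [workdir / "copy_a", workdir / "copy_b"])
print([str(p.relative_to(workdir)) for p in replicas])
```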
Hardware Infrastructure of Data Storage
Types of Storage Infrastructure
• Disk library
  • A high-capacity storage system that holds a quantity of CD-ROM, DVD or magneto-optic (MO) disks in a storage rack and feeds them to one or more drives for reading and writing.
• Magnetic tape
  • A high-capacity data storage system for storing, retrieving, reading and writing multiple magnetic tape cartridges.
• Redundant array of independent disks (RAID)
  • A storage technology that combines multiple disk drive components into a logical unit.
• Direct-attached storage (DAS)
  • A digital storage system directly attached to a server or workstation, without a storage network in between.
• Network-attached storage (NAS)
  • File-level computer data storage connected to a computer network, providing data access to heterogeneous clients.
• Storage area network (SAN)
  • A dedicated network that provides access to consolidated, block-level data storage.
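To make the RAID entry above more concrete, here is a toy sketch of how several disks combined into one logical unit can survive the loss of one of them. It uses single XOR parity, the idea behind RAID levels such as 4 and 5; the slides do not say which RAID level is meant, so this is an illustration under that assumption, and the 4-byte blocks are invented.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, as if striped across three disks (toy 4-byte blocks).
data_blocks = [b"ACGT", b"TTGA", b"CCAG"]
parity = xor_blocks(data_blocks)          # would be stored on a fourth disk

# Simulate losing the second disk and rebuilding it from the
# surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block. Losing two disks at once is not recoverable with a single parity block, which is one reason the comparison of storage types lists a "possible false sense of security" among the drawbacks of RAID.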
| Type of storage | Pros | Cons | General use |
|---|---|---|---|
| Disk library | Fast; high storage capacity; high data availability | Not as easily accessible as DAS; intended for write-once, read-rarely information | Disk-to-disk backup; archiving; near-line storage |
| Magnetic tape | Low cost per megabyte; portable; unlimited capacity (with multiple tapes) | Inconvenient for fast recovery of individual or group files | Archiving; limited-budget businesses; offsite storage |
| Redundant array of independent disks (RAID) | Fast; high storage capacity; high data availability; reliable; security; fault tolerance | Possible false sense of security; some recovery difficulty on some systems; high cost for optimum systems | Swap files; Internet service providers; redundant storage |
| Direct-attached storage (DAS) | Simple; low starting cost; easy to use | Needs separate storage for each server; not easy to transfer data in network; server takes application processing load | Data and application sharing; data backup; archiving |
| Network-attached storage (NAS) | Fast file access for multiple clients; ease of data sharing; high storage capacity; redundancy; ease of drive mirroring; consolidated resources | Less convenient than SAN for moving large blocks of data | Backup; archiving; redundant storage |
| Storage area network (SAN) | Excellent for moving large blocks of data; exceptional reliability; easily available; fault tolerance; scalability | Expensive; lack of standardization; management complexity | Large databases; bandwidth-intensive applications; mission-critical applications |
Software Level of Data Storage
Data flow of NGS
[Figure: NGS data flow. Labels: Sequencer, Raw Data, Alignment, Assembly, Association, Data Store, Meaningful Biology Data]

Complex workflow
• Annotation of features
• Variations/mutations
• Protein structures
• Gene expression
• Functional networks


Data Management
• Classify the data into different levels
  • First level of storage: dynamic, fast, temporary
  • Second level of storage: slower than the first level, but durable and safe
  • Third level of storage: high-capacity medium for backups and archives
• Choosing file systems
  • Currently popular distributed file systems include Lustre, HDFS, MogileFS, FreeNAS, FastDFS, OpenAFS, MooseFS, pNFS, and GoogleFS.
Classify the data into different levels
• First level of storage: dynamic, fast, temporary
  • Intermediate results of data analysis
  • Reference data
  • …
• Second level of storage: slower than the first level, but durable and safe
  • Sequencing raw data
  • Meaningful data
• Third level of storage: high-capacity medium for backups and archives
  • Backups and archives of raw data and meaningful data
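The three-level classification above can be expressed as a small lookup. This is a sketch only: the category names (`intermediate_result`, `raw_sequencing_data`, and so on) are hypothetical labels for the items on the slide, not identifiers from any BGI system.

```python
from enum import Enum


class Tier(Enum):
    FIRST = "first level: dynamic, fast, temporary"
    SECOND = "second level: slower, durable and safe"
    THIRD = "third level: backups and archives"


# Hypothetical category labels, chosen to mirror the slide's examples.
TIER_BY_CATEGORY = {
    "intermediate_result": Tier.FIRST,
    "reference_data": Tier.FIRST,
    "raw_sequencing_data": Tier.SECOND,
    "meaningful_data": Tier.SECOND,
    "backup": Tier.THIRD,
    "archive": Tier.THIRD,
}


def tier_for(category):
    """Return the storage tier for a data category, or raise ValueError."""
    try:
        return TIER_BY_CATEGORY[category]
    except KeyError:
        raise ValueError(f"unknown data category: {category!r}") from None


for name in ("intermediate_result", "raw_sequencing_data", "archive"):
    print(f"{name}: {tier_for(name).value}")
```

Failing loudly on an unknown category is a deliberate choice here: silently defaulting to the temporary first level would risk losing data that should have been kept.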
Distributed File Systems
• Lustre
Lustre is a large-scale, safe, reliable and highly available cluster file system, developed and maintained by Sun. Lustre can support more than 10,000 nodes and storage systems on the scale of petabytes (PB).
• Hadoop (HDFS)
Hadoop is not just a distributed file system (HDFS) for storage; it is a framework for running large-scale distributed applications on clusters of general-purpose computing hardware.
• OneFS
OneFS allows a single cluster to scale to more than 1.6 petabytes of capacity and up to 10 GB/s (gigabytes per second) of throughput.
Distributed File systems

• MogileFS (www.danga.com)
• FreeNAS (www.openqrm.org)
• FastDFS (code.google.com/p/fastdfs)
• OpenAFS (www.openafs.org)
• MooseFS (derf.homelinux.org)
• pNFS (www.pnfs.com)
• GoogleFS
Data Compression and Data Security
• Data compression
  • Commonly used, general-purpose methods:
    • Lempel-Ziv, Burrows-Wheeler transform (BWT)
  • Methods specific to DNA sequences:
    • Biocompress, GenCompress, CTW+LZ, GeNML, fqzcomp, sam_comp
• Data security
  • RAID system failure / redundancy
  • File system
  • Network
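One reason DNA-specific compressors can beat general-purpose ones is that a nucleotide needs only 2 bits, not the 8 bits of a text character. The sketch below shows that baseline idea with plain fixed-width packing. It is not the algorithm used by Biocompress, fqzcomp or any other tool named above, and it assumes the sequence contains only A, C, G and T (no N or quality scores).

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(sequence):
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))     # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(packed, length):
    """Inverse of pack(); `length` is the original number of bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


seq = "ACGTTGCAAC"
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")
print(unpack(packed, len(seq)) == seq)
```

The original length has to be stored alongside the packed bytes, because the final byte may be padded.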
Data Storage Architecture in BGI
[Figure: BGI data storage architecture. Labels: Sequencers, Compute Nodes, Tape Library, Two Copies; arrows marked Write and Read. Successive versions of the diagram highlight the First Level Storage, Second Level Storage, and Third Level Storage.]
Distributed Computing on Storage Server
Traditional Genome Assembly
Costly, unscalable
[Figure labels: NGS read file, Sequence Assembly, large-memory server (>500 GB), Storage, Users]
Distributed Genome Assembly
Several storage servers (IBM3630 × 16 for a human genome)
Assembly
……
Cost-effective, scalable


Hecate
• Constructing de Bruijn graph
• Solving tiny repeats
• Merging bubbles
• Scaffolding
• Merging contigs
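As an illustration of the de Bruijn graph construction step named above, the following sketch builds a small graph from toy reads: each distinct k-mer becomes an edge from its first k-1 bases to its last k-1 bases. This is a generic, single-machine textbook construction, not Hecate's distributed implementation, and the reads and the value of k are invented for the example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:          # keep one edge per distinct k-mer
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


reads = ["ACGTC", "CGTCA", "GTCAG"]   # toy overlapping reads
graph = build_de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", graph[node])
```

In this toy case the graph is a single unbranched path (AC, CG, GT, TC, CA, AG), which spells out the sequence ACGTCAG. Real data produces branches from repeats and sequencing errors, which is what the repeat-solving and bubble-merging steps address.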


Gaea 2.1
Inputs: reads, reference genome
• Preprocessing: distributed indexing for load balancing
• Locating: flexible splitting tolerates more mismatches
• Aligning: dynamic programming for robust gap alignment
• SNP calling: standard mapping quality for SNP calling
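The aligning step relies on dynamic programming to handle gaps. As a rough illustration of the technique, here is a Needleman-Wunsch global alignment score with a linear gap penalty. The slides do not describe Gaea's actual scoring scheme or algorithm variant, so the scores (`match=1`, `mismatch=-1`, `gap=-2`) and the function name are assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch alignment score with a linear gap penalty.

    Only two rows of the dynamic-programming table are kept in memory.
    """
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,   # align a[i-1] with b[j-1]
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            ))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "ACT"))
```

With these parameters the identical pair scores 4, and the pair that differs by one deleted base scores 1 (three matches and one gap), showing how a gap is tolerated without breaking the rest of the alignment.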
Q&A
