Storage Solutions For Bioinformatics: Li Yan

Storage Solutions for Bioinformatics
Li Yan
Director of FlexLab, Bioinformatics core technology laboratory
liyan3@genomics.cn
http://www.genomics.cn/FlexLab/index.html
Science and Technology Division, BGI-Shenzhen

OUTLINE
• Background
• Hardware Infrastructure of Data Storage
• Data Management
• Data Storage Architecture In BGI
• Distributed Computing on Storage Server
Background: Fast Growing Big Data
• From small genomes to large complex genomes
  - E. coli genome: 4.9 Mb
  - Caenorhabditis elegans genome: 100 Mb
  - Human genome: 3 Gb
  - Wheat genome: 16 Gb
  - Salamander genome: 45 Gb
• From one sample to populations
  - Human genome: 3 billion DNA subunits (A, T, C, G)
  - 80~100X sequencing: 600 GB of raw data for an individual study
  - 1000 Genomes Project: 600 TB of raw data for a population study
• From the first generation sequencing to the second generation sequencing
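
The raw-data figures above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes about 2 bytes per sequenced base (one for the base call and one for its quality score, as in uncompressed FASTQ) and 1,000 samples for the population study. Both values are assumptions made for illustration; the slide does not state how its numbers were derived.

```python
# Rough estimate of raw sequencing data volume.
GENOME_SIZE_BP = 3_000_000_000  # human genome, about 3 billion bases (from the slide)
BYTES_PER_BASE = 2              # assumption: 1 byte base call + 1 byte quality score
SAMPLES = 1000                  # assumption: sample count for the population study


def raw_data_bytes(genome_size_bp, depth, bytes_per_base=BYTES_PER_BASE):
    """Return the approximate raw data size in bytes for one sample."""
    return genome_size_bp * depth * bytes_per_base


per_sample = raw_data_bytes(GENOME_SIZE_BP, depth=100)
population = per_sample * SAMPLES

print(f"Per sample: {per_sample / 1e9:.0f} GB")
print(f"Population: {population / 1e12:.0f} TB")
```

Under these assumptions, 100X coverage of a 3 Gb genome gives about 600 GB per sample, and 1,000 such samples give about 600 TB.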

Long-Term Data Storage Needs
• Properly secure the data
  - Plan for data redundancy, which generally means mirroring the data in two or more copies
• Available (24x7x365) for all kinds of uses
  - Readily accessible and in the right format
• Fast data transfer for collaborations
  - Fast network server (Aspera) instead of mailing a hard drive
• Scalable, easy to scale up
  - Choosing reliable file systems
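
The redundancy point above (mirroring data in two or more copies) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names `sha256_of` and `mirror_file` and the directory layout are hypothetical, and a production setup would normally rely on the storage system's own replication.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror_file(source, mirror_dirs):
    """Copy `source` into each mirror directory and verify every copy.

    Raises IOError if a copy's checksum differs from the source's.
    """
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for directory in mirror_dirs:
        target = Path(directory) / source.name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```

Verifying the checksum after each copy guards against silent corruption, which mirroring alone does not detect.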
Hardware infrastructure of data storage
Types of storage infrastructure
• Disk library
  - A high-capacity storage system that holds a quantity of CD-ROM, DVD or magneto-optical (MO) disks in a storage rack and feeds them to one or more drives for reading and writing.
• Magnetic tape
  - A high-capacity data storage system for storing, retrieving, reading and writing multiple magnetic tape cartridges.
• Redundant array of independent disks (RAID)
  - A storage technology that combines multiple disk drive components into a logical unit.
• Direct-attached storage (DAS)
  - A digital storage system directly attached to a server or workstation, without a storage network in between.
• Network-attached storage (NAS)
  - File-level computer data storage connected to a computer network, providing data access to heterogeneous clients.
• Storage area network (SAN)
  - A dedicated network that provides access to consolidated, block-level data storage.
Type of storage | Pros | Cons | General use
Disk library | Fast; high storage capacity; high data availability | Not as easily accessible as DAS; intended for write-once, read-rarely information | Disk-to-disk backup; archiving; near-line storage
Magnetic tape | Low cost per megabyte; portable; unlimited capacity (with multiple tapes) | Inconvenient for fast recovery of individual or group files | Archiving; limited-budget businesses; offsite storage
Redundant array of independent disks (RAID) | Fast; high storage capacity; high data availability; reliable; security; fault tolerance | Possible false sense of security; some recovery difficulty on some systems; high cost for optimum systems | Swap files; Internet service providers; redundant storage
Direct-attached storage (DAS) | Simple; low starting cost; easy to use | Needs separate storage for each server; not easy to transfer data in network; server takes application processing load | Data and application sharing; data backup; archiving
Network-attached storage (NAS) | Fast file access for multiple clients; ease of data sharing; high storage capacity; redundancy; ease of drive mirroring; consolidated resources | Less convenient than SAN for moving large blocks of data | Backup; archiving; redundant storage
Storage area network (SAN) | Excellent for moving large blocks of data; exceptional reliability; easily available; fault tolerance; scalability | Expensive; lack of standardization; management complexity | Large databases; bandwidth-intensive applications; mission-critical applications
Software Level of Data storage
Data flow of NGS
[Figure: data-flow diagram. Labels: Sequencer; Raw Data; Complex workflow (Alignment, Assembly, Association); Meaningful Biology Data; Data Store]
Meaningful biology data:
• Annotation of features
• Variations/mutations
• Protein structures
• Gene expression
• Function networks

Data Management
• Classify the data into different levels
  - First level of storage: dynamic, fast, temporary
  - Second level of storage: slower than the first level, but durable and safe
  - Third level of storage: high-capacity medium for backups and archives
• Choosing file systems
  - Currently popular distributed file systems include Lustre, HDFS, MogileFS, FreeNAS, FastDFS, OpenAFS, MooseFS, pNFS, and GoogleFS.
Classify the data into different levels
• First level of storage: dynamic, fast, temporary
  - Intermediate results of data analysis
  - Reference data
  - …
• Second level of storage: slower than the first level, but durable and safe
  - Sequencing raw data
  - Meaningful data
• Third level of storage: high-capacity medium for backups and archives
  - Backups and archives of raw data and meaningful data
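
The three-level classification above can be expressed as a simple lookup. The sketch below is only an illustration of the idea; the category names and the `tier_for` helper are hypothetical and not part of any BGI tooling.

```python
from enum import Enum


class Tier(Enum):
    FIRST = "dynamic, fast, temporary"
    SECOND = "slower, durable and safe"
    THIRD = "high-capacity backups and archives"


# Hypothetical category names, following the examples on the slide.
TIER_BY_CATEGORY = {
    "intermediate_result": Tier.FIRST,
    "reference_data": Tier.FIRST,
    "raw_sequencing_data": Tier.SECOND,
    "meaningful_data": Tier.SECOND,
    "backup": Tier.THIRD,
    "archive": Tier.THIRD,
}


def tier_for(category):
    """Return the storage tier for a data category, or raise ValueError."""
    try:
        return TIER_BY_CATEGORY[category]
    except KeyError:
        raise ValueError(f"unknown data category: {category!r}") from None


print(tier_for("raw_sequencing_data").name)
```

Keeping the mapping in one table makes the placement policy explicit and easy to audit.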
Distributed File systems
• Lustre
Lustre is a large-scale, safe, reliable and highly available cluster file system, developed and maintained by Sun. It can support more than 10,000 nodes and petabyte-scale storage systems.
• Hadoop (HDFS)
Hadoop is not just a distributed file system (HDFS) for storage; it is a framework for running large-scale distributed applications on clusters of general-purpose machines.
• OneFS
OneFS scales a single cluster to more than 1.6 petabytes of capacity and up to 10 GB/s (gigabytes per second) of throughput.
• MogileFS (www.danga.com)
• FreeNAS (www.openqrm.org)
• FastDFS (code.google.com/p/fastdfs)
• OpenAFS (www.openafs.org)
• MooseFS (derf.homelinux.org)
• pNFS (www.pnfs.com)
• GoogleFS
Data compression & data security
• Data compression
  - Commonly used (general-purpose):
    - Lempel-Ziv, BWT
  - Specific to DNA sequences:
    - Biocompress, GenCompress, CTW-LZ, GeNML, fqzcomp, sam_comp
• Data security
  - RAID system failure / redundancy
  - File system
  - Network
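
To illustrate the difference between general-purpose and DNA-specific compression, the sketch below compresses a sequence with `zlib` (whose DEFLATE format combines a Lempel-Ziv variant, LZ77, with Huffman coding) and with a naive 2-bit-per-base packing. The 2-bit packing is a simple stand-in chosen for illustration; it is not the algorithm used by any of the tools listed above, and it only handles the four letters A, C, G and T.

```python
import zlib

_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def pack_2bit(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | _CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(data, length):
    """Recover the first `length` bases from 2-bit packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(_BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


sequence = "ACGTACGTAC"
packed = pack_2bit(sequence)
deflated = zlib.compress(sequence.encode("ascii"))

print(len(sequence), "bases ->", len(packed), "bytes with 2-bit packing")
print(len(sequence), "bases ->", len(deflated), "bytes with zlib")
```

The original length has to be stored alongside the packed bytes, because the last byte may contain padding.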
Data Storage Architecture In BGI
[Figure: storage architecture diagram. Labels: Sequencers; Compute Nodes; First Level Storage; Second Level Storage; Third Level Storage; Two Copies; Tape Library; arrows labeled Write and Read]
Distributed Computing on Storage Server
Traditional Genome Assembly: costly, unscalable
[Figure labels: NGS read file; Sequence Assembly; large-memory server (>500 GB); Storage; Users]
Distributed Genome Assembly: cost-effective, scalable
[Figure labels: several storage servers (IBM3630*16 for human genome); Assembly ……]

Hecate
• Constructing de Bruijn graph
• Solving tiny repeats
• Merging bubbles
• Scaffolding
• Merging contigs
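
As a hedged illustration of the first step listed above, the sketch below builds a small de Bruijn graph from reads: each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is a textbook construction, not Hecate's actual implementation, and the function name `build_de_bruijn` is hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph as {prefix: [suffix, ...]} from k-mers.

    Every k-mer in every read contributes one edge, so repeated k-mers
    produce repeated edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


graph = build_de_bruijn(["ACGT", "CGTA"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

In a distributed setting, the k-mers would typically be partitioned across servers; this sketch keeps everything in one in-memory dictionary.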

Gaea 2.1
Inputs: reads, reference genome
• Preprocessing: distributed indexing for load balancing
• Locating: flexible splitting tolerates more mismatches
• Aligning: dynamic programming for robust gap alignment
• SNP calling: standard mapping quality for SNP calling
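
The aligning step above relies on dynamic programming to handle gaps. The sketch below shows the general idea with a standard global alignment score (Needleman-Wunsch with a linear gap penalty). It is a generic textbook example; Gaea's actual algorithm and scoring parameters are not given in the slides, so the scores used here are assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of strings a and b.

    Uses dynamic programming with a linear gap penalty and keeps only
    two rows of the score matrix in memory.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

Aligning "ACGT" against "AGT" scores three matches and one gap, which is why a read with a small deletion still aligns well.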
Q&A
