Storage Solutions For Bioinformatics: Li Yan
Li Yan
Director of FlexLab, Bioinformatics core technology laboratory
liyan3@genomics.cn
http://www.genomics.cn/FlexLab/index.html
• Background
• Hardware Infrastructure of Data Storage
• Data Management
• Data Storage Architecture In BGI
• Distributed Computing on Storage Server
Background:
Fast Growing Big Data
• From small genomes to large complex genomes
E. coli genome: 4.9 Mb
Caenorhabditis elegans genome: 100 Mb
Human genome: 3 Gb
Wheat genome: 16 Gb
Salamander genome: 45 Gb
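To make the growth concrete, the genome sizes above can be turned into a rough estimate of raw sequencing data volume. This is a minimal sketch, not a figure from the talk: the 30x coverage and the 2 bytes per sequenced base (one byte for the base call, one for its quality score, uncompressed) are assumptions chosen for illustration, and the function names are hypothetical.

```python
# Genome sizes in base pairs, taken from the list above.
GENOME_SIZES_BP = {
    "E. coli": 4.9e6,
    "Caenorhabditis elegans": 100e6,
    "Human": 3e9,
    "Wheat": 16e9,
    "Salamander": 45e9,
}


def raw_read_bytes(genome_bp, coverage=30, bytes_per_base=2):
    """Rough uncompressed read volume for one sample.

    coverage and bytes_per_base are illustrative assumptions:
    30x depth, one byte per base call plus one per quality score.
    """
    return genome_bp * coverage * bytes_per_base


def human_readable(n_bytes):
    """Format a byte count using decimal (power-of-1000) units."""
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if n_bytes < 1000:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000
    return f"{n_bytes:.1f} EB"


if __name__ == "__main__":
    for name, size_bp in GENOME_SIZES_BP.items():
        print(f"{name}: {human_readable(raw_read_bytes(size_bp))}")
```

Under these assumptions a single human sample comes to roughly 180 GB of raw reads before compression, and the larger genomes scale proportionally.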
Magnetic tape
• Advantages: low cost per megabyte; portable; unlimited capacity (with multiple tapes)
• Disadvantages: inconvenient for fast recovery of individual or group files
• Best suited for: archiving; limited-budget businesses; offsite storage
[Figure labels: Sequencer; Raw Data; Complex workflow; Association]
• Annotation of features
• Variations/Mutations
• Protein structures
• Gene expression
• Function Networks
• MogileFS (www.danga.com)
• FreeNAS (www.openqrm.org)
• FastDFS (code.google.com/p/fastdfs)
• OpenAFS (www.openafs.org)
• MooseFS (derf.homelinux.org)
• pNFS (www.pnfs.com)
• GoogleFS
Data Compression & Data Security
Data compression
Commonly used:
Lempel-Ziv, BWT
Designed specifically for DNA sequences:
Biocompress, GenCompress, CTW-LZ, GeNML, fqzcomp, sam_comp
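A small sketch of the difference between general-purpose and sequence-aware compression, using only the Python standard library. zlib (DEFLATE) stands in for the Lempel-Ziv family and bz2 for BWT-based compression. The 2-bit packing shown here is a generic baseline that exploits the four-letter DNA alphabet; it is not the algorithm used by any of the DNA-specific tools listed above, and the function names are hypothetical.

```python
import bz2
import random
import zlib

BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}


def pack_2bit(seq):
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(packed, n_bases):
    """Inverse of pack_2bit; n_bases trims padding in the last byte."""
    seq = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            seq.append(BASES[(byte >> shift) & 3])
    return "".join(seq[:n_bases])


# Synthetic random sequence; real genomes contain repeats and compress better.
random.seed(0)
sequence = "".join(random.choice(BASES) for _ in range(100_000))
raw = sequence.encode("ascii")

sizes = {
    "raw": len(raw),
    "zlib (LZ77-based)": len(zlib.compress(raw, 9)),
    "bz2 (BWT-based)": len(bz2.compress(raw, 9)),
    "2-bit packing": len(pack_2bit(sequence)),
}

if __name__ == "__main__":
    for name, size in sizes.items():
        print(f"{name}: {size} bytes")
```

On a uniformly random sequence the 2-bit packing reaches exactly one quarter of the raw size; dedicated DNA compressors go further by modelling repeats and, for FASTQ or SAM data, quality scores and read names.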
Data security
RAID system failure / redundancy
File system
Network
Data Storage Architecture In BGI
[Figure: storage architecture diagram. Labels: Sequencers; Compute Nodes; First Level Storage (Two Copies); Second Level Storage; Tape Library; arrows labeled Write and Read]
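The "Two Copies" idea in the diagram can be sketched as a write that lands on two independent locations and is verified by checksum. This is an illustration only, not BGI's actual implementation: the directory arguments, the function names, and the choice of SHA-256 are all assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_two_copies(source, primary_dir, replica_dir):
    """Copy source into two (hypothetical) storage locations and verify both.

    Raises IOError if either copy does not match the source checksum.
    """
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for target_dir in (primary_dir, replica_dir):
        target = Path(target_dir) / source.name
        shutil.copy2(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies, expected
```

Storing the returned checksum alongside the data also allows later integrity checks, for example before a file is migrated to the tape library.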
Distributed Computing on Storage Server
Traditional Genome Assembly
• Costly, unscalable
• Sequence assembly on a large-memory server (>500 GB)
[Figure labels: Storage; Users]
Distributed Genome Assembly
• Several storage servers (IBM3630 × 16 for the human genome)
• Assembly
……
• Preprocessing
• Distributed indexing for load balancing
• Standard mapping quality for SNP calling
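One common way to spread an index over several servers is to hash each k-mer to a shard, so every server holds a similar share of the index. The sketch below illustrates that general technique; the talk does not say how the distributed indexing is actually done, so the hashing scheme, the value of k, and all names here are assumptions. Only the server count of 16 comes from the slide above.

```python
import zlib
from collections import Counter

NUM_SERVERS = 16  # the slide mentions 16 storage servers for the human genome


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def server_for(kmer, num_servers=NUM_SERVERS):
    """Map a k-mer to a server with a stable hash.

    CRC32 is used instead of Python's built-in hash(), which is
    randomized per process for strings and would break routing.
    """
    return zlib.crc32(kmer.encode("ascii")) % num_servers


def partition_reads(reads, k=5, num_servers=NUM_SERVERS):
    """Build one k-mer count table per server (hypothetical index shards)."""
    shards = [Counter() for _ in range(num_servers)]
    for read in reads:
        for kmer in kmers(read, k):
            shards[server_for(kmer, num_servers)][kmer] += 1
    return shards


if __name__ == "__main__":
    shards = partition_reads(["ACGTACGTAC", "CGTACGTACG"])
    for index, shard in enumerate(shards):
        if shard:
            print(index, dict(shard))
```

Because the same k-mer always maps to the same server, lookups need to contact only one shard, and a reasonably uniform hash keeps the shards close in size.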
Q&A