
Data Storage Technology and Networks

CSI ZC447 / ES ZC447 / IS ZC447 / SS ZC447

Sourish Banerjee
sbanerjee@wilp.bits-pilani.ac.in
BITS Pilani, Pilani Campus

Data Storage Technology and Networks
CSI ZC447 / ES ZC447 / IS ZC447 / SS ZC447
CS 1 & 2

Refer to the course handout for
• Course Introduction
• Books
• Course Structure
• Plan of Lectures
• Evaluation Scheme

Note: No re-attempts are allowed for missed EC 1 components.

BITS Pilani, Pilani Campus


Contact hours, topic titles, topic numbers, and text/reference books or external resources:

1. Computer system architecture: memory bandwidth requirements, memory hierarchy of a computer system (Topic 1.1; Class notes)
2. Hard Disk Drive (HDD): disk geometry, disk characteristics, disk access time, and disk performance parameters (Topic 1.2; T1: Ch-2)
3. Solid State Device (SSD): flash memory, NAND and NOR organization, R/W performance of flash memory (Topic 1.3; T1: Ch-2)
4. Array of disks: disk reliability and the RAID levels (0, 1, 2, 3, 4, 5, 6, 1+0, 0+1), RAID performance parameters, RAID implementations (Topic 1.4; T1: Ch-4, T2: Ch-2 [2.5])

BITS Pilani, Pilani Campus


Computer System Architecture

– Memory Hierarchy

[Block diagram: the processor (with its registers) connects to main memory over address, control, and data buses; a separate address/control/data bus connects it to I/O devices such as printers, modems, secondary storage, and the monitor.]
5
BITS Pilani, Pilani Campus
I/O Techniques

– Polling
– Interrupt driven
– DMA

8
BITS Pilani, Pilani Campus
I/O Techniques

• Polling:
– The CPU checks the status of an I/O device by reading a memory address associated with that device
– Pseudo-asynchronous
• The processor inspects (multiple) devices in rotation
– Cons
• The processor may still be forced to do useless work, or to wait, or both
– Pros
• The CPU can decide how often it needs to poll
9
BITS Pilani, Pilani Campus
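As an illustration of the idea above, here is a minimal polling loop in C against a hypothetical memory-mapped device; the register addresses and the READY bit are invented for the example and are not part of the course material.

/*
 * Illustrative sketch only: busy-wait polling of a hypothetical
 * memory-mapped I/O device. Addresses and bit layout are assumptions.
 */
#include <stdint.h>

#define DEV_STATUS ((volatile uint32_t *)0x40000000u)  /* assumed status register  */
#define DEV_DATA   ((volatile uint32_t *)0x40000004u)  /* assumed data register    */
#define DEV_READY  0x1u                                /* assumed "data ready" bit */

uint32_t poll_read(void)
{
    /* The processor repeatedly reads the status register until the device is
       ready: it may spin doing useless work while it waits. */
    while ((*DEV_STATUS & DEV_READY) == 0)
        ;                      /* busy wait */
    return *DEV_DATA;          /* then perform the data transfer itself */
}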
I/O Techniques

• Interrupts:
– The processor initiates I/O by requesting an operation from the device.
– It may disconnect (and do other work) if the response cannot be immediate, which is usually the case.
– When the device is ready with a response, it interrupts the processor.
• The processor then finishes the I/O with the device.
– Asynchronous, but
• data transfer between the I/O device and memory still requires the processor to execute instructions.
11
BITS Pilani, Pilani Campus
I/O Techniques: Interrupts

13
BITS Pilani, Pilani Campus
I/O Techniques

• Direct Memory Access


– Processor initiates I/O
– DMA controller acts as an intermediary:
• interacts with the device,
• transfers data to/from memory as appropriate, and
• interrupts processor to signal completion.
– From the processor’s perspective DMA controller is
yet another device
• But one that works at semiconductor speeds

14
BITS Pilani, Pilani Campus
I/O Techniques

• I/O Processor
– More sophisticated version of DMA controller with
the ability to execute code: execute I/O routines,
interact with the O/S etc

15
BITS Pilani, Pilani Campus
Three Tier Architecture

• Computing
– Apps such as web servers, video conferencing,
database server, streaming etc.
• Networking
– Provides connectivity between computing nodes
– e.g. web service running on a computing node talks
to a database service running on another computer
• Storage (persistent + non-persistent)
– Where all data resides
16
BITS Pilani, Pilani Campus
Memory Requirements

• Per-computation (non-persistent) data and permanent (persistent) data

• Separate memory/storage is required for both

– Technology driven
• Volatile vs. non-volatile
– Cost driven
• Faster and costlier vs. slower and cheaper

17
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[1]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock cycle              250 ns           25 ns             1 ns           0.4 ns

18
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[2]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock cycle              250 ns           25 ns             1 ns           0.4 ns
Instructions per cycle   0.1              1                 2 (4-way)      8 (quad core, 2 threads/core)

19
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[3]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock cycle              250 ns           25 ns             1 ns           0.4 ns
Instructions per cycle   0.1              1                 2 (4-way)      8 (quad core, 2 threads/core)

Instructions per second = cycles per second * instructions per cycle

Instructions per second  4 * 10^5         40 * 10^6         2 * 10^9       2 * 10^10

20
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[4]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Instructions per second  4 * 10^5         40 * 10^6         2 * 10^9       2 * 10^10
Instruction size         3.8 B            4 B               4 B            4 B
Operands in memory
per instruction          1.8 * 4 B        0.3 * 4 B         0.25 * 4 B     0.25 * 4 B

21
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[5]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Instructions per second  4 * 10^5         40 * 10^6         2 * 10^9       2 * 10^10
Instruction size         3.8 B            4 B               4 B            4 B
Operands in memory
per instruction          1.8 * 4 B        0.3 * 4 B         0.25 * 4 B     0.25 * 4 B

BW demand = instructions per second * (instruction size + operand size)

BW demand                4.4 MB/s         208 MB/s          10 GB/s        100 GB/s

22
BITS Pilani, Pilani Campus
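The sketch below (mine, not from the slides) reproduces the BW-demand row of this table, using exactly the machine parameters listed above.

/* Reproduces the "BW demand" row: IPS * (instruction size + operand bytes). */
#include <stdio.h>

int main(void)
{
    const char  *machine[]   = {"VAX 11/780", "Early pipeline", "Superscalar", "Multi-core"};
    const double cycle_ns[]  = {250.0, 25.0, 1.0, 0.4};
    const double ipc[]       = {0.1, 1.0, 2.0, 8.0};
    const double instr_B[]   = {3.8, 4.0, 4.0, 4.0};
    const double operand_B[] = {1.8 * 4, 0.3 * 4, 0.25 * 4, 0.25 * 4};

    for (int i = 0; i < 4; i++) {
        double ips = (1e9 / cycle_ns[i]) * ipc[i];        /* instructions per second */
        double bw  = ips * (instr_B[i] + operand_B[i]);   /* bytes per second        */
        printf("%-15s %10.3g instr/s  %8.3g MB/s\n", machine[i], ips, bw / 1e6);
    }
    return 0;
}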
Memory Hierarchy[1]

• How do we meet memory BW requirements?


– Observation
• A typical data item may be accessed more than once

– Locality of reference
• Memory references tend to cluster in a small region of memory (spatial locality), and the same data items tend to be accessed repeatedly (temporal locality)

23
BITS Pilani, Pilani Campus
Memory Hierarchy[2]

• How do we meet memory bandwidth requirements?


• Multiple levels
– Early days: register set, primary, secondary, and archival storage
– Present day: register set, L1 cache, L2 cache, DRAM, direct-attached storage, networked storage, and archival storage
• Motivation
– Amortization of cost
– As we move down the hierarchy, both cost and speed decrease.

24
BITS Pilani, Pilani Campus
Memory Hierarchy[3]

• Multi-Level Inclusion Principle


– All the data in level h is included in level h+1
• Reasons?
– Level h+1 is typically more persistent than level h.
– Level h+1 is order(s) of magnitude larger.
– When level h data has to be replaced (Why?)
• Only written data needs to be copied.
• Why is this good savings?

25
BITS Pilani, Pilani Campus
Memory Hierarchy:
Performance
• Exercise:
– Effective Access time for 2-level hierarchy

27
BITS Pilani, Pilani Campus
Memory Hierarchy: Memory Efficiency
• Memory efficiency
– M.E. = 100 * (T_h / T_eff)
– M.E. = 100 / (1 + P_miss * (R - 1)), where R = T_(h+1) / T_h
• Maximum memory efficiency when
– R = 1 or P_miss = 0
• Consider
– R = 10 (CPU/SRAM)
– R = 50 (CPU/DRAM)
– R = 100 (CPU/Disk)
– What P_miss is needed for M.E. = 95% in each of these cases? (a worked sketch follows this slide)
28
BITS Pilani, Pilani Campus
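A worked sketch (mine, not from the slides) for the question above. For a two-level hierarchy T_eff = T_h + P_miss * (T_(h+1) - T_h), so M.E. = 100 / (1 + P_miss * (R - 1)); rearranging gives the miss probability needed for a target efficiency.

/* For M.E. = 100 / (1 + Pmiss * (R - 1)), the Pmiss needed for a target
   efficiency is Pmiss = (100/ME - 1) / (R - 1). */
#include <stdio.h>

int main(void)
{
    const double target_me = 95.0;                /* target memory efficiency, percent */
    const double ratios[]  = {10.0, 50.0, 100.0}; /* R = T(h+1)/Th values from the slide */

    for (int i = 0; i < 3; i++) {
        double pmiss = (100.0 / target_me - 1.0) / (ratios[i] - 1.0);
        printf("R = %5.0f  ->  Pmiss <= %.4f%%\n", ratios[i], pmiss * 100.0);
    }
    return 0;
}
/* Output: R = 10 -> 0.5848%, R = 50 -> 0.1074%, R = 100 -> 0.0532% (approx.) */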
Memory Technologies-
Computational
• Cache between CPU registers and main memory
– Static RAM (6 transistors per cell)
– Typical Access Time ~10ns
• Main Memory
– Dynamic RAM (1 transistor + 1 capacitor)
– Capacitive leakage results in loss of data
• Needs to be refreshed periodically – hence the term
“dynamic”
– Typical Access Time ~50ns
– Typical Refresh Cycle ~100ms.
29
BITS Pilani, Pilani Campus
Memory Technologies-
Persistent
• Hard Disks
– Used for persistent online storage
– Typical access time: 10 to 15ms
– Semi-random or semi-sequential access:
• Access in blocks – typically – of 512 bytes.
– Cost per GB – Approx. Rs 5.50

• Flash Devices (Solid State Drive)


– Electrically Erasable Programmable ROM
– Used for persistent online storage
– Limit on Erases – currently 100,000 to 500,000
– Read Access Time: 50ns
– Write Access Time: 10 micro seconds
– Semi-random or semi-sequential write:
• Blocks – say 512 bits.
– Cost Per GByte – U.S. $5.00 (circa 2007) 30
BITS Pilani, Pilani Campus
Memory Technologies-Archival

• Magnetic Tapes
– Access Time – (Initial) 10 sec.; 60Mbps data transfer
– Density – up to 6.67 billion bits per square inch
– Data Access – Sequential
– Cost - Cheapest

31
BITS Pilani, Pilani Campus
Caching

• L1, L2, and L3 caches sit between the CPU and RAM

– Transparent to the OS and applications
– L1 is typically on the (processor) chip
• R = 1 to 2
• May be separate for data and instructions (why?)
– L3 is typically on-board (i.e. on the processor board or "motherboard")
• R = 5 to 10

32
BITS Pilani, Pilani Campus
Caching- Generic [1]

• Caching as a principle can be applied between any two levels of memory
– e.g. Buffer cache (part of RAM)
• transparent to applications, maintained by the OS, sits between main memory and the hard disk
• R_(RAM, buffer cache) = 1
– e.g. Disk cache
• between RAM and the hard disk
• typically part of the disk controller
• typically semiconductor memory
• may be non-volatile memory on high-end disks to cope with power failures
• transparent to the OS and applications
33
BITS Pilani, Pilani Campus
Storage Devices (Secondary Storage)

– Hard Disk Drive or Hard Disk or Disk Drive

Image source: en.wikipedia.org


34
BITS Pilani, Pilani Campus
Disk Storage

• An electromechanical storage device, called a disk drive, hard drive, or hard disk drive (HDD)
• Originally built for mainframes and PCs
– 14 in. diameter for mainframes in the '60s
– 3.5 in. diameter for PCs from the '80s
• Now available in various sizes:
– Mini disks (2.5 in. dia.) for laptops, gaming consoles, and external pocket storage
– Micro disks (1.68 in. or 1.8 in. dia.) for iPods, cameras, and other handheld devices
• Note: the platter diameter is also referred to as the form factor
35
BITS Pilani, Pilani Campus
Disk Drive: Geometry

Major Components

Platters

Read/write heads

Actuator assembly

Spindle motor

Source: dataclinic.co.uk
36
BITS Pilani, Pilani Campus
Disk Drive: Geometry

• Disk substrate coated with magnetizable material (iron oxide ... rust)
• In the early days the substrate was made of aluminium
• Now glass
– Improved surface uniformity
• Increases reliability
– Reduction in surface defects
• Reduced read/write errors
– Lower flying heights (r/w heads fly at a small height above the disk surface)
– Better stiffness
– Better shock/damage resistance
37
BITS Pilani, Pilani Campus
Data Organization and
Formatting
• Concentric rings or tracks
– Gaps between tracks
– Reduce gap to increase capacity
– Same number of bits per track
(variable packing density)
– Constant angular velocity
• Tracks divided into sectors
• Minimum block size is one
sector
• May have more than one
sector per block
• Individual tracks and sectors
are addressable
38
BITS Pilani, Pilani Campus
Zoned Disk Drive

Each track in a zone has the same number of sectors, determined by the circumference of the innermost track of that zone.

41
BITS Pilani, Pilani Campus
Disk Characteristics
• Fixed (rare) or movable head
– Fixed head
• One r/w head per track, mounted on a fixed rigid arm
– Movable head
• One r/w head per side mounted on a movable arm
• Removable or fixed
• Single or double (usually) sided
• Single or multiple platter
– Heads are joined and aligned
– Aligned tracks on each platter form cylinders
– Data is striped by cylinder
• Reduces head movement
• Increases speed (transfer rate)
• Head mechanism
– Contact (Floppy)
– Fixed gap
– Flying (Winchester)
44
BITS Pilani, Pilani Campus
Capacity
• Capacity: maximum number of bits that can be stored.
• Vendors express capacity in units of gigabytes (GB), where 1 GB = 10^9 bytes
• Capacity is determined by these technology factors:
– Recording density (bits/in): number of bits that can be squeezed into a 1-inch segment of a track.
– Track density (tracks/in): number of tracks that can be squeezed into a 1-inch radial segment.
– Areal density (bits/in^2): product of recording density and track density.
• Modern disk drives can have an areal density of more than 1 terabit per square inch

45
BITS Pilani, Pilani Campus
Computing Disk Capacity
• Capacity = (# bytes/sector) x (# sectors/track) x (# tracks/surface) x (# surfaces/platter) x (# platters/disk)
• Example:
– 512 bytes/sector
– 300 sectors/track
– 20,000 tracks/surface
– 2 surfaces/platter
– 5 platters/disk
– Capacity = 512 x 300 x 20,000 x 2 x 5 = 30.72 GB
46
BITS Pilani, Pilani Campus
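A quick sketch (not from the slides) that reproduces the capacity calculation above, using the same example parameters.

/* Disk capacity = bytes/sector * sectors/track * tracks/surface
 *                 * surfaces/platter * platters/disk               */
#include <stdio.h>

int main(void)
{
    long long bytes_per_sector     = 512;
    long long sectors_per_track    = 300;
    long long tracks_per_surface   = 20000;
    long long surfaces_per_platter = 2;
    long long platters_per_disk    = 5;

    long long capacity = bytes_per_sector * sectors_per_track * tracks_per_surface
                         * surfaces_per_platter * platters_per_disk;

    /* Vendors use 1 GB = 10^9 bytes, so this prints 30.72 GB. */
    printf("Capacity = %lld bytes = %.2f GB\n", capacity, capacity / 1e9);
    return 0;
}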
Disk Drive- Addressing
• Access is always in group of 1 or more contiguous sectors
– Starting sector address must be specified for access
• Addressing:
– Cylinder, Head, Sector (CHS) addressing
– Logical Block Addressing (LBA)
• Issues in LBA:
– Bad sectors (before shipping)
• Address Sliding / Slipping could be used – skip bad sectors for
numbering
– Bad Sectors (during operation)
• Mapping – maintain a map from logical block number to physical CHS address
• Remap when a sector goes bad – use reserve sectors
47
BITS Pilani, Pilani Campus
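The slide above names CHS and LBA addressing without showing the mapping between them; the sketch below gives the standard conversion for an idealized geometry (no zoned recording, no bad-sector remapping). The head and sector counts are assumptions chosen only for illustration.

/* Idealized LBA <-> CHS conversion: assumes a uniform geometry with no
 * zoned recording and no remapping. Sector numbers start at 1.          */
#include <stdio.h>

#define HEADS             16   /* assumed heads (surfaces) per cylinder */
#define SECTORS_PER_TRACK 63   /* assumed sectors per track             */

static long chs_to_lba(long c, long h, long s)
{
    return (c * HEADS + h) * SECTORS_PER_TRACK + (s - 1);
}

static void lba_to_chs(long lba, long *c, long *h, long *s)
{
    *c = lba / (HEADS * SECTORS_PER_TRACK);
    *h = (lba / SECTORS_PER_TRACK) % HEADS;
    *s = (lba % SECTORS_PER_TRACK) + 1;
}

int main(void)
{
    long c, h, s;
    long lba = chs_to_lba(2, 5, 10);        /* cylinder 2, head 5, sector 10 */
    lba_to_chs(lba, &c, &h, &s);
    printf("LBA %ld -> C=%ld H=%ld S=%ld\n", lba, c, h, s);
    return 0;
}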
Disk Drive – Access Time
• Reads and writes are done in sector-sized blocks
– Typically 512 bytes
• Access time = seek time + rotational delay
– Seek time (average seek time is typically 6 to 9 ms)
– Rotational latency (average latency = 1/2 rotation)
• Transfer time: T = b / (r * N)
• Total average access time:
T_a = T_s + 1/(2r) + b/(r * N)
– where T_s is the average seek time,
– r is the rotation speed in revolutions per second,
– b is the number of bytes to be transferred, and
– N is the number of bytes on a track
48
BITS Pilani, Pilani Campus
Disk Access Time Example
• Disk parameters:
• Transfer size is 8 KB
• Advertised average seek is 12 ms
• Disk spins at 7200 RPM
• Transfer rate is 4 MB/sec
• Controller overhead is 2 ms
• Assume the disk is idle, so there is no queuing delay
• What is the average disk access time for a sector?
• Average seek + average rotational delay + transfer time + controller overhead
• 12 ms + 0.5/(7200 RPM/60) + 8 KB / (4 MB/s) + 2 ms
• 12 + 4.17 + 2 + 2 ≈ 20 ms
• The advertised seek time assumes no locality; with locality, measured seeks are typically 1/4 to 1/3 of the advertised figure, bringing the access time from about 20 ms down to about 12 ms
49
BITS Pilani, Pilani Campus
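A small sketch (mine, not from the slides) reproducing the access-time calculation above with the same disk parameters.

/* Average access time = seek + rotational delay + transfer + controller overhead */
#include <stdio.h>

int main(void)
{
    double seek_ms       = 12.0;                    /* advertised average seek */
    double rpm           = 7200.0;
    double transfer_kb   = 8.0;                     /* request size            */
    double rate_mb_s     = 4.0;                     /* media transfer rate     */
    double controller_ms = 2.0;

    double rot_ms  = 0.5 * (60.0 / rpm) * 1000.0;   /* half a rotation: ~4.17 ms */
    double xfer_ms = (transfer_kb * 1000.0) / (rate_mb_s * 1e6) * 1000.0; /* 2 ms */
    double total   = seek_ms + rot_ms + xfer_ms + controller_ms;

    printf("rotation %.2f ms + transfer %.2f ms -> total %.2f ms\n",
           rot_ms, xfer_ms, total);                 /* about 20 ms in total      */
    return 0;
}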
Disk Drive Performance
• Parameters to measure the performance
– Seek time
– Rotational Latency
– IOPS (Input/Output Operations per Second)
• Read
• Write
• Random
• Sequential
• Cache hit
• Cache miss
– MBps
• Number of megabytes per second the drive can sustain
• Useful for measuring sequential workloads like media streaming
50
BITS Pilani, Pilani Campus
Disk Drive Performance: Eye
Opener Facts!
• Access time (Seek + Rotational ) rating
– Important to distinguish between sequential and
random access request set
• Vendors usually quote IOPS numbers to impress
– Important to check whether the quoted IOPS numbers are for cache hits or cache misses
• Real world workload is a mix of accesses with
– Read, Write, random, sequential, cache hit, cache
miss
51
BITS Pilani, Pilani Campus
Flash Memory

• Flash chips are arranged into blocks which are typically 128KB on
NOR and 8KB on NAND flash

• All flash cells are preset to one. These cells can be individually programmed to zero
– But resetting bits from zero back to one cannot be done individually; it can be done only by erasing a complete block

• The lifetime of a flash chip is measured in such erase cycles, with the typical lifetime being 100,000 erases per block
– The erase count should be evenly distributed across all blocks to maximise the lifetime of the flash chip
– This process is known as "wear leveling"
52
BITS Pilani, Pilani Campus
Traditional File System: Erase-
Modify-Write back
• Uses a 1:1 mapping from the emulated block device to the flash chip
– Read the whole erase block, modify the appropriate part of the buffer, then erase and rewrite the entire block
• No wear leveling!
• Unsafe: a power loss between the erase and the write-back loses data
• A slightly better method
– Gather writes to a single erase block and only perform the erase/modify/write-back procedure when a write to a different erase block is requested.
53
BITS Pilani, Pilani Campus
Wear Leveling

• How do we provide wear leveling?

– Sectors of the emulated block device are stored at varying locations on the physical medium
– Something must keep track of the current location of each sector of the emulated block device
• A translation layer is used to keep track of the current mapping (a minimal sketch follows this slide)
– This is a form of journaling file system

55
BITS Pilani, Pilani Campus
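The translation layer mentioned above is essentially a remapping table. The sketch below is a deliberately simplified illustration (not from the course material): it keeps a logical-to-physical page map and sends every write to a fresh physical page; garbage collection, wear counters, and power-fail recovery are omitted.

/* Toy flash translation layer: logical pages are remapped to a fresh
 * physical page on every write (out-of-place update).                 */
#include <stdio.h>

#define NUM_PAGES 8

static int l2p[NUM_PAGES];          /* logical -> physical page map    */
static int next_free = 0;           /* next free physical page (naive) */

static void ftl_init(void)
{
    for (int i = 0; i < NUM_PAGES; i++)
        l2p[i] = -1;                /* -1 means "never written"        */
}

static void ftl_write(int logical)
{
    /* The old physical page (if any) becomes "dead" and would be
     * reclaimed later by garbage collection (not shown here).         */
    l2p[logical] = next_free++;
}

int main(void)
{
    ftl_init();
    ftl_write(3);                   /* first write of logical page 3        */
    ftl_write(3);                   /* update: goes to a new physical page  */
    printf("logical 3 -> physical %d\n", l2p[3]);   /* prints 1             */
    return 0;
}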
JFFS Storage Format

• JFFS is a log-structured file system

• Nodes containing data and metadata are stored on the flash chips sequentially, progressing strictly linearly through the available storage space.

56
BITS Pilani, Pilani Campus
Wear Leveling

57
BITS Pilani, Pilani Campus
Flash Memory: Operations

• Out-of-place updating is usually done to avoid an erase operation on every write
– The latest copy of the data is "live"
• Page meta-data
– "Live" vs. "dead" pages
– "Free" pages are unused pages
• Cleaning
– Required when free pages are not available
– Erasure is done per block
• May require copying of live data
– Some blocks may be erased more often than others
• "Worn" blocks
58
BITS Pilani, Pilani Campus
File System and Flash Access

• Although File Systems abstract out device details


– Most file system internals were designed for semi-
random access models:
• i.e. notions of blocks/sectors, block addresses, and buffers
tied to blocks are incorporated into many file systems
– E.g. Block devices in Unix file systems

• So, either file systems are redesigned for flash

• Or device drivers handle flash memory access and file systems keep the same access model
– This requires a Flash Translation Layer (FTL) that emulates a block device
59
BITS Pilani, Pilani Campus
Questions ?

60
BITS Pilani, Pilani Campus
CS 2

61
BITS Pilani, Pilani Campus
Storage Devices (Secondary Storage)

– Solid State Storage (SSDs and PCI expansion card)


• Flash Memory

Image Source: electronicdesign.com


62
BITS Pilani, Pilani Campus
SSD Fundamentals

– 2.5 in. and 3.5 in. form factors

– Supports SAS, SATA, and FC interfaces and protocols
– Different underlying technologies: flash memory, phase-change memory (PCM), ferroelectric RAM (FRAM)
– Semiconductor-based, hence no mechanical parts
– Predictable performance, since there is no positional latency (i.e. no seek time or rotational latency)

63
BITS Pilani, Pilani Campus
Flash Memory
• Semiconductor based persistent storage
• Two types
– NAND and NOR flash
• Anatomy of flash memory
– Cells -> Pages -> Blocks
– A new flash device comes with all cells set to 1
– Cells can be programmed from 1 to 0
– To change the value of a cell back to 1, the entire block must be erased
• Erasure can be done only at the block level!
64
BITS Pilani, Pilani Campus
Read/Write/Programming on
Flash Memory
• Read operation is the fastest operation
• A first-time write is very fast
– Every cell in the block is preset to 1 and can be individually programmed to 0
– If any part of a flash block has already been written to, subsequent writes to any part of that block require a read/erase/program cycle
• This is about 100 times slower than a read operation
– Erasing alone is about 10 times slower than a read operation
65
BITS Pilani, Pilani Campus
NAND vs. NOR

                    NAND      NOR
Cost per bit        Low       High
Capacity            High      Low
Read speed          Medium    High*
Write speed         High      Low
File storage use    Yes       No
Code execution      Hard      Easy
Standby power       Medium    Low
Erase cycles        High      Low

* Individual cells in NOR are connected in parallel, which enables faster random reads

66
BITS Pilani, Pilani Campus
Anatomy of NAND Flash
• NAND Flash types
– Single level cell (SLC)
• A cell can store 1 bit of data
• Highest performance and longest life span (100,000 program/erase cycles per
cell)
– Multi level cell (MLC)
• Stores 2 bits of data per cell.
• P/E cycles = 10,000
– Enterprise MLC (eMLC)
• MLC with stronger error correction
• Heavily over-provisioned for high performance and reliability
– e.g. a 400 GB eMLC drive might actually have 800 GB of eMLC flash
– Triple level cell (TLC)
• Stores 3 bits per cell
• P/E cycles = 5,000 per cell
• High on capacity but low on performance and reliability 67
BITS Pilani, Pilani Campus
Enterprise Class SSD

• More over-provisioned capacity
– Provides better performance and lifetime
• More cache
– Any write to a block that already contains data requires the existing contents to be copied into the cache
– Helps to coalesce and combine writes
• More channels
– Allows concurrent I/O operations
• More comprehensive warranty
68
BITS Pilani, Pilani Campus
Hybrid Drives

• Having both rotating platter and


solid-state memory (i.e.
combination of HDD and SSD)
– Tradeoff between high capacity
and performance
• Hybrid storage technologies
– Dual drive
• Separate SSD and HDD devices are
installed in a computer
– SSHD drive
• Single drive having NAND flash
memory and HDD 69
BITS Pilani, Pilani Campus
Topics

• Disk reliability measures


• Improving Disk Reliability
– RAID Levels

Image source: msdn.microsoft.com


70
BITS Pilani, Pilani Campus
Disk Performance issues[1]

• Reliability
– Mean Time Between Failures (MTBF)
• e.g. a 1.2 TB SAS drive states an MTBF of 2 million hours
– Annual Failure Rate (AFR)
• Estimates the likelihood that a disk drive will fail during a year of full use
• Individual disk reliability (as claimed in manufacturers' warranties) is often very high
– e.g. rated 30,000 hours, but about 100,000 hours in practice, for an IBM disk in the '80s
71
BITS Pilani, Pilani Campus
Disk Performance issues[2]

• Access Speed
– Access Speed of a pathway = Minimum speed among
all components in the path
– e.g. CPU and Memory Speeds vs. Disk Access Speeds

• Solution:
– Multiple disks, i.e. an array of disks
– Issue: reliability
• MTTF of an array = MTTF of a single disk / number of disks in the array (see the sketch after this slide)
72
BITS Pilani, Pilani Campus
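A small sketch (mine, not from the slides) putting numbers on the last two points: it uses the 2-million-hour MTBF quoted earlier and shows how quickly the array MTTF shrinks as disks are added.

/* Array MTTF = single-drive MTTF / number of drives in the array. */
#include <stdio.h>

int main(void)
{
    const double drive_mtbf_hours = 2e6;      /* 2 million hours, as quoted above */
    const double hours_per_year   = 8760.0;   /* 24 x 365, continuous operation   */
    const int    drives[]         = {1, 10, 100, 1000};

    for (int i = 0; i < 4; i++) {
        double array_mttf        = drive_mtbf_hours / drives[i];
        double failures_per_year = hours_per_year / array_mttf;
        printf("%4d drives: array MTTF %9.0f h, ~%.3f expected failures per year\n",
               drives[i], array_mttf, failures_per_year);
    }
    return 0;
}
/* One drive: ~0.004 failures/year (an AFR of roughly 0.4%); 1000 drives: a failure
   roughly every 83 days, which is why redundancy is needed. */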
Disk Reliability

• Redundancy may be used to improve Reliability


– Device Level Reliability
• Improved by redundant disks
– This of course implies redundant data
– Data Level Reliability
• Improved by redundant data
– This of course implies additional disks

• (RAID) Redundant Array of Inexpensive Disks


– or Redundant Array of Independent Disks
• Different Levels / Modes of Redundancy
• Referred to as RAID levels
73
BITS Pilani, Pilani Campus
How to achieve reliability?

• Use a larger number of small disks!

– How many disks should there be?
– How small should the disks be?
– How should they be structured and used?

74
BITS Pilani, Pilani Campus
Performance Improvement in
Secondary Storage
• In general, using multiple components improves performance
• Similarly, multiple disks should reduce access time
– An array of disks can operate independently and in parallel
• Justification
– With multiple disks, separate I/O requests can be handled in parallel
– A single I/O request can be executed in parallel if the requested data is distributed across multiple disks
• Researchers at the University of California, Berkeley proposed RAID (1988)
75
BITS Pilani, Pilani Campus
RAID

• Redundant Array of Inexpensive Disks


– Connect multiple disks together to
• Increase storage
• Reduce access time
• Increase data redundancy
• Provide fault tolerance
• Many different levels of RAID systems
• differing levels of redundancy,
• error checking,
• capacity, and cost
76
BITS Pilani, Pilani Campus
RAID Fundamentals

• Striping
– Map data to different disks
– Advantage…?
• Mirroring
– Replicate data
– What are the implications…?
• Parity
– Loss recovery/Error correction / detection

77
BITS Pilani, Pilani Campus
RAID

• Characteristics
1. Set of physical disks viewed as single logical drive
by operating system
2. Data distributed across physical drives
3. Can use redundant capacity to store parity
information

78
BITS Pilani, Pilani Campus
Data Mapping in RAID 0

No redundancy or error correction


Data striped across all disks
Round Robin striping
79
BITS Pilani, Pilani Campus
RAID 1

Mirrored Disks
Data is striped across disks
2 copies of each stripe on separate disks
Read from either and Write to both

80
BITS Pilani, Pilani Campus
Data Mapping in RAID 2

Bit-interleaved data

Lots of redundancy
Uses a parallel access technique
Very small strips
Expensive; useful mainly when disks are error-prone
81
BITS Pilani, Pilani Campus
Data Mapping in RAID 3
• Similar to RAID 2
• Only one redundant disk, no matter how large the array
• Simple parity bit for each set of corresponding bits
• Data on failed drive can be reconstructed from surviving data
and parity information
• Question:
• Can achieve very high transfer rates. How?

82
BITS Pilani, Pilani Campus
RAID 4
• Makes use of independent access with block-level striping
• Good for high I/O request rates due to large strips
• Bit-by-bit parity is calculated across corresponding strips on each disk
• Parity is stored on a dedicated parity disk
• Drawback???

83
BITS Pilani, Pilani Campus
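RAID 3, 4 and 5 all protect data with a single XOR parity strip: P = D1 XOR D2 XOR ... XOR Dn, and a lost strip is recovered by XOR-ing the parity with the surviving strips. The sketch below (mine, not from the slides) demonstrates this on small byte buffers.

/* XOR parity as used by RAID 3/4/5: parity = XOR of the data strips,
 * and a missing strip = XOR of the parity with the surviving strips.  */
#include <stdio.h>
#include <string.h>

#define STRIPS      3
#define STRIP_BYTES 4

int main(void)
{
    unsigned char data[STRIPS][STRIP_BYTES] = {
        {0x11, 0x22, 0x33, 0x44},
        {0xA0, 0xB0, 0xC0, 0xD0},
        {0x05, 0x06, 0x07, 0x08},
    };
    unsigned char parity[STRIP_BYTES] = {0};

    /* Compute the parity strip. */
    for (int d = 0; d < STRIPS; d++)
        for (int i = 0; i < STRIP_BYTES; i++)
            parity[i] ^= data[d][i];

    /* Pretend strip 1 failed: rebuild it from parity + surviving strips. */
    unsigned char rebuilt[STRIP_BYTES];
    memcpy(rebuilt, parity, STRIP_BYTES);
    for (int d = 0; d < STRIPS; d++)
        if (d != 1)
            for (int i = 0; i < STRIP_BYTES; i++)
                rebuilt[i] ^= data[d][i];

    printf("rebuilt strip 1 matches: %s\n",
           memcmp(rebuilt, data[1], STRIP_BYTES) == 0 ? "yes" : "no");
    return 0;
}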
RAID 5
• Round robin allocation for parity stripe
• It avoids RAID 4 bottleneck at parity disk
• Commonly used in network servers
• Drawback
– Disk failure has a medium impact on throughput
– Difficult to rebuild in the event of a disk failure (as
compared to RAID level 1)

84
BITS Pilani, Pilani Campus
RAID 6
• Two parity calculations
• Stored in separate blocks on different disks
• High data availability
– Three disks would need to fail before data is lost
• Significant write penalty
• Drawback
– Controller overhead to compute the parities is very high

85
BITS Pilani, Pilani Campus
Nesting of RAID Levels:
RAID(1+0)
• RAID 1 (mirror) arrays are built first,
then combined to form a RAID 0
(stripe) array.
• Provides high levels of:
– I/O performance
– Data redundancy
– Disk fault tolerance.

86
BITS Pilani, Pilani Campus
Nesting of RAID Levels:
RAID(0+1)
• RAID 0 (stripe) arrays are built first, then
combined to form a RAID 1 (mirror) array
• Provides high levels of I/O performance
and data redundancy
• Slightly less fault tolerance than a 1+0
– How…?

87
BITS Pilani, Pilani Campus
RAID Implementations

• Software implementations are provided by many


Operating Systems.
• A software layer sits above the disk device drivers and provides an abstraction between the logical (RAID) drives and the physical drives.
• The server's processor is used to run the RAID software.
• Typically used for simpler configurations like RAID 0 and RAID 1.
88
BITS Pilani, Pilani Campus
Comparison of the RAID levels

Question: Which RAID level should be used when ?


Answer: There is no absolute answer.

BITS Pilani, Pilani Campus


Comparison of the RAID levels cont.

• Manufacturers of disk subsystems have design


options in
• selection of the internal physical hard disks;
• I/O technique used for the communication within the disk
subsystem;
• use of several I/O channels;
• realization of the RAID controller;
• size of the cache;
• cache algorithms themselves;
• behavior during rebuild; and
• provision of advanced functions such as data scrubbing
and preventive rebuild
BITS Pilani, Pilani Campus
Creating a RAID 0 Array
• Requirements: minimum of 2 storage devices
• Primary benefit: Performance
• Things to keep in mind:
• Make sure that you have functional
backups.
• A single device failure will destroy all
data in the array.
• Identify the component devices
lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
• Create a RAID 0 array with these components
sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
• Ensure that the RAID was successfully created by checking the /proc/mdstat file
cat /proc/mdstat
BITS Pilani, Pilani Campus
Creating a RAID 0 Array
Create and Mount the Filesystem

1. Create a filesystem on the array:
   sudo mkfs.ext4 -F /dev/md0
2. Create a mount point to attach the new filesystem:
   sudo mkdir -p /mnt/md0
3. Mount the filesystem:
   sudo mount /dev/md0 /mnt/md0
4. Check whether the new space is available:
   df -h -x devtmpfs -x tmpfs
5. Save the array layout, to make sure that the array is reassembled automatically at boot:
   sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
   (Create /etc/mdadm and /etc/mdadm/mdadm.conf if they do not exist, and re-run the command.)
6. Update the initramfs, or initial RAM file system, so that the array will be available during the early boot process:
   sudo update-initramfs -u
7. Add the new filesystem mount options to the /etc/fstab file for automatic mounting at boot:
   echo '/dev/md0 /mnt/md0 ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab
BITS Pilani, Pilani Campus
Creating a RAID 1 Array
• The RAID 1 array type is implemented by mirroring
data across all available disks.
• Each disk in a RAID 1 array gets a full copy of the
data, providing redundancy in the event of a device
failure.
• Requirements: minimum of 2 storage devices
• Primary benefit: Redundancy
• Things to keep in mind: Since two copies of the data
are maintained, only half of the disk space will be
usable
• Identify the Component Devices
lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
• Create the array
sudo mdadm --create --verbose /dev/md1 --level=1 --raid-devices=2 /dev/sdd /dev/sde
• The mdadm tool will start to mirror the drives.
• This can take some time to complete, but the array can be used during this time.
• You can monitor the progress of the mirroring by checking the /proc/mdstat file

BITS Pilani, Pilani Campus


Creating a RAID 1 Array
Create and Mount the Filesystem

1. Create a filesystem on the array:
   sudo mkfs.ext4 -F /dev/md1
2. Create a mount point to attach the new filesystem:
   sudo mkdir -p /mnt/md1
3. Mount the filesystem:
   sudo mount /dev/md1 /mnt/md1
4. Check whether the new space is available:
   df -h -x devtmpfs -x tmpfs
5. Save the array layout, to make sure that the array is reassembled automatically at boot:
   sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
   (Create /etc/mdadm and /etc/mdadm/mdadm.conf if they do not exist, and re-run the command.)
6. Update the initramfs, or initial RAM file system, so that the array will be available during the early boot process:
   sudo update-initramfs -u
7. Add the new filesystem mount options to the /etc/fstab file for automatic mounting at boot:
   echo '/dev/md1 /mnt/md1 ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab
BITS Pilani, Pilani Campus
Creating a RAID 5 Array
• The RAID 5 array type is implemented by
striping data across the available devices.
• One component of each stripe is a
calculated parity block.
• If a device fails, the parity block and the
remaining blocks can be used to
calculate the missing data.
• The device that receives the parity block
is rotated so that each device has a
balanced amount of parity information.
• Requirements: minimum of 3 storage
devices
• Primary benefit: Redundancy with more
usable capacity.
• While the parity information is
distributed, one disk’s worth of capacity
will be used for parity.
• RAID 5 can suffer from very poor
performance when in a degraded state.

BITS Pilani, Pilani Campus


Creating a RAID 5 Array
• Find the identifiers for the raw disks that you will be using and create the array
lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
sudo mdadm --create --verbose /dev/md5 --level=5 --raid-devices=3 /dev/sdf /dev/sdg /dev/sdh
• The mdadm tool will start to configure the array (it actually uses the recovery process to build
the array for performance reasons).
• This can take some time to complete, but the array can be used during this time.
• You can monitor the progress of the mirroring by checking the /proc/mdstat
cat /proc/mdstat

BITS Pilani, Pilani Campus


Creating a RAID 5 Array
Create and Mount the Filesystem

1. Create a filesystem on the array:
   sudo mkfs.ext4 -F /dev/md5
2. Create a mount point to attach the new filesystem:
   sudo mkdir -p /mnt/md5
3. Mount the filesystem:
   sudo mount /dev/md5 /mnt/md5
4. Check whether the new space is available:
   df -h -x devtmpfs -x tmpfs
5. Save the array layout, to make sure that the array is reassembled automatically at boot:
   sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
   (Create /etc/mdadm and /etc/mdadm/mdadm.conf if they do not exist, and re-run the command.)
6. Update the initramfs, or initial RAM file system, so that the array will be available during the early boot process:
   sudo update-initramfs -u
7. Add the new filesystem mount options to the /etc/fstab file for automatic mounting at boot:
   echo '/dev/md5 /mnt/md5 ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab
BITS Pilani, Pilani Campus
Creating a RAID 6 Array

• The RAID 6 array type is implemented by striping data across the available devices.
• Two components of each stripe are calculated parity blocks.
• If one or two devices fail, the parity blocks and the remaining blocks can be used to
calculate the missing data.
• The devices that receive the parity blocks are rotated so that each device has a balanced
amount of parity information.
• This is similar to a RAID 5 array, but allows for the failure of two drives.
• Requirements: minimum of 4 storage devices
• Primary benefit: Double redundancy with more usable capacity.
• Things to keep in mind: while the parity information is distributed, two disks' worth of capacity will be used for parity.
• RAID 6 can suffer from very poor performance when in a degraded state.

BITS Pilani, Pilani Campus


Creating a RAID 6 Array
• To get started, find the identifiers for the raw disks that you will be using:
lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT

• Create the array
sudo mdadm --create --verbose /dev/md6 --level=6 --raid-devices=4 /dev/sdi /dev/sdj /dev/sdk /dev/sdl

• The mdadm tool will start to configure the array (it actually uses the
recovery process to build the array for performance reasons).
• This can take some time to complete, but the array can be used during this
time.
• You can monitor the progress of the mirroring by checking the /proc/mdstat
file:
cat /proc/mdstat

BITS Pilani, Pilani Campus


Creating a RAID 6 Array
Create and Mount the Filesystem

1. Create a filesystem on the array:
   sudo mkfs.ext4 -F /dev/md6
2. Create a mount point to attach the new filesystem:
   sudo mkdir -p /mnt/md6
3. Mount the filesystem:
   sudo mount /dev/md6 /mnt/md6
4. Check whether the new space is available:
   df -h -x devtmpfs -x tmpfs
5. Save the array layout, to make sure that the array is reassembled automatically at boot:
   sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
   (Create /etc/mdadm and /etc/mdadm/mdadm.conf if they do not exist, and re-run the command.)
6. Update the initramfs, or initial RAM file system, so that the array will be available during the early boot process:
   sudo update-initramfs -u
7. Add the new filesystem mount options to the /etc/fstab file for automatic mounting at boot:
   echo '/dev/md6 /mnt/md6 ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab
BITS Pilani, Pilani Campus
Creating a Complex RAID 10
Array
• The RAID 10 array type is traditionally
implemented by creating a striped RAID 0 array
composed of sets of RAID 1 arrays.
• This nested array type gives both redundancy
and high performance, at the expense of large
amounts of disk space.
• The mdadm utility has its own RAID 10 type that
provides the same type of benefits with
increased flexibility.
• It is not created by nesting arrays, but has many
of the same characteristics and guarantees.
• Requirements: minimum of 4 storage devices
• Primary benefit: Performance and redundancy

BITS Pilani, Pilani Campus


Creating a Complex RAID 10
Array
• The amount of capacity reduction for the array is defined by the number of data copies the user chooses to keep. The number of copies stored with mdadm-style RAID 10 is configurable (not covered in this demonstration). By default, two copies of each data block are stored in what is called the "near" layout. The possible layouts that dictate how each data block is stored are:
• near: The default arrangement. Copies of each chunk are written consecutively when striping, meaning that the copies of the data blocks will be written around the same part of multiple disks.
• far: The first and subsequent copies are written to different parts of the storage devices in the array. For instance, the first chunk might be written near the beginning of a disk, while the second chunk would be written half way down a different disk. This can give some read performance gains for traditional spinning disks, at the expense of write performance.
• offset: Each stripe is copied, offset by one drive. This means that the copies are offset from one another, but still close together on the disk. This helps minimize excessive seeking during some workloads.
BITS Pilani, Pilani Campus
Creating a Complex RAID 10
Array
• To get started, find the identifiers for the raw disks that you will be using:
lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
• Create the array, setting up two copies using the near layout, by not specifying a layout and copy number
sudo mdadm --create --verbose /dev/md10 --level=10 --raid-devices=4 /dev/sdm /dev/sdn /dev/sdo /dev/sdp
• The mdadm tool will start to configure the array (it actually uses the recovery process to build
the array for performance reasons). This can take some time to complete, but the array can
be used during this time. You can monitor the progress of the mirroring by checking the
/proc/mdstat file:
cat /proc/mdstat

BITS Pilani, Pilani Campus


Creating a Complex RAID 10
Array (Advanced)
• If you want to use a different layout, or change the number of copies, you will have to use the
--layout= option, which takes a layout and copy identifier.
• The layouts are n for near, f for far, and o for offset. The number of copies to store is
appended afterwards.
• For instance, to create an array that has 3 copies in the offset layout, the command would
look like this:
sudo mdadm --create --verbose /dev/md0 --level=10 --layout=o3 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd

This not being demonstrated

BITS Pilani, Pilani Campus


Creating a RAID 10 Array
Create and Mount the Filesystem

1. Create a filesystem on the array:
   sudo mkfs.ext4 -F /dev/md10
2. Create a mount point to attach the new filesystem:
   sudo mkdir -p /mnt/md10
3. Mount the filesystem:
   sudo mount /dev/md10 /mnt/md10
4. Check whether the new space is available:
   df -h -x devtmpfs -x tmpfs
5. Save the array layout, to make sure that the array is reassembled automatically at boot:
   sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
   (Create /etc/mdadm and /etc/mdadm/mdadm.conf if they do not exist, and re-run the command.)
6. Update the initramfs, or initial RAM file system, so that the array will be available during the early boot process:
   sudo update-initramfs -u
7. Add the new filesystem mount options to the /etc/fstab file for automatic mounting at boot:
   echo '/dev/md10 /mnt/md10 ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab
BITS Pilani, Pilani Campus
Observe the performance in
different RAID levels

BITS Pilani, Pilani Campus


Observe the failure and
performance impact of failure

BITS Pilani, Pilani Campus


Thank You!

BITS Pilani, Pilani Campus
