
Data Storage Technology and Networks

CSI ZC447 / ES ZC447 / IS ZC447 / SS ZC447

Sourish Banerjee
sbanerjee@wilp.bits-pilani.ac.in
BITS Pilani, Pilani Campus

Data Storage Technology and Networks
CSI ZC447 / ES ZC447 / IS ZC447 / SS ZC447
CS 1 & 2

Refer to the course handout for
• Course Introduction
• Books
• Course Structure
• Plan of Lectures
• Evaluation Scheme

Note: No re-attempts are allowed for missed EC 1 components.

BITS Pilani, Pilani Campus


Contact hours, topic titles, topic numbers, and text/reference books or external resources:

1. Computer system architecture: memory bandwidth requirements, memory hierarchy of a computer system (Topic 1.1; Class notes)
2. Hard Disk Drive (HDD): disk geometry, disk characteristics, disk access time, and disk performance parameters (Topic 1.2; T1: Ch-2)
3. Solid State Device (SSD): flash memory, NAND and NOR organization, R/W performance of flash memory (Topic 1.3; T1: Ch-2)
4. Array of disks: disk reliability and the RAID levels (0, 1, 2, 3, 4, 5, 6, 1+0, 0+1), RAID performance parameters, RAID implementations (Topic 1.4; T1: Ch-4, T2: Ch-2 [2.5])

BITS Pilani, Pilani Campus


Computer System Architecture

– Memory Hierarchy

[Block diagram: the processor (with its registers) connects to main memory over address, control, and data buses; a separate address/control/data bus connects it to I/O devices such as printers, modems, secondary storage, and the monitor.]
5
BITS Pilani, Pilani Campus
I/O Techniques

– Polling
– Interrupt driven
– DMA

8
BITS Pilani, Pilani Campus
I/O Techniques

• Polling:
– The CPU checks the status of an I/O device by reading a memory address associated with that device
– Pseudo-asynchronous
• The processor inspects (multiple) devices in rotation
– Cons
• The processor may still be forced to do useless work, or to wait, or both
– Pros
• The CPU can decide how often it needs to poll
9
BITS Pilani, Pilani Campus
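As an illustration of the idea above, here is a minimal polling loop in C against a hypothetical memory-mapped device; the register addresses and the READY bit are invented for the example and are not part of the course material.

/*
 * Illustrative sketch only: busy-wait polling of a hypothetical
 * memory-mapped I/O device. Addresses and bit layout are assumptions.
 */
#include <stdint.h>

#define DEV_STATUS ((volatile uint32_t *)0x40000000u)  /* assumed status register  */
#define DEV_DATA   ((volatile uint32_t *)0x40000004u)  /* assumed data register    */
#define DEV_READY  0x1u                                /* assumed "data ready" bit */

uint32_t poll_read(void)
{
    /* The processor repeatedly reads the status register until the device is
       ready: it may spin doing useless work while it waits. */
    while ((*DEV_STATUS & DEV_READY) == 0)
        ;                      /* busy wait */
    return *DEV_DATA;          /* then perform the data transfer itself */
}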
I/O Techniques

• Interrupts:
– The processor initiates I/O by requesting an operation from the device.
– It may disconnect (and do other work) if the response cannot be immediate, which is usually the case.
– When the device is ready with a response, it interrupts the processor.
• The processor then finishes the I/O with the device.
– Asynchronous, but
• data transfer between the I/O device and memory still requires the processor to execute instructions.
11
BITS Pilani, Pilani Campus
I/O Techniques: Interrupts

13
BITS Pilani, Pilani Campus
I/O Techniques

• Direct Memory Access


– Processor initiates I/O
– DMA controller acts as an intermediary:
• interacts with the device,
• transfers data to/from memory as appropriate, and
• interrupts processor to signal completion.
– From the processor’s perspective DMA controller is
yet another device
• But one that works at semiconductor speeds

14
BITS Pilani, Pilani Campus
I/O Techniques

• I/O Processor
– More sophisticated version of DMA controller with
the ability to execute code: execute I/O routines,
interact with the O/S etc

15
BITS Pilani, Pilani Campus
Three Tier Architecture

• Computing
– Apps such as web servers, video conferencing,
database server, streaming etc.
• Networking
– Provides connectivity between computing nodes
– e.g. web service running on a computing node talks
to a database service running on another computer
• Storage (persistent + non-persistent)
– Where all data resides
16
BITS Pilani, Pilani Campus
Memory Requirements

• Per-computation (non-persistent) data and permanent (persistent) data

• Separate memory/storage is required for both

– Technology driven
• Volatile vs. non-volatile
– Cost driven
• Faster and costlier vs. slower and cheaper

17
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[1]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock cycle              250 ns           25 ns             1 ns           0.4 ns

18
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[2]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock cycle              250 ns           25 ns             1 ns           0.4 ns
Instructions per cycle   0.1              1                 2 (4-way)      8 (quad core, 2 threads/core)

19
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[3]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock cycle              250 ns           25 ns             1 ns           0.4 ns
Instructions per cycle   0.1              1                 2 (4-way)      8 (quad core, 2 threads/core)

Instructions per second = cycles per second * instructions per cycle

Instructions per second  4 * 10^5         40 * 10^6         2 * 10^9       2 * 10^10

20
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[4]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Instructions per second  4 * 10^5         40 * 10^6         2 * 10^9       2 * 10^10
Instruction size         3.8 B            4 B               4 B            4 B
Operands in memory
per instruction          1.8 * 4 B        0.3 * 4 B         0.25 * 4 B     0.25 * 4 B

21
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[5]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Instructions per second  4 * 10^5         40 * 10^6         2 * 10^9       2 * 10^10
Instruction size         3.8 B            4 B               4 B            4 B
Operands in memory
per instruction          1.8 * 4 B        0.3 * 4 B         0.25 * 4 B     0.25 * 4 B

BW demand = instructions per second * (instruction size + operand size)

BW demand                4.4 MB/s         208 MB/s          10 GB/s        100 GB/s

22
BITS Pilani, Pilani Campus
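The sketch below (mine, not from the slides) reproduces the BW-demand row of this table, using exactly the machine parameters listed above.

/* Reproduces the "BW demand" row: IPS * (instruction size + operand bytes). */
#include <stdio.h>

int main(void)
{
    const char  *machine[]   = {"VAX 11/780", "Early pipeline", "Superscalar", "Multi-core"};
    const double cycle_ns[]  = {250.0, 25.0, 1.0, 0.4};
    const double ipc[]       = {0.1, 1.0, 2.0, 8.0};
    const double instr_B[]   = {3.8, 4.0, 4.0, 4.0};
    const double operand_B[] = {1.8 * 4, 0.3 * 4, 0.25 * 4, 0.25 * 4};

    for (int i = 0; i < 4; i++) {
        double ips = (1e9 / cycle_ns[i]) * ipc[i];        /* instructions per second */
        double bw  = ips * (instr_B[i] + operand_B[i]);   /* bytes per second        */
        printf("%-15s %10.3g instr/s  %8.3g MB/s\n", machine[i], ips, bw / 1e6);
    }
    return 0;
}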
Memory Hierarchy[1]

• How do we meet memory BW requirements?


– Observation
• A typical data item may be accessed more than once

– Locality of reference
• Memory references tend to cluster in a small region of memory (spatial locality), and the same data items tend to be accessed repeatedly (temporal locality)

23
BITS Pilani, Pilani Campus
Memory Hierarchy[2]

• How do we meet memory bandwidth requirements?


• Multiple levels
– Early days: register set, primary, secondary, and archival storage
– Present day: register set, L1 cache, L2 cache, DRAM, direct-attached storage, networked storage, and archival storage
• Motivation
– Amortization of cost
– As we move down the hierarchy, both cost and speed decrease.

24
BITS Pilani, Pilani Campus
Memory Hierarchy[3]

• Multi-Level Inclusion Principle


– All the data in level h is included in level h+1
• Reasons?
– Level h+1 is typically more persistent than level h.
– Level h+1 is order(s) of magnitude larger.
– When level h data has to be replaced (Why?)
• Only written data needs to be copied.
• Why is this good savings?

25
BITS Pilani, Pilani Campus
Memory Hierarchy:
Performance
• Exercise:
– Effective Access time for 2-level hierarchy

27
BITS Pilani, Pilani Campus
Memory Hierarchy: Memory Efficiency
• Memory efficiency
– M.E. = 100 * (T_h / T_eff)
– M.E. = 100 / (1 + P_miss * (R - 1)), where R = T_(h+1) / T_h
• Maximum memory efficiency when
– R = 1 or P_miss = 0
• Consider
– R = 10 (CPU/SRAM)
– R = 50 (CPU/DRAM)
– R = 100 (CPU/Disk)
– What P_miss is needed for M.E. = 95% in each of these cases? (a worked sketch follows this slide)
28
BITS Pilani, Pilani Campus
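A worked sketch (mine, not from the slides) for the question above. For a two-level hierarchy T_eff = T_h + P_miss * (T_(h+1) - T_h), so M.E. = 100 / (1 + P_miss * (R - 1)); rearranging gives the miss probability needed for a target efficiency.

/* For M.E. = 100 / (1 + Pmiss * (R - 1)), the Pmiss needed for a target
   efficiency is Pmiss = (100/ME - 1) / (R - 1). */
#include <stdio.h>

int main(void)
{
    const double target_me = 95.0;                /* target memory efficiency, percent */
    const double ratios[]  = {10.0, 50.0, 100.0}; /* R = T(h+1)/Th values from the slide */

    for (int i = 0; i < 3; i++) {
        double pmiss = (100.0 / target_me - 1.0) / (ratios[i] - 1.0);
        printf("R = %5.0f  ->  Pmiss <= %.4f%%\n", ratios[i], pmiss * 100.0);
    }
    return 0;
}
/* Output: R = 10 -> 0.5848%, R = 50 -> 0.1074%, R = 100 -> 0.0532% (approx.) */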
Memory Technologies-
Computational
• Cache between CPU registers and main memory
– Static RAM (6 transistors per cell)
– Typical Access Time ~10ns
• Main Memory
– Dynamic RAM (1 transistor + 1 capacitor)
– Capacitive leakage results in loss of data
• Needs to be refreshed periodically – hence the term
“dynamic”
– Typical Access Time ~50ns
– Typical Refresh Cycle ~100ms.
29
BITS Pilani, Pilani Campus
Memory Technologies-
Persistent
• Hard Disks
– Used for persistent online storage
– Typical access time: 10 to 15ms
– Semi-random or semi-sequential access:
• Access in blocks – typically – of 512 bytes.
– Cost per GB – Approx. Rs 5.50

• Flash Devices (Solid State Drive)


– Electrically Erasable Programmable ROM
– Used for persistent online storage
– Limit on Erases – currently 100,000 to 500,000
– Read Access Time: 50ns
– Write Access Time: 10 micro seconds
– Semi-random or semi-sequential write:
• Blocks – say 512 bits.
– Cost Per GByte – U.S. $5.00 (circa 2007) 30
BITS Pilani, Pilani Campus
Memory Technologies-Archival

• Magnetic Tapes
– Access Time – (Initial) 10 sec.; 60Mbps data transfer
– Density – up to 6.67 billion bits per square inch
– Data Access – Sequential
– Cost - Cheapest

31
BITS Pilani, Pilani Campus
Caching

• L1, L2, and L3 caches sit between the CPU and RAM

– Transparent to the OS and applications
– L1 is typically on the (processor) chip
• R = 1 to 2
• May be separate for data and instructions (why?)
– L3 is typically on-board (i.e. on the processor board or "motherboard")
• R = 5 to 10

32
BITS Pilani, Pilani Campus
Caching- Generic [1]

• Caching as a principle can be applied between any two levels of memory
– e.g. Buffer cache (part of RAM)
• transparent to applications, maintained by the OS, sits between main memory and the hard disk
• R_(RAM, buffer cache) = 1
– e.g. Disk cache
• between RAM and the hard disk
• typically part of the disk controller
• typically semiconductor memory
• may be non-volatile memory on high-end disks to cope with power failures
• transparent to the OS and applications
33
BITS Pilani, Pilani Campus
Storage Devices (Secondary Storage)

– Hard Disk Drive or Hard Disk or Disk Drive

Image source: en.wikipedia.org


34
BITS Pilani, Pilani Campus
Disk Storage

• An electromechanical storage device, called a disk drive, hard drive, or hard disk drive (HDD)
• Originally built for mainframes and PCs
– 14 in. diameter for mainframes in the '60s
– 3.5 in. diameter for PCs from the '80s
• Now available in various sizes:
– Mini disks (2.5 in. dia.) for laptops, gaming consoles, and external pocket storage
– Micro disks (1.68 in. or 1.8 in. dia.) for iPods, cameras, and other handheld devices
• Note: the platter diameter is also referred to as the form factor
35
BITS Pilani, Pilani Campus
Disk Drive: Geometry

Major Components

Platters

Read/write heads

Actuator assembly

Spindle motor

Source: dataclinic.co.uk
36
BITS Pilani, Pilani Campus
Disk Drive: Geometry

• Disk substrate coated with magnetizable material (iron oxide ... rust)
• In the early days the substrate was made of aluminium
• Now glass
– Improved surface uniformity
• Increases reliability
– Reduction in surface defects
• Reduced read/write errors
– Lower flying heights (r/w heads fly at a small height above the disk surface)
– Better stiffness
– Better shock/damage resistance
37
BITS Pilani, Pilani Campus
Data Organization and
Formatting
• Concentric rings or tracks
– Gaps between tracks
– Reduce gap to increase capacity
– Same number of bits per track
(variable packing density)
– Constant angular velocity
• Tracks divided into sectors
• Minimum block size is one
sector
• May have more than one
sector per block
• Individual tracks and sectors
are addressable
38
BITS Pilani, Pilani Campus
Zoned Disk Drive

Each track in a zone has the same number of sectors, determined by the circumference of the innermost track of that zone.

41
BITS Pilani, Pilani Campus
Disk Characteristics
• Fixed (rare) or movable head
– Fixed head
• One r/w head per track, mounted on a fixed rigid arm
– Movable head
• One r/w head per side mounted on a movable arm
• Removable or fixed
• Single or double (usually) sided
• Single or multiple platter
– Heads are joined and aligned
– Aligned tracks on each platter form cylinders
– Data is striped by cylinder
• Reduces head movement
• Increases speed (transfer rate)
• Head mechanism
– Contact (Floppy)
– Fixed gap
– Flying (Winchester)
44
BITS Pilani, Pilani Campus
Capacity
• Capacity: maximum number of bits that can be stored.
• Vendors express capacity in units of gigabytes (GB), where 1 GB = 10^9 bytes
• Capacity is determined by these technology factors:
– Recording density (bits/in): number of bits that can be squeezed into a 1-inch segment of a track.
– Track density (tracks/in): number of tracks that can be squeezed into a 1-inch radial segment.
– Areal density (bits/in^2): product of recording density and track density.
• Modern disk drives can have an areal density of more than 1 terabit per square inch

45
BITS Pilani, Pilani Campus
Computing Disk Capacity
• Capacity = (# bytes/sector) x (# sectors/track) x (# tracks/surface) x (# surfaces/platter) x (# platters/disk)
• Example:
– 512 bytes/sector
– 300 sectors/track
– 20,000 tracks/surface
– 2 surfaces/platter
– 5 platters/disk
– Capacity = 512 x 300 x 20,000 x 2 x 5 = 30.72 GB
46
BITS Pilani, Pilani Campus
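A quick sketch (not from the slides) that reproduces the capacity calculation above, using the same example parameters.

/* Disk capacity = bytes/sector * sectors/track * tracks/surface
 *                 * surfaces/platter * platters/disk               */
#include <stdio.h>

int main(void)
{
    long long bytes_per_sector     = 512;
    long long sectors_per_track    = 300;
    long long tracks_per_surface   = 20000;
    long long surfaces_per_platter = 2;
    long long platters_per_disk    = 5;

    long long capacity = bytes_per_sector * sectors_per_track * tracks_per_surface
                         * surfaces_per_platter * platters_per_disk;

    /* Vendors use 1 GB = 10^9 bytes, so this prints 30.72 GB. */
    printf("Capacity = %lld bytes = %.2f GB\n", capacity, capacity / 1e9);
    return 0;
}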
Disk Drive- Addressing
• Access is always in group of 1 or more contiguous sectors
– Starting sector address must be specified for access
• Addressing:
– Cylinder, Head, Sector (CHS) addressing
– Logical Block Addressing (LBA)
• Issues in LBA:
– Bad sectors (before shipping)
• Address Sliding / Slipping could be used – skip bad sectors for
numbering
– Bad Sectors (during operation)
• Mapping – maintain a map from logical block number to physical CHS address
• Remap when a sector goes bad – use reserve sectors
47
BITS Pilani, Pilani Campus
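The slide above names CHS and LBA addressing without showing the mapping between them; the sketch below gives the standard conversion for an idealized geometry (no zoned recording, no bad-sector remapping). The head and sector counts are assumptions chosen only for illustration.

/* Idealized LBA <-> CHS conversion: assumes a uniform geometry with no
 * zoned recording and no remapping. Sector numbers start at 1.          */
#include <stdio.h>

#define HEADS             16   /* assumed heads (surfaces) per cylinder */
#define SECTORS_PER_TRACK 63   /* assumed sectors per track             */

static long chs_to_lba(long c, long h, long s)
{
    return (c * HEADS + h) * SECTORS_PER_TRACK + (s - 1);
}

static void lba_to_chs(long lba, long *c, long *h, long *s)
{
    *c = lba / (HEADS * SECTORS_PER_TRACK);
    *h = (lba / SECTORS_PER_TRACK) % HEADS;
    *s = (lba % SECTORS_PER_TRACK) + 1;
}

int main(void)
{
    long c, h, s;
    long lba = chs_to_lba(2, 5, 10);        /* cylinder 2, head 5, sector 10 */
    lba_to_chs(lba, &c, &h, &s);
    printf("LBA %ld -> C=%ld H=%ld S=%ld\n", lba, c, h, s);
    return 0;
}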
Disk Drive – Access Time
• Reads and writes are done in sector-sized blocks
– Typically 512 bytes
• Access time = seek time + rotational delay
– Seek time (average seek time is typically 6 to 9 ms)
– Rotational latency (average latency = 1/2 rotation)
• Transfer time: T = b / (r * N)
• Total average access time:
T_a = T_s + 1/(2r) + b/(r * N)
– where T_s is the average seek time,
– r is the rotation speed in revolutions per second,
– b is the number of bytes to be transferred, and
– N is the number of bytes on a track
48
BITS Pilani, Pilani Campus
Disk Access Time Example
• Disk parameters:
• Transfer size is 8 KB
• Advertised average seek is 12 ms
• Disk spins at 7200 RPM
• Transfer rate is 4 MB/sec
• Controller overhead is 2 ms
• Assume the disk is idle, so there is no queuing delay
• What is the average disk access time for a sector?
• Average seek + average rotational delay + transfer time + controller overhead
• 12 ms + 0.5/(7200 RPM/60) + 8 KB / (4 MB/s) + 2 ms
• 12 + 4.17 + 2 + 2 ≈ 20 ms
• The advertised seek time assumes no locality; with locality, measured seeks are typically 1/4 to 1/3 of the advertised figure, bringing the access time from about 20 ms down to about 12 ms
49
BITS Pilani, Pilani Campus
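A small sketch (mine, not from the slides) reproducing the access-time calculation above with the same disk parameters.

/* Average access time = seek + rotational delay + transfer + controller overhead */
#include <stdio.h>

int main(void)
{
    double seek_ms       = 12.0;                    /* advertised average seek */
    double rpm           = 7200.0;
    double transfer_kb   = 8.0;                     /* request size            */
    double rate_mb_s     = 4.0;                     /* media transfer rate     */
    double controller_ms = 2.0;

    double rot_ms  = 0.5 * (60.0 / rpm) * 1000.0;   /* half a rotation: ~4.17 ms */
    double xfer_ms = (transfer_kb * 1000.0) / (rate_mb_s * 1e6) * 1000.0; /* 2 ms */
    double total   = seek_ms + rot_ms + xfer_ms + controller_ms;

    printf("rotation %.2f ms + transfer %.2f ms -> total %.2f ms\n",
           rot_ms, xfer_ms, total);                 /* about 20 ms in total      */
    return 0;
}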
Disk Drive Performance
• Parameters to measure the performance
– Seek time
– Rotational Latency
– IOPS (Input/Output Operations per Second)
• Read
• Write
• Random
• Sequential
• Cache hit
• Cache miss
– MBps
• Number of megabytes per second the drive can sustain
• Useful for measuring sequential workloads like media streaming
50
BITS Pilani, Pilani Campus
Disk Drive Performance: Eye
Opener Facts!
• Access time (Seek + Rotational ) rating
– Important to distinguish between sequential and
random access request set
• Vendors usually quote IOPS numbers to impress
– Important to check whether the quoted IOPS numbers are for cache hits or cache misses
• Real world workload is a mix of accesses with
– Read, Write, random, sequential, cache hit, cache
miss
51
BITS Pilani, Pilani Campus
Flash Memory

• Flash chips are arranged into blocks which are typically 128KB on
NOR and 8KB on NAND flash

• All flash cells are preset to one. These cells can be individually programmed to zero
– But resetting bits from zero back to one cannot be done individually; it can be done only by erasing a complete block

• The lifetime of a flash chip is measured in such erase cycles, with the typical lifetime being 100,000 erases per block
– The erase count should be evenly distributed across all blocks to maximise the lifetime of the flash chip
– This process is known as "wear leveling"
52
BITS Pilani, Pilani Campus
Traditional File System: Erase-
Modify-Write back
• Uses a 1:1 mapping from the emulated block device to the flash chip
– Read the whole erase block, modify the appropriate part of the buffer, then erase and rewrite the entire block
• No wear leveling!
• Unsafe: a power loss between the erase and the write-back loses data
• A slightly better method
– Gather writes to a single erase block and only perform the erase/modify/write-back procedure when a write to a different erase block is requested.
53
BITS Pilani, Pilani Campus
Wear Leveling

• How do we provide wear leveling?

– Sectors of the emulated block device are stored at varying locations on the physical medium
– Something must keep track of the current location of each sector of the emulated block device
• A translation layer is used to keep track of the current mapping (a minimal sketch follows this slide)
– This is a form of journaling file system

55
BITS Pilani, Pilani Campus
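The translation layer mentioned above is essentially a remapping table. The sketch below is a deliberately simplified illustration (not from the course material): it keeps a logical-to-physical page map and sends every write to a fresh physical page; garbage collection, wear counters, and power-fail recovery are omitted.

/* Toy flash translation layer: logical pages are remapped to a fresh
 * physical page on every write (out-of-place update).                 */
#include <stdio.h>

#define NUM_PAGES 8

static int l2p[NUM_PAGES];          /* logical -> physical page map    */
static int next_free = 0;           /* next free physical page (naive) */

static void ftl_init(void)
{
    for (int i = 0; i < NUM_PAGES; i++)
        l2p[i] = -1;                /* -1 means "never written"        */
}

static void ftl_write(int logical)
{
    /* The old physical page (if any) becomes "dead" and would be
     * reclaimed later by garbage collection (not shown here).         */
    l2p[logical] = next_free++;
}

int main(void)
{
    ftl_init();
    ftl_write(3);                   /* first write of logical page 3        */
    ftl_write(3);                   /* update: goes to a new physical page  */
    printf("logical 3 -> physical %d\n", l2p[3]);   /* prints 1             */
    return 0;
}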
JFFS Storage Format

• JFFS is a log-structured file system

• Nodes containing data and metadata are stored on the flash chips sequentially, progressing strictly linearly through the available storage space.

56
BITS Pilani, Pilani Campus
Wear Leveling

57
BITS Pilani, Pilani Campus
Flash Memory: Operations

• Out-of-place updating is usually done to avoid an erase operation on every write
– The latest copy of the data is "live"
• Page meta-data
– "Live" vs. "dead" pages
– "Free" pages are unused pages
• Cleaning
– Required when free pages are not available
– Erasure is done per block
• May require copying of live data
– Some blocks may be erased more often than others
• "Worn" blocks
58
BITS Pilani, Pilani Campus
File System and Flash Access

• Although File Systems abstract out device details


– Most file system internals were designed for semi-
random access models:
• i.e. notions of blocks/sectors, block addresses, and buffers
tied to blocks are incorporated into many file systems
– E.g. Block devices in Unix file systems

• So, either file systems are redesigned for flash

• Or device drivers handle flash memory access and file systems keep the same access model
– This requires a Flash Translation Layer (FTL) that emulates a block device
59
BITS Pilani, Pilani Campus
Questions ?

60
BITS Pilani, Pilani Campus
CS 2

61
BITS Pilani, Pilani Campus
Storage Devices (Secondary Storage)

– Solid State Storage (SSDs and PCI expansion card)


• Flash Memory

Image Source: electronicdesign.com


62
BITS Pilani, Pilani Campus
SSD Fundamentals

– 2.5 in. and 3.5 in. form factors

– Supports SAS, SATA, and FC interfaces and protocols
– Different underlying technologies: flash memory, phase-change memory (PCM), ferroelectric RAM (FRAM)
– Semiconductor-based, hence no mechanical parts
– Predictable performance, since there is no positional latency (i.e. no seek time or rotational latency)

63
BITS Pilani, Pilani Campus
Flash Memory
• Semiconductor based persistent storage
• Two types
– NAND and NOR flash
• Anatomy of flash memory
– Cells -> Pages -> Blocks
– A new flash device comes with all cells set to 1
– Cells can be programmed from 1 to 0
– To change the value of a cell back to 1, the entire block must be erased
• Erasure can be done only at the block level!
64
BITS Pilani, Pilani Campus
Read/Write/Programming on
Flash Memory
• Read operation is the fastest operation
• A first-time write is very fast
– Every cell in the block is preset to 1 and can be individually programmed to 0
– If any part of a flash block has already been written to, subsequent writes to any part of that block require a read/erase/program cycle
• This is about 100 times slower than a read operation
– Erasing alone is about 10 times slower than a read operation
65
BITS Pilani, Pilani Campus
NAND vs. NOR

                    NAND      NOR
Cost per bit        Low       High
Capacity            High      Low
Read speed          Medium    High*
Write speed         High      Low
File storage use    Yes       No
Code execution      Hard      Easy
Standby power       Medium    Low
Erase cycles        High      Low

* Individual cells in NOR are connected in parallel, which enables faster random reads

66
BITS Pilani, Pilani Campus
Anatomy of NAND Flash
• NAND Flash types
– Single level cell (SLC)
• A cell can store 1 bit of data
• Highest performance and longest life span (100,000 program/erase cycles per
cell)
– Multi level cell (MLC)
• Stores 2 bits of data per cell.
• P/E cycles = 10,000
– Enterprise MLC (eMLC)
• MLC with stronger error correction
• Heavily over-provisioned for high performance and reliability
– e.g. a 400 GB eMLC drive might actually have 800 GB of eMLC flash
– Triple level cell (TLC)
• Stores 3 bits per cell
• P/E cycles = 5,000 per cell
• High on capacity but low on performance and reliability 67
BITS Pilani, Pilani Campus
Enterprise Class SSD

• More over-provisioned capacity
– Provides better performance and lifetime
• More cache
– Any write to a block that already contains data requires the existing contents to be copied into the cache
– Helps to coalesce and combine writes
• More channels
– Allows concurrent I/O operations
• More comprehensive warranty
68
BITS Pilani, Pilani Campus
Hybrid Drives

• Having both rotating platter and


solid-state memory (i.e.
combination of HDD and SSD)
– Tradeoff between high capacity
and performance
• Hybrid storage technologies
– Dual drive
• Separate SSD and HDD devices are
installed in a computer
– SSHD drive
• Single drive having NAND flash
memory and HDD 69
BITS Pilani, Pilani Campus
Topics

• Disk reliability measures


• Improving Disk Reliability
– RAID Levels

Image source: msdn.microsoft.com


70
BITS Pilani, Pilani Campus
Disk Performance issues[1]

• Reliability
– Mean Time Between Failures (MTBF)
• e.g. a 1.2 TB SAS drive states an MTBF of 2 million hours
– Annual Failure Rate (AFR)
• Estimates the likelihood that a disk drive will fail during a year of full use
• Individual disk reliability (as claimed in manufacturers' warranties) is often very high
– e.g. rated 30,000 hours, but about 100,000 hours in practice, for an IBM disk in the '80s
71
BITS Pilani, Pilani Campus
Disk Performance issues[2]

• Access Speed
– Access Speed of a pathway = Minimum speed among
all components in the path
– e.g. CPU and Memory Speeds vs. Disk Access Speeds

• Solution:
– Multiple disks, i.e. an array of disks
– Issue: reliability
• MTTF of an array = MTTF of a single disk / number of disks in the array (see the sketch after this slide)
72
BITS Pilani, Pilani Campus
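A small sketch (mine, not from the slides) putting numbers on the last two points: it uses the 2-million-hour MTBF quoted earlier and shows how quickly the array MTTF shrinks as disks are added.

/* Array MTTF = single-drive MTTF / number of drives in the array. */
#include <stdio.h>

int main(void)
{
    const double drive_mtbf_hours = 2e6;      /* 2 million hours, as quoted above */
    const double hours_per_year   = 8760.0;   /* 24 x 365, continuous operation   */
    const int    drives[]         = {1, 10, 100, 1000};

    for (int i = 0; i < 4; i++) {
        double array_mttf        = drive_mtbf_hours / drives[i];
        double failures_per_year = hours_per_year / array_mttf;
        printf("%4d drives: array MTTF %9.0f h, ~%.3f expected failures per year\n",
               drives[i], array_mttf, failures_per_year);
    }
    return 0;
}
/* One drive: ~0.004 failures/year (an AFR of roughly 0.4%); 1000 drives: a failure
   roughly every 83 days, which is why redundancy is needed. */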
Disk Reliability

• Redundancy may be used to improve Reliability


– Device Level Reliability
• Improved by redundant disks
– This of course implies redundant data
– Data Level Reliability
• Improved by redundant data
– This of course implies additional disks

• (RAID) Redundant Array of Inexpensive Disks


– or Redundant Array of Independent Disks
• Different Levels / Modes of Redundancy
• Referred to as RAID levels
73
BITS Pilani, Pilani Campus
How to achieve reliability?

• Use a larger number of small disks!

– How many disks should there be?
– How small should the disks be?
– How should they be structured and used?

74
BITS Pilani, Pilani Campus
Performance Improvement in
Secondary Storage
• In general, using multiple components improves performance
• Similarly, multiple disks should reduce access time
– An array of disks can operate independently and in parallel
• Justification
– With multiple disks, separate I/O requests can be handled in parallel
– A single I/O request can be executed in parallel if the requested data is distributed across multiple disks
• Researchers at the University of California, Berkeley proposed RAID (1988)
75
BITS Pilani, Pilani Campus
RAID

• Redundant Array of Inexpensive Disks


– Connect multiple disks together to
• Increase storage
• Reduce access time
• Increase data redundancy
• Provide fault tolerance
• Many different levels of RAID systems
• differing levels of redundancy,
• error checking,
• capacity, and cost
76
BITS Pilani, Pilani Campus
RAID Fundamentals

• Striping
– Map data to different disks
– Advantage…?
• Mirroring
– Replicate data
– What are the implications…?
• Parity
– Loss recovery/Error correction / detection

77
BITS Pilani, Pilani Campus
RAID

• Characteristics
1. Set of physical disks viewed as single logical drive
by operating system
2. Data distributed across physical drives
3. Can use redundant capacity to store parity
information

78
BITS Pilani, Pilani Campus
Data Mapping in RAID 0

No redundancy or error correction


Data striped across all disks
Round Robin striping
79
BITS Pilani, Pilani Campus
RAID 1

Mirrored Disks
Data is striped across disks
2 copies of each stripe on separate disks
Read from either and Write to both

80
BITS Pilani, Pilani Campus
Data Mapping in RAID 2

Bit-interleaved data

Lots of redundancy
Uses a parallel access technique
Very small strips
Expensive; useful mainly when disks are error-prone
81
BITS Pilani, Pilani Campus
Data Mapping in RAID 3
• Similar to RAID 2
• Only one redundant disk, no matter how large the array
• Simple parity bit for each set of corresponding bits
• Data on failed drive can be reconstructed from surviving data
and parity information
• Question:
• Can achieve very high transfer rates. How?

82
BITS Pilani, Pilani Campus
RAID 4
• Makes use of independent access with block-level striping
• Good for high I/O request rates due to large strips
• Bit-by-bit parity is calculated across corresponding strips on each disk
• Parity is stored on a dedicated parity disk
• Drawback???

83
BITS Pilani, Pilani Campus
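RAID 3, 4 and 5 all protect data with a single XOR parity strip: P = D1 XOR D2 XOR ... XOR Dn, and a lost strip is recovered by XOR-ing the parity with the surviving strips. The sketch below (mine, not from the slides) demonstrates this on small byte buffers.

/* XOR parity as used by RAID 3/4/5: parity = XOR of the data strips,
 * and a missing strip = XOR of the parity with the surviving strips.  */
#include <stdio.h>
#include <string.h>

#define STRIPS      3
#define STRIP_BYTES 4

int main(void)
{
    unsigned char data[STRIPS][STRIP_BYTES] = {
        {0x11, 0x22, 0x33, 0x44},
        {0xA0, 0xB0, 0xC0, 0xD0},
        {0x05, 0x06, 0x07, 0x08},
    };
    unsigned char parity[STRIP_BYTES] = {0};

    /* Compute the parity strip. */
    for (int d = 0; d < STRIPS; d++)
        for (int i = 0; i < STRIP_BYTES; i++)
            parity[i] ^= data[d][i];

    /* Pretend strip 1 failed: rebuild it from parity + surviving strips. */
    unsigned char rebuilt[STRIP_BYTES];
    memcpy(rebuilt, parity, STRIP_BYTES);
    for (int d = 0; d < STRIPS; d++)
        if (d != 1)
            for (int i = 0; i < STRIP_BYTES; i++)
                rebuilt[i] ^= data[d][i];

    printf("rebuilt strip 1 matches: %s\n",
           memcmp(rebuilt, data[1], STRIP_BYTES) == 0 ? "yes" : "no");
    return 0;
}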
RAID 5
• Round robin allocation for parity stripe
• It avoids RAID 4 bottleneck at parity disk
• Commonly used in network servers
• Drawback
– Disk failure has a medium impact on throughput
– Difficult to rebuild in the event of a disk failure (as
compared to RAID level 1)

84
BITS Pilani, Pilani Campus
RAID 6
• Two parity calculations
• Stored in separate blocks on different disks
• High data availability
– Three disks would need to fail before data is lost
• Significant write penalty
• Drawback
– Controller overhead to compute the parities is very high

85
BITS Pilani, Pilani Campus
Nesting of RAID Levels:
RAID(1+0)
• RAID 1 (mirror) arrays are built first,
then combined to form a RAID 0
(stripe) array.
• Provides high levels of:
– I/O performance
– Data redundancy
– Disk fault tolerance.

86
BITS Pilani, Pilani Campus
Nesting of RAID Levels:
RAID(0+1)
• RAID 0 (stripe) arrays are built first, then
combined to form a RAID 1 (mirror) array
• Provides high levels of I/O performance
and data redundancy
• Slightly less fault tolerance than a 1+0
– How…?

87
BITS Pilani, Pilani Campus
RAID Implementations

• Software implementations are provided by many


Operating Systems.
• A software layer sits above the disk device drivers and provides an abstraction between the logical (RAID) drives and the physical drives.
• The server's processor is used to run the RAID software.
• Typically used for simpler configurations like RAID 0 and RAID 1.
88
BITS Pilani, Pilani Campus
Comparison of the RAID levels

Question: Which RAID level should be used when ?


Answer: There is no absolute answer.

BITS Pilani, Pilani Campus


Comparison of the RAID levels cont.

• Manufacturers of disk subsystems have design


options in
• selection of the internal physical hard disks;
• I/O technique used for the communication within the disk
subsystem;
• use of several I/O channels;
• realization of the RAID controller;
• size of the cache;
• cache algorithms themselves;
• behavior during rebuild; and
• provision of advanced functions such as data scrubbing
and preventive rebuild
BITS Pilani, Pilani Campus
Creating a RAID 0 Array
• Requirements: minimum of 2 storage devices
• Primary benefit: Performance
• Things to keep in mind:
• Make sure that you have functional
backups.
• A single device failure will destroy all
data in the array.
• Identify the component devices
lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
• Create a RAID 0 array with these components
sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
• Ensure that the RAID was successfully created by checking the /proc/mdstat file
cat /proc/mdstat
BITS Pilani, Pilani Campus
Creating a RAID 0 Array
Create and Mount the Filesystem

1. Create a filesystem on the array:
   sudo mkfs.ext4 -F /dev/md0
2. Create a mount point to attach the new filesystem:
   sudo mkdir -p /mnt/md0
3. Mount the filesystem:
   sudo mount /dev/md0 /mnt/md0
4. Check whether the new space is available:
   df -h -x devtmpfs -x tmpfs
5. Save the array layout, to make sure that the array is reassembled automatically at boot:
   sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
   (Create /etc/mdadm and /etc/mdadm/mdadm.conf if they do not exist, and re-run the command.)
6. Update the initramfs, or initial RAM file system, so that the array will be available during the early boot process:
   sudo update-initramfs -u
7. Add the new filesystem mount options to the /etc/fstab file for automatic mounting at boot:
   echo '/dev/md0 /mnt/md0 ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab
BITS Pilani, Pilani Campus
Creating a RAID 1 Array
• The RAID 1 array type is implemented by mirroring
data across all available disks.
• Each disk in a RAID 1 array gets a full copy of the
data, providing redundancy in the event of a device
failure.
• Requirements: minimum of 2 storage devices
• Primary benefit: Redundancy
• Things to keep in mind: Since two copies of the data
are maintained, only half of the disk space will be
usable
• Identify the Component Devices
lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
• Create the array
sudo mdadm --create --verbose /dev/md1 --level=1 --raid-devices=2 /dev/sdd /dev/sde
• The mdadm tool will start to mirror the drives.
• This can take some time to complete, but the array can be used during this time.
• You can monitor the progress of the mirroring by checking the /proc/mdstat file

BITS Pilani, Pilani Campus


Creating a RAID 1 Array
Create and Mount the Filesystem

1. Create a filesystem on the array:
   sudo mkfs.ext4 -F /dev/md1
2. Create a mount point to attach the new filesystem:
   sudo mkdir -p /mnt/md1
3. Mount the filesystem:
   sudo mount /dev/md1 /mnt/md1
4. Check whether the new space is available:
   df -h -x devtmpfs -x tmpfs
5. Save the array layout, to make sure that the array is reassembled automatically at boot:
   sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
   (Create /etc/mdadm and /etc/mdadm/mdadm.conf if they do not exist, and re-run the command.)
6. Update the initramfs, or initial RAM file system, so that the array will be available during the early boot process:
   sudo update-initramfs -u
7. Add the new filesystem mount options to the /etc/fstab file for automatic mounting at boot:
   echo '/dev/md1 /mnt/md1 ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab
BITS Pilani, Pilani Campus
Creating a RAID 5 Array
• The RAID 5 array type is implemented by
striping data across the available devices.
• One component of each stripe is a
calculated parity block.
• If a device fails, the parity block and the
remaining blocks can be used to
calculate the missing data.
• The device that receives the parity block
is rotated so that each device has a
balanced amount of parity information.
• Requirements: minimum of 3 storage
devices
• Primary benefit: Redundancy with more
usable capacity.
• While the parity information is
distributed, one disk’s worth of capacity
will be used for parity.
• RAID 5 can suffer from very poor
performance when in a degraded state.

BITS Pilani, Pilani Campus


Creating a RAID 5 Array
• Find the identifiers for the raw disks that you will be using and create the array
lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
sudo mdadm --create --verbose /dev/md5 --level=5 --raid-devices=3 /dev/sdf /dev/sdg /dev/sdh
• The mdadm tool will start to configure the array (it actually uses the recovery process to build
the array for performance reasons).
• This can take some time to complete, but the array can be used during this time.
• You can monitor the progress of the mirroring by checking the /proc/mdstat
cat /proc/mdstat

BITS Pilani, Pilani Campus


Creating a RAID 5 Array
Create and Mount the Filesystem

1. Create a filesystem on the array:
   sudo mkfs.ext4 -F /dev/md5
2. Create a mount point to attach the new filesystem:
   sudo mkdir -p /mnt/md5
3. Mount the filesystem:
   sudo mount /dev/md5 /mnt/md5
4. Check whether the new space is available:
   df -h -x devtmpfs -x tmpfs
5. Save the array layout, to make sure that the array is reassembled automatically at boot:
   sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
   (Create /etc/mdadm and /etc/mdadm/mdadm.conf if they do not exist, and re-run the command.)
6. Update the initramfs, or initial RAM file system, so that the array will be available during the early boot process:
   sudo update-initramfs -u
7. Add the new filesystem mount options to the /etc/fstab file for automatic mounting at boot:
   echo '/dev/md5 /mnt/md5 ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab
BITS Pilani, Pilani Campus
Creating a RAID 6 Array

• The RAID 6 array type is implemented by striping data across the available devices.
• Two components of each stripe are calculated parity blocks.
• If one or two devices fail, the parity blocks and the remaining blocks can be used to
calculate the missing data.
• The devices that receive the parity blocks are rotated so that each device has a balanced
amount of parity information.
• This is similar to a RAID 5 array, but allows for the failure of two drives.
• Requirements: minimum of 4 storage devices
• Primary benefit: Double redundancy with more usable capacity.
• Things to keep in mind: while the parity information is distributed, two disks' worth of capacity will be used for parity.
• RAID 6 can suffer from very poor performance when in a degraded state.

BITS Pilani, Pilani Campus


Creating a RAID 6 Array
• To get started, find the identifiers for the raw disks that you will be using:
lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT

• Create the array
sudo mdadm --create --verbose /dev/md6 --level=6 --raid-devices=4 /dev/sdi /dev/sdj /dev/sdk /dev/sdl

• The mdadm tool will start to configure the array (it actually uses the
recovery process to build the array for performance reasons).
• This can take some time to complete, but the array can be used during this
time.
• You can monitor the progress of the mirroring by checking the /proc/mdstat
file:
cat /proc/mdstat

BITS Pilani, Pilani Campus


Creating a RAID 6 Array
Create and Mount the Filesystem

1. Create a filesystem on the array:
   sudo mkfs.ext4 -F /dev/md6
2. Create a mount point to attach the new filesystem:
   sudo mkdir -p /mnt/md6
3. Mount the filesystem:
   sudo mount /dev/md6 /mnt/md6
4. Check whether the new space is available:
   df -h -x devtmpfs -x tmpfs
5. Save the array layout, to make sure that the array is reassembled automatically at boot:
   sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
   (Create /etc/mdadm and /etc/mdadm/mdadm.conf if they do not exist, and re-run the command.)
6. Update the initramfs, or initial RAM file system, so that the array will be available during the early boot process:
   sudo update-initramfs -u
7. Add the new filesystem mount options to the /etc/fstab file for automatic mounting at boot:
   echo '/dev/md6 /mnt/md6 ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab
BITS Pilani, Pilani Campus
Creating a Complex RAID 10
Array
• The RAID 10 array type is traditionally
implemented by creating a striped RAID 0 array
composed of sets of RAID 1 arrays.
• This nested array type gives both redundancy
and high performance, at the expense of large
amounts of disk space.
• The mdadm utility has its own RAID 10 type that
provides the same type of benefits with
increased flexibility.
• It is not created by nesting arrays, but has many
of the same characteristics and guarantees.
• Requirements: minimum of 4 storage devices
• Primary benefit: Performance and redundancy

BITS Pilani, Pilani Campus


Creating a Complex RAID 10
Array
• The amount of capacity reduction for the array is defined by the number of data copies the user chooses to keep. The number of copies stored with mdadm-style RAID 10 is configurable (not covered in this demonstration). By default, two copies of each data block are stored in what is called the "near" layout. The possible layouts that dictate how each data block is stored are:
• near: The default arrangement. Copies of each chunk are written consecutively when striping, meaning that the copies of the data blocks will be written around the same part of multiple disks.
• far: The first and subsequent copies are written to different parts of the storage devices in the array. For instance, the first chunk might be written near the beginning of a disk, while the second chunk would be written half way down a different disk. This can give some read performance gains for traditional spinning disks, at the expense of write performance.
• offset: Each stripe is copied, offset by one drive. This means that the copies are offset from one another, but still close together on the disk. This helps minimize excessive seeking during some workloads.
BITS Pilani, Pilani Campus
Creating a Complex RAID 10
Array
• To get started, find the identifiers for the raw disks that you will be using:
lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
• Create the array, setting up two copies using the near layout, by not specifying a layout and copy number
sudo mdadm --create --verbose /dev/md10 --level=10 --raid-devices=4 /dev/sdm /dev/sdn /dev/sdo /dev/sdp
• The mdadm tool will start to configure the array (it actually uses the recovery process to build
the array for performance reasons). This can take some time to complete, but the array can
be used during this time. You can monitor the progress of the mirroring by checking the
/proc/mdstat file:
cat /proc/mdstat

BITS Pilani, Pilani Campus


Creating a Complex RAID 10
Array (Advanced)
• If you want to use a different layout, or change the number of copies, you will have to use the
--layout= option, which takes a layout and copy identifier.
• The layouts are n for near, f for far, and o for offset. The number of copies to store is
appended afterwards.
• For instance, to create an array that has 3 copies in the offset layout, the command would
look like this:
sudo mdadm --create --verbose /dev/md0 --level=10 --layout=o3 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd

This not being demonstrated

BITS Pilani, Pilani Campus


Creating a RAID 10 Array
Create and Mount the Filesystem

1. Create a filesystem on the array:
   sudo mkfs.ext4 -F /dev/md10
2. Create a mount point to attach the new filesystem:
   sudo mkdir -p /mnt/md10
3. Mount the filesystem:
   sudo mount /dev/md10 /mnt/md10
4. Check whether the new space is available:
   df -h -x devtmpfs -x tmpfs
5. Save the array layout, to make sure that the array is reassembled automatically at boot:
   sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
   (Create /etc/mdadm and /etc/mdadm/mdadm.conf if they do not exist, and re-run the command.)
6. Update the initramfs, or initial RAM file system, so that the array will be available during the early boot process:
   sudo update-initramfs -u
7. Add the new filesystem mount options to the /etc/fstab file for automatic mounting at boot:
   echo '/dev/md10 /mnt/md10 ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab
BITS Pilani, Pilani Campus
Observe the performance in
different RAID levels

BITS Pilani, Pilani Campus


Observe the failure and
performance impact of failure

BITS Pilani, Pilani Campus


Thank You!

BITS Pilani, Pilani Campus
