DSTN Merged - CSI ZC447 ES ZC447IS ZC447SS ZC447 - CS-1&2
3  Solid State Devices (SSD): Flash memory - NAND and NOR organization, R/W performance of flash memory (1.3; T1: Ch-2)
4  Array of Disks: Disk reliability and different RAID levels (0, 1, 2, 3, 4, 5, 6, 1+0, 0+1), RAID performance parameters, RAID implementations (1.4; T1: Ch-4, T2: Ch-2 [2.5])
– Memory Hierarchy
[Block diagram: processor and registers connected to main memory over address, data, and control buses]
I/O Techniques
– Polling
– Interrupt driven
– DMA
I/O Techniques
• Polling:
– CPU checks the status of an I/O device by reading a memory address associated with that device
– Pseudo-asynchronous
• Processor inspects (multiple) devices in rotation
– Cons
• Processor may still be forced to do useless work, to wait, or both
– Pros
• CPU can determine how often it needs to poll
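A minimal sketch of polling in C, assuming a hypothetical memory-mapped device whose status and data registers sit at made-up addresses (the register layout and STATUS_READY bit are illustrative, not from any real device):

#include <stdint.h>

#define DEV_STATUS_ADDR 0x40001000u   /* hypothetical status register address */
#define DEV_DATA_ADDR   0x40001004u   /* hypothetical data register address   */
#define STATUS_READY    0x1u          /* bit set when data is available       */

static volatile uint32_t *const dev_status = (volatile uint32_t *)DEV_STATUS_ADDR;
static volatile uint32_t *const dev_data   = (volatile uint32_t *)DEV_DATA_ADDR;

/* Busy-wait (poll) until the device reports ready, then read one word.
 * The CPU does no useful work while spinning, which is the drawback noted above. */
uint32_t poll_and_read(void)
{
    while ((*dev_status & STATUS_READY) == 0)
        ;                      /* spin: repeatedly re-read the status register */
    return *dev_data;          /* device is ready, fetch the data word         */
}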
I/O Techniques
• Interrupts:
– Processor initiates I/O by requesting an operation with the device
– It may disconnect if the response cannot be immediate, which is usually the case
– When the device is ready with a response, it interrupts the processor
• Processor then finishes the I/O with the device
– Asynchronous, but
• Data transfer between the I/O device and memory still requires the processor to execute instructions
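A companion sketch of interrupt-driven I/O, again with hypothetical names: device_isr would be hooked up by the platform's interrupt mechanism, and start_device_read is a placeholder for whatever call starts the operation:

#include <stdint.h>
#include <stdbool.h>

static volatile uint32_t *const dev_data = (volatile uint32_t *)0x40001004u; /* hypothetical */

static volatile bool     data_ready;    /* set by the ISR, cleared by the consumer */
static volatile uint32_t last_word;

/* Interrupt service routine: runs when the device raises its interrupt line.
 * Note that the processor itself still executes the instructions that move
 * the data, as the slide points out. */
void device_isr(void)
{
    last_word  = *dev_data;    /* programmed transfer inside the handler */
    data_ready = true;
}

uint32_t read_when_interrupted(void)
{
    data_ready = false;
    /* start_device_read();       hypothetical call that kicks off the I/O */
    while (!data_ready) {
        /* do other useful work here; the CPU is free until the interrupt */
    }
    return last_word;
}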
I/O Techniques: Interrupts
I/O Techniques
• I/O Processor
– A more sophisticated version of a DMA controller, with the ability to execute code: it can run I/O routines, interact with the OS, etc.
Three-Tier Architecture
• Computing
– Apps such as web servers, video conferencing, database servers, streaming, etc.
• Networking
– Provides connectivity between computing nodes
– e.g. a web service running on one computing node talks to a database service running on another
• Storage (Persistent + Non-Persistent)
– Where all the data resides
Memory Requirements
Memory Bandwidth Requirement [1]

                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock cycle              250 ns           25 ns             1 ns           0.4 ns
Memory Bandwidth Requirement [2]

                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock cycle              250 ns           25 ns             1 ns           0.4 ns
Instructions per cycle   0.1              1                 2 (4-way)      8 (quad core, 2 threads/core)
Memory Hierarchy [1]
– Locality of reference
• Memory references tend to cluster: either a small region of memory locations or the same set of data items is accessed frequently
Memory Hierarchy: Performance
• Exercise:
– Effective access time for a 2-level hierarchy
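One standard way to set this exercise up (a sketch using the notation of the next slide, where Th is the hit time of the upper level, T(h+1) the access time of the level below it, and Pmiss the miss ratio):

\[
T_{\mathrm{eff}} = (1 - P_{\mathrm{miss}})\,T_h + P_{\mathrm{miss}}\,T_{h+1}
                 = T_h\bigl(1 + P_{\mathrm{miss}}(R-1)\bigr),
\qquad R = \frac{T_{h+1}}{T_h}
\]

which is what makes the memory-efficiency expression on the next slide come out to 100 / (1 + Pmiss (R - 1)).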
Memory Hierarchy: Memory Efficiency
• Memory Efficiency
– M.E. = 100 * (Th / Teff)
– M.E. = 100 / (1 + Pmiss * (R - 1)), where R = T(h+1) / Th, the ratio of the next (slower) level's access time to the hit time
• Maximum memory efficiency
– R = 1 or Pmiss = 0
– Consider
• R = 10 (CPU/SRAM)
• R = 50 (CPU/DRAM)
• R = 100 (CPU/Disk)
• What Pmiss gives M.E. = 95% for each of these? (Worked out in the sketch below.)
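A short worked answer to the exercise, rearranging the slide's formula as Pmiss = (100/M.E. - 1) / (R - 1):

#include <stdio.h>

int main(void)
{
    const double me = 95.0;                         /* target memory efficiency in %  */
    const double ratios[] = { 10.0, 50.0, 100.0 };  /* R = T(h+1) / Th from the slide */

    for (int i = 0; i < 3; i++) {
        double r     = ratios[i];
        double pmiss = (100.0 / me - 1.0) / (r - 1.0);
        printf("R = %5.1f  ->  Pmiss = %.4f (%.2f%%)\n", r, pmiss, 100.0 * pmiss);
    }
    return 0;   /* roughly 0.58%, 0.11% and 0.05% respectively */
}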
Memory Technologies - Computational
• Cache between CPU registers and main memory
– Static RAM (6 transistors per cell)
– Typical access time ~10 ns
• Main Memory
– Dynamic RAM (1 transistor + 1 capacitor per cell)
– Capacitive leakage results in loss of data
• Needs to be refreshed periodically - hence the term "dynamic"
– Typical access time ~50 ns
– Typical refresh cycle ~100 ms
Memory Technologies - Persistent
• Hard Disks
– Used for persistent online storage
– Typical access time: 10 to 15 ms
– Semi-random or semi-sequential access:
• Access is in blocks - typically of 512 bytes
– Cost per GB: approx. Rs 5.50
• Magnetic Tapes
– Access time: (initial) 10 sec.; 60 Mbps data transfer
– Density: up to 6.67 billion bits per square inch
– Data access: sequential
– Cost: cheapest
Caching
Caching - Generic [1]

Hard Disk Drive: Major Components
– Platters
– Read/write heads
– Actuator assembly
– Spindle motor
Source: dataclinic.co.uk
Disk Drive: Geometry
Disk Characteristics
• Fixed (rare) or movable head
– Fixed head
• One r/w head per track, mounted on a fixed rigid arm
– Movable head
• One r/w head per side, mounted on a movable arm
• Removable or fixed
• Single or double (usually) sided
• Single or multiple platters
– Heads are joined and aligned
– Aligned tracks on each platter form cylinders
– Data is striped by cylinder
• Reduces head movement
• Increases speed (transfer rate)
• Head mechanism
– Contact (floppy)
– Fixed gap
– Flying (Winchester)
Capacity
• Capacity: maximum number of bits that can be stored
• Vendors express capacity in units of gigabytes (GB), where 1 GB = 10^9 bytes
• Capacity is determined by these technology factors:
– Recording density (bits/in): number of bits that can be squeezed into a 1-inch segment of a track
– Track density (tracks/in): number of tracks that can be squeezed into a 1-inch radial segment
– Areal density (bits/in^2): product of recording density and track density
• Modern disk drives can have areal densities of more than 1 terabit per square inch
Computing Disk Capacity
• Capacity = (# bytes/sector) x (# sectors/track) x (# tracks/surface) x (# surfaces/platter) x (# platters/disk)
• Example:
– 512 bytes/sector
– 300 sectors/track
– 20,000 tracks/surface
– 2 surfaces/platter
– 5 platters/disk
– Capacity = 512 x 300 x 20,000 x 2 x 5 = 30.72 GB
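The same example as a small calculation (using the vendor convention 1 GB = 10^9 bytes from the previous slide):

#include <stdio.h>

int main(void)
{
    long long bytes_per_sector     = 512;
    long long sectors_per_track    = 300;
    long long tracks_per_surface   = 20000;
    long long surfaces_per_platter = 2;
    long long platters_per_disk    = 5;

    long long capacity = bytes_per_sector * sectors_per_track *
                         tracks_per_surface * surfaces_per_platter *
                         platters_per_disk;

    printf("Capacity = %lld bytes = %.2f GB\n", capacity, capacity / 1e9);
    return 0;   /* 30,720,000,000 bytes = 30.72 GB */
}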
Disk Drive - Addressing
• Access is always in groups of 1 or more contiguous sectors
– The starting sector address must be specified for access
• Addressing:
– Cylinder, Head, Sector (CHS) addressing
– Logical Block Addressing (LBA)
• Issues in LBA:
– Bad sectors (before shipping)
• Address sliding / slipping can be used: skip bad sectors while numbering
– Bad sectors (during operation)
• Mapping: maintain a map from logical block number to physical CHS address
• Remap when a sector goes bad, using reserve sectors
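For reference, the classic CHS-to-LBA mapping is LBA = (C * heads_per_cylinder + H) * sectors_per_track + (S - 1), with sectors numbered from 1 within a track; a small sketch with example geometry:

#include <stdio.h>

static long chs_to_lba(long c, long h, long s,
                       long heads_per_cyl, long sectors_per_track)
{
    return (c * heads_per_cyl + h) * sectors_per_track + (s - 1);
}

int main(void)
{
    long heads_per_cyl = 16, sectors_per_track = 63;   /* example geometry only */
    printf("CHS (2, 3, 4) -> LBA %ld\n",
           chs_to_lba(2, 3, 4, heads_per_cyl, sectors_per_track));
    return 0;   /* (2*16 + 3)*63 + 3 = 2208 */
}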
Disk Drive - Access Time
• Reading and writing are done in sector-sized blocks
– Typically 512 bytes
• Access time = seek time + rotational delay
– Seek time (average seek time typically 6 to 9 ms)
– Rotational latency (average latency = 1/2 rotation)
• Transfer time: T = b / (r * N)
• Total average access time:
Ta = Ts + 1/(2r) + b/(r*N)
– Ts is the average seek time
– r is the rotation speed in revolutions per second
– b is the number of bytes to be transferred
– N is the number of bytes on a track
Disk Access Time Example
• Disk parameters:
– Transfer size is 8 KB
– Advertised average seek is 12 ms
– Disk spins at 7200 RPM
– Transfer rate is 4 MB/sec
– Controller overhead is 2 ms
– Assume the disk is idle, so there is no queuing delay
• What is the average disk access time for a sector?
– Average seek + average rotational delay + transfer time + controller overhead
– 12 ms + 0.5/(7200 RPM / 60) + 8 KB / 4 MB/s + 2 ms
– 12 + 4.17 + 2 + 2 ≈ 20 ms
• The advertised seek time assumes no locality; with locality, the actual seek is typically only 1/4 to 1/3 of the advertised figure, so the 20 ms estimate drops to about 12 ms
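The same example as a quick calculation (treating 8 KB / 4 MB/s as roughly 2 ms, as the slide does):

#include <stdio.h>

int main(void)
{
    double seek_ms       = 12.0;                          /* advertised average seek */
    double rpm           = 7200.0;
    double rot_delay_ms  = 0.5 / (rpm / 60.0) * 1000.0;   /* half a rotation         */
    double transfer_ms   = 8.0 / 4000.0 * 1000.0;         /* 8 KB at 4 MB/s          */
    double controller_ms = 2.0;

    printf("Average access time = %.2f ms\n",
           seek_ms + rot_delay_ms + transfer_ms + controller_ms);
    return 0;   /* about 20.2 ms, quoted as ~20 ms on the slide */
}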
Disk Drive Performance
• Parameters used to measure performance
– Seek time
– Rotational latency
– IOPS (Input/Output Operations Per Second)
• Read
• Write
• Random
• Sequential
• Cache hit
• Cache miss
– MBps
• Number of megabytes per second a disk drive can sustain
• Useful for measuring sequential workloads such as media streaming
Disk Drive Performance: Eye-Opener Facts!
• Access time (seek + rotational) rating
– Important to distinguish between sequential and random access request sets
• Vendors usually quote IOPS numbers to impress
– Important to note whether the quoted IOPS numbers are for cache hits or cache misses
• Real-world workloads are a mix of accesses:
– Read, write, random, sequential, cache hit, cache miss
Flash Memory
• Flash chips are arranged into blocks, typically 128 KB on NOR and 8 KB on NAND flash
• All flash cells are preset to one; cells can be individually programmed to zero
– But resetting bits from zero back to one cannot be done individually; it can be done only by erasing a complete block
Wear Leveling

JFFS Storage Format
Flash Memory: Operations
CS 2
Storage Devices (Secondary Storage)
Flash Memory
• Semiconductor-based persistent storage
• Two types
– NAND and NOR flash
• Anatomy of flash memory
– Cells -> Pages -> Blocks
– A new flash device comes with all cells set to 1
– Cells can be programmed from 1 to 0
– To change the value of a cell back to 1, the entire block must be erased
• Erasure is possible at block level only!
Read/Write/Programming on Flash Memory
• Read is the fastest operation
• A first-time write is very fast
– Every cell in the block is preset to 1 and can be individually programmed to 0
– If any part of a flash memory block has already been written to, subsequent writes to any part of that block require a read/erase/program cycle, which is about 100 times slower than a read operation
– Erasing alone is about 10 times slower than a read operation
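A simulated sketch of that read/erase/program cycle; the "flash" here is just an in-memory array, and the block size and helper names are illustrative rather than a real controller API:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 8192                    /* e.g. an 8 KB NAND block          */
static uint8_t flash_block[BLOCK_SIZE];    /* one simulated block              */

static void flash_read_block(uint8_t *buf)          { memcpy(buf, flash_block, BLOCK_SIZE); }
static void flash_erase_block(void)                 { memset(flash_block, 0xFF, BLOCK_SIZE); } /* all cells -> 1 */
static void flash_program_block(const uint8_t *buf) { memcpy(flash_block, buf, BLOCK_SIZE); }  /* 1 -> 0 only    */

/* Rewriting part of an already-programmed block: read it out, erase the whole
 * block (cells back to 1), then reprogram the merged contents -- the reason
 * in-place updates are roughly 100x slower than reads. */
static void flash_rewrite(int offset, const uint8_t *data, int len)
{
    uint8_t buf[BLOCK_SIZE];
    flash_read_block(buf);              /* 1. read existing contents   */
    memcpy(buf + offset, data, len);    /* 2. merge in the new bytes   */
    flash_erase_block();                /* 3. erase: every cell -> 1   */
    flash_program_block(buf);           /* 4. program the merged data  */
}

int main(void)
{
    flash_erase_block();                              /* new device: all 1s  */
    flash_rewrite(0, (const uint8_t *)"hello", 5);    /* first write         */
    flash_rewrite(0, (const uint8_t *)"HELLO", 5);    /* forces full cycle   */
    printf("%.5s\n", (char *)flash_block);
    return 0;
}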
NAND vs. NOR

                    NAND      NOR
Cost per bit        Low       High
Capacity            High      Low
Read speed          Medium    High*
Write speed         High      Low
File storage use    Yes       No
Code execution      Hard      Easy
Standby power       Medium    Low
Erase cycles        High      Low

* Individual cells in NOR are connected in parallel, which makes random reads faster.
Anatomy of NAND Flash
• NAND flash types
– Single-level cell (SLC)
• A cell stores 1 bit of data
• Highest performance and longest life span (100,000 program/erase cycles per cell)
– Multi-level cell (MLC)
• Stores 2 bits of data per cell
• P/E cycles = 10,000
– Enterprise MLC (eMLC)
• MLC with stronger error correction
• Heavily over-provisioned for high performance and reliability
– e.g. a 400 GB eMLC drive might actually contain 800 GB of eMLC flash
– Triple-level cell (TLC)
• Stores 3 bits per cell
• P/E cycles = 5,000 per cell
• High on capacity but low on performance and reliability
Enterprise Class SSD
• Reliability
– Mean Time Between Failures (MTBF)
• e.g. a 1.2 TB SAS drive states an MTBF value of 2 million hours
– Annual Failure Rate (AFR)
• Estimates the likelihood that a disk drive will fail during a year of full use
• Individual disk reliability (as claimed in manufacturers' warranties) is often very high
– e.g. an IBM disk in the 1980s was rated at 30,000 hours but achieved around 100,000 hours in practice
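A common back-of-the-envelope way to relate MTBF and AFR, assuming 24x7 operation (vendors may define AFR somewhat differently; the figure below reuses the 2-million-hour MTBF quoted above):

#include <stdio.h>

int main(void)
{
    double mtbf_hours     = 2.0e6;          /* e.g. the 1.2 TB SAS drive above */
    double hours_per_year = 24.0 * 365.0;
    double afr            = hours_per_year / mtbf_hours;

    printf("AFR = %.3f%% per year\n", afr * 100.0);
    return 0;   /* about 0.44% */
}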
Disk Performance Issues [2]
• Access speed
– Access speed of a pathway = minimum speed among all components in the path
– e.g. CPU and memory speeds vs. disk access speeds
• Solution:
– Multiple disks, i.e. an array of disks
– Issue: reliability
• MTTF of an array = MTTF of a single disk / # disks in the array
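A quick illustration of that reliability penalty; the single-disk MTTF below is just an example figure:

#include <stdio.h>

int main(void)
{
    double mttf_single = 1.0e6;    /* hours, e.g. a quoted 1,000,000-hour disk */
    int    disks       = 100;
    double mttf_array  = mttf_single / disks;   /* assumes independent failures */

    printf("Array MTTF = %.0f hours (~%.1f years)\n",
           mttf_array, mttf_array / (24.0 * 365.0));
    return 0;   /* 10,000 hours, roughly 1.1 years */
}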
Disk Reliability
Performance Improvement in Secondary Storage
• In general, multiple components improve performance
• Similarly, should multiple disks reduce access time?
– Arrays of disks operate independently and in parallel
• Justification
– With multiple disks, separate I/O requests can be handled in parallel
– A single I/O request can be executed in parallel if the requested data is distributed across multiple disks
• Researchers at the University of California, Berkeley proposed RAID (1988)
RAID
• Striping
– Map data to different disks
– Advantage...?
• Mirroring
– Replicate data
– What are the implications...?
• Parity
– Loss recovery / error detection and correction (see the sketch below)
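A small sketch of how XOR parity (the scheme used by RAID 4/5) lets a lost strip be rebuilt; the data bytes are made up:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t d0 = 0xA5, d1 = 0x3C, d2 = 0x0F;  /* three data strips (one byte each) */
    uint8_t parity = d0 ^ d1 ^ d2;             /* written to the parity strip       */

    /* Pretend the disk holding d1 failed; rebuild it from the survivors. */
    uint8_t rebuilt_d1 = parity ^ d0 ^ d2;

    printf("original d1 = 0x%02X, rebuilt d1 = 0x%02X\n", d1, rebuilt_d1);
    return 0;   /* both print 0x3C */
}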
RAID
• Characteristics
1. A set of physical disks viewed as a single logical drive by the operating system
2. Data is distributed across the physical drives
3. Redundant capacity can be used to store parity information
Data Mapping in RAID 0

RAID 1: Mirrored Disks
• Data is striped across disks
• 2 copies of each stripe, kept on separate disks
• Read from either copy, write to both
Data Mapping in RAID 2
RAID 4
• Makes use of independent access with block-level striping
• Good for a high I/O request rate due to large strips
• Bit-by-bit parity is calculated across corresponding strips on each data disk
• Parity is stored on a dedicated parity disk
• Drawback???
RAID 5
• Round-robin allocation of the parity strip
• Avoids the RAID 4 bottleneck at the parity disk
• Commonly used in network servers
• Drawbacks
– A disk failure has a medium impact on throughput
– Difficult to rebuild in the event of a disk failure (compared to RAID level 1)
RAID 6
• Two parity calculations
• Parity stored in separate blocks on different disks
• High data availability
– Three disks must fail before data is lost
– Significant write penalty
• Drawback
– Controller overhead to compute parity is very high
Nesting of RAID Levels: RAID (1+0)
• RAID 1 (mirror) arrays are built first, then combined to form a RAID 0 (stripe) array
• Provides high levels of:
– I/O performance
– Data redundancy
– Disk fault tolerance
Nesting of RAID Levels: RAID (0+1)
• RAID 0 (stripe) arrays are built first, then combined to form a RAID 1 (mirror) array
• Provides high levels of I/O performance and data redundancy
• Slightly less fault tolerance than 1+0
– How...?
RAID Implementations
• A RAID 6 array is implemented by striping data across the available devices
• Two components of each stripe are calculated parity blocks
• If one or two devices fail, the parity blocks and the remaining blocks can be used to reconstruct the missing data
• The devices that receive the parity blocks are rotated so that each device holds a balanced amount of parity information
• This is similar to a RAID 5 array, but it allows for the failure of two drives
• Requirements: a minimum of 4 storage devices
• Primary benefit: double redundancy with more usable capacity
• Things to keep in mind: although the parity information is distributed, two disks' worth of capacity is used for parity, and RAID 6 can suffer from very poor performance when in a degraded state
• The mdadm tool will start to configure the array (it actually uses the recovery process to build the array, for performance reasons)
• This can take some time to complete, but the array can be used during this time
• You can monitor the progress of the build by checking the /proc/mdstat file:
cat /proc/mdstat