Week 8
Week 8 Lecture 1
Class BSCCS2001
Materials
Module # 36
Type Lecture
Week # 8
Algorithm
An algorithm is an unambiguous, finite sequence of instructions, used as a specification for performing calculations, data processing, automated reasoning and other tasks
Program
A computer program is a collection of instructions that can be executed by a computer to perform a specific task
Analysis of Algorithms
Why?
Why analyze?
What?
What to analyze?
How?
How to analyze?
Where?
Where to analyze?
When?
When to analyze?
Why analyze?
Practical reasons:
Core Issues:
Predict performance
Compare algorithms
Provide guarantees
What to analyze?
Core Issues → Cannot control what we cannot measure
Time
Most common analysis factor
Space
Widely explored
What to analyze?
Sum of natural numbers
int sum(int n)
{
    int s = 0;
    for (; n > 0; --n)
        s = s + n;
    return s;
}
What to analyze?
Find a character in a string
int find(char *str, char c)
{
    for (int n = strlen(str); n > 0; --n)
        if (str[n - 1] == c)
            return 1;
    return 0;
}
What to analyze?
Minimum of a sequence of numbers
int min(int a[], int n)
{
    int t = a[--n];
    while (n > 0)
        if (t > a[--n])
            t = a[n];
    return t;
}
How to analyze?
Counting model
Asymptotic model
Generating functions
Master Theorem
Operations
Intermediate Stages
int fact(int n)
{
    if (n != 0)
        return n * fact(n - 1);
    return 1;
}
int fact(int n)
{
    int t = 1;
    for (; n > 0; --n)
        t = t * n;
    return t;
}
Core Idea → Cannot compare actual times; hence, compare Growth or how the time increases with input size
Big-Oh → O(.)
Big-Omega → Ω(.)
Big-Theta → Θ(.)
int count = 0;
for (int i = 0; i < N; ++i)
    for (int j = i + 1; j < N; ++j)
        if (a[i] + a[j] == 0)
            count++;
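Under the counting model we can instrument the loop above and count the dominant operation exactly; in a hypothetical Python rendering (names made up for this sketch), the comparison runs N(N − 1)/2 times:

```python
def count_pairs(a):
    N = len(a)
    count = 0
    ops = 0                            # how many times the comparison runs
    for i in range(N):
        for j in range(i + 1, N):
            ops += 1
            if a[i] + a[j] == 0:
                count += 1
    return count, ops

count, ops = count_pairs([-2, 1, 2, -1, 3, 0])
assert ops == 6 * 5 // 2               # exactly N(N-1)/2 = 15 comparisons
assert count == 2                      # the pairs (-2, 2) and (1, -1)
```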
1, log N, N, N log N, N², N³ and 2ᴺ suffice to describe the order of growth of most common algorithms
Source: Page #188, Chapter 1: Fundamentals, Section 1.4, Algorithms (4th Edition) by Robert Sedgewick & Kevin Wayne
Source: Page #187, Chapter 1: Fundamentals, Section 1.4, Algorithms (4th Edition) by Robert Sedgewick & Kevin Wayne
Asymptotic Notation
For a given function g(n), we denote by O(g(n)) the set of functions:
O(g(n)) = {f(n): there exist positive constants c and n₀ such that 0 ≤ f(n) ≤ c·g(n), for all n > n₀}
We use O -notation to give an upper bound on a function, to within a constant factor
When we say that the running time of A is O(n²), we mean that there is a function f(n) that is O(n²) such that for any value of n, no matter what particular input of size n is chosen, the running time on that input is bounded from above by the value f(n)
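The definition can be sanity-checked numerically. For a hypothetical f(n) = 3n² + 10n (an assumed running-time function, not from the lecture), the constants c = 4 and n₀ = 10 witness that f(n) is O(n²):

```python
def f(n):
    return 3 * n * n + 10 * n          # hypothetical running-time function

def g(n):
    return n * n

c, n0 = 4, 10                          # witness constants
# 0 <= f(n) <= c*g(n) must hold for every n > n0
assert all(0 <= f(n) <= c * g(n) for n in range(n0 + 1, 1000))
# ... while below the threshold the bound may fail, e.g. at n = 5
assert f(5) > c * g(5)
```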
Where to analyze?
Algorithmic situation
Best case
Worst case
Average case
Probabilistic case
Amortized case
Source: https://www.bigocheatsheet.com
Week 8 Lecture 2
Class BSCCS2001
Materials
Module # 37
Type Lecture
Week # 8
Most data structures have a container for the data and typical operations that they need to perform
Create
Insert
Delete
Find/Search
Close
Efficiency is measured in terms of time and space taken for these operations
Since data elements are sequentially connected, each element is traversable through a single run
Examples of linear data structures are Array, Linked List, Queue, Stack, etc
Array → The data elements are stored at contiguous locations in the memory
Linked List → The data elements are not required to be stored in contiguous locations in memory
Rather, each element stores a link (a pointer or a reference) to the location of the next element
The element that has been inserted first in the queue would be removed first
Thus, insert and removal of the elements in this take place in the same order
The element that has been inserted last in the stack would be removed first
Thus, insert and removal of the elements in this take place in the reverse order
For example → let the array name be arr , we can access the element at index 5 as arr[5]
Arrays allow random access using its index which is fast (cost of O(1))
Have fixed size, not flexible → Since we do not know at runtime the number of elements to be stored: if we create the array too large, memory can be wasted; if we create it too small, some elements may not be accommodated
However, during the execution of the program only 5 elements are available, which results in the wastage of
the memory space
Insertion and removal of elements from an array are costlier since the memory locations have to be consecutive
Remove from the end:
Insert and remove elements at any arbitrary position is costly (cost of O(n))
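A sketch of why arbitrary-position insertion is O(n): every element after the insertion point must shift one slot to the right (the helper name insert_at is made up for this illustration):

```python
def insert_at(arr, pos, value):
    arr.append(None)                   # grow the array by one slot
    shifts = 0
    for i in range(len(arr) - 1, pos, -1):
        arr[i] = arr[i - 1]            # shift elements right
        shifts += 1
    arr[pos] = value
    return shifts

a = [10, 20, 30, 40, 50]
shifts = insert_at(a, 1, 15)
assert a == [10, 15, 20, 30, 40, 50]
assert shifts == 4                     # 4 of the 5 original elements moved
```

In the worst case (insertion at the front) all n existing elements move, hence the O(n) cost.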
A new element can be stored anywhere in the memory where free space is available
For each new element allocated, a link (a pointer or a reference) is created for the new element using which the
element can be added to the linked list
Each element is stored in a node
Flexible in size
Size of a linked list grows or shrinks as and when new elements are inserted or deleted
Insertion or Removal of an element at/from any arbitrary position is efficient as none of the elements are required to
be moved to new locations
Insertion at front
Insertion at end
Insertion at any intermediate position
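The insertions above can be sketched with a minimal singly linked list (class and method names are illustrative). Insertion at the front needs only two pointer updates, and no element moves:

```python
class Node:
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

class LinkedList:
    def __init__(self):
        self.head = None

    def insert_front(self, data):      # O(1): two pointer updates
        self.head = Node(data, self.head)

    def insert_end(self, data):        # O(n) without a tail pointer
        node = Node(data)
        if self.head is None:
            self.head = node
            return
        cur = self.head
        while cur.next:
            cur = cur.next
        cur.next = node

    def to_list(self):
        out, cur = [], self.head
        while cur:
            out.append(cur.data)
            cur = cur.next
        return out

ll = LinkedList()
ll.insert_front(2)
ll.insert_front(1)
ll.insert_end(3)
assert ll.to_list() == [1, 2, 3]
```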
Linear Search
The algorithm starts with the first element, compares with the given key value and returns yes if a match is found
If it does not match, then it proceeds sequentially comparing each element of the list with the given key until a match
has been found or the full list is traversed
Let the given input list be inputArr = ['a', 'c', 'a', 'd', 'e', 'm', 'i', 'c', 's'] and the search key be 'i'
def linear_search(arr, k):
    for i in range(len(arr)):
        if arr[i] == k:
            return i
    return -1

inputArr = ['a', 'c', 'a', 'd', 'e', 'm', 'i', 'c', 's']
k = 'i'
index = linear_search(inputArr, k)
if index != -1:
    print("Element found at", index)
Binary Search
The input for the algorithm is always a sorted list
The algorithm compares the key k with the middle element in the list
If the key does not match and is greater than the middle element, then the new list is the list to the right of the middle
element
If the key does not match and is less than the middle element, then the new list is the list to the left of the middle element
Let the given input list be inputArr = ['a', 'a', 'c', 'c', 'd', 'e', 'i', 'm', 's'] and the search key be 'i'
def binary_search(arr, k):
    low = 0
    high = len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] < k:
            low = mid + 1
        elif arr[mid] > k:
            high = mid - 1
        else:
            return mid
    return -1
inputArr = ['a', 'a', 'c', 'c', 'd', 'e', 'i', 'm', 's']
k = 'i'
index = binary_search(inputArr, k)
if index != -1:
    print("Element found at", index)
Source: https://www.bigocheatsheet.com
Week 8 Lecture 3
Class BSCCS2001
Materials
Module # 38
Type Lecture
Week # 8
However, the actual used space may be lower in array while linked list has an overhead of 100% (double)
All of them have complexities that are identical for Worst as well as Average case
All of them offer satisfactory complexity for some operations while being unsatisfactory on the others
Non-linear data structures can be used to trade-off between extremes and achieve a balanced good performance for
all
Non-linear data structures are those data structures in which data items are not arranged in a sequence and each
element may have multiple paths to connect to other elements
Unlike linear data structures, in which each element is directly connected with at most 2 neighbouring elements (the previous and next elements), non-linear data structures may be connected with more than 2 elements
The elements don't have a single path to connect to the other elements but may have multiple paths
Traversing through the elements is not possible in one run as the data is non-linearly arranged
Hash Table → Array with lists (coalesced chains) and one or more hash functions
and so on ...
Graph
Graph → Graph G is a collection of vertices V (store the elements) and connecting edges (links) E between the
vertices:
Undirected or Directed
Unweighted or Weighted
Cyclic or Acyclic
Disconnected or Connected
and so on ...
ER Diagram
Friendships in Facebook
Knowledge Graph
Tree
Tree → Is a connected acyclic graph representing hierarchical relationship
A tree may be:
Rooted or Unrooted
Binary or n-ary
Balanced or Unbalanced
and so on ...
Composite Attributes
Family Genealogy
Search Trees
and so on ...
Root → The node at the top of the tree is called the root
There is only one root per tree and one path from the root node to any node
Parent → The node which is a predecessor of any node is called the parent node
Internal Nodes → The node which has at least one child is called the Internal Node
Path → Path refers to the sequence of nodes along the edges of a tree
B has arity 3, E has arity 2, G has arity 1 and D has arity 0 (Leaf)
Levels → The root node is said to be at level 0 and the children of the root node are at level 1 and the children of the
nodes which are at level 1 will be at level 2 and so on ...
Level is the length of the path (number of links) or distance of a node from the root node
Binary Tree → It is a tree, where each node can have at most 2 children
It has arity 2
h + 1 ≤ n ≤ 2^(h+1) − 1
⌈lg(n + 1)⌉ − 1 ≤ h ≤ n − 1
O(lg n) ≤ h ≤ O(n)
For a k-ary tree, O(logₖ n) ≤ h ≤ O(n)
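These bounds can be checked numerically for small n (a quick sketch, not part of the lecture material):

```python
import math

# For n nodes: ceil(lg(n+1)) - 1 <= h <= n - 1, and h + 1 <= n <= 2^(h+1) - 1
for n in range(1, 200):
    h_min = math.ceil(math.log2(n + 1)) - 1   # height of the most balanced tree
    h_max = n - 1                             # height of a fully skewed tree
    assert h_min <= h_max
    for h in (h_min, h_max):                  # both extremes satisfy the node bound
        assert h + 1 <= n <= 2 ** (h + 1) - 1
```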
Hash Table
Hash Table (or Hash Map) → Implements an associative array abstract data type, a structure that can map keys to
values by using a hash function to compute an index (hash code), into an array of buckets or slots, from which the
desired value can be found
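A minimal sketch of a hash table with separate chaining (class and method names are illustrative): a hash function maps each key to a bucket index, and colliding keys share a bucket's list.

```python
class HashTable:
    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)   # hash code -> bucket slot

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for pair in bucket:
            if pair[0] == key:
                pair[1] = value                # key exists: update in place
                return
        bucket.append([key, value])

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

t = HashTable(nbuckets=2)    # tiny table to force collisions
t.put("dept", "CS")
t.put("name", "Ada")
t.put("dept", "EE")          # same key: overwrite
assert t.get("dept") == "EE"
assert t.get("missing") is None
```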
Open Addressing
2-Choice Hashing
and so on ...
Associative arrays
Database indexing
Caches
and so on ...
The array makes insertion and deletion expensive at O(n)
The linked list, on the other hand, is efficient in insertion and deletion at O(1), while it makes the search
expensive at O(n)
O(1) insert/delete is possible because we just need to manipulate the pointers and not physically move the
data
Using the non-linearity, specifically (binary) trees, we can combine the benefits of both
Note that once an array is sorted, we know the order in which its elements may be checked (for any key) during a
search
As the binary search splits the array, we can conceptually consider the Middle Element to be the Root of a tree and
left (right) sub-array to be its left (right) sub-tree
First → M
Second → L or R
Third:
For L → LL or LR
For R → RL or RR
Recursive ...
Put as a tree
Binary Search Tree (BST) → It is a tree in which all the nodes hold the following:
The value of each node in the left subtree is less than the value of its root
The value of each node in the right subtree is greater than the value of its root
Structure of BST node → Each node consists of an element (X), and a link to the left child or the left subtree (LC),
and a link to the right child or the right subtree (RC)
E, L, P, H, A, N, T
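Building a BST by inserting these keys in order can be sketched as follows (a minimal illustration; each node holds the (LC, X, RC) fields described above):

```python
class BSTNode:
    def __init__(self, x):
        self.x = x                     # element
        self.lc = None                 # left child / subtree
        self.rc = None                 # right child / subtree

def insert(root, x):
    if root is None:
        return BSTNode(x)
    if x < root.x:
        root.lc = insert(root.lc, x)
    else:
        root.rc = insert(root.rc, x)
    return root

def search(root, key):                 # O(h): one comparison per level
    while root is not None:
        if key == root.x:
            return True
        root = root.lc if key < root.x else root.rc
    return False

root = None
for ch in ['E', 'L', 'P', 'H', 'A', 'N', 'T']:
    root = insert(root, ch)

assert root.x == 'E'                   # the first key becomes the root
assert search(root, 'N')
assert not search(root, 'Z')
```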
1. Compare the key with the element at root
Searching a key in the BST is O(h), where h is the height of the tree
Worst Case
The BST is a skewed binary search tree (every node except the leaf has exactly one child)
Best Case
This is possible if
Source: https://www.geeksforgeeks.org/difference-between-linear-and-non-linear-data-structures/
Week 8 Lecture 4
Class BSCCS2001
Materials
Module # 39
Type Lecture
Week # 8
Reliability
Non-volatile storage →
Includes secondary and tertiary storage, as well as battery backed-up main memory
Volatile
Main memory
Fast access (10's to 100's of nanoseconds (ns))
1 ns = 10⁻⁹ seconds
Generally too small (or too expensive) to store the entire DB
Capacities have gone up and per-byte costs have decreased steadily and rapidly (roughly a factor of 2 every 2-3 years)
Volatile
Contents of the main memory are usually lost if a power failure or system crash occurs
Flash memory
Data can be written at a location only once, but the location can be erased and written to again
Widely used in embedded devices such as digital cameras, phones and USB keys
Data must be moved from the disk to main memory for access, and written back for storage — much slower access
than the main memory
Direct-access
Possible to read data on the disk in any order, unlike magnetic tape
Much larger capacity and much lower cost/byte than the main memory/flash memory
Growing constantly and rapidly with technology improvements (factor of 2 to 3 every 2 years)
Write-once, Read-many (WORM) optical disks used for archival storage (CD-R, DVD-R, DVD+R)
Multiple write versions also available (CD-RW, DVD-RW, DVD+RW and DVD-RAM)
Juke-box systems, with large number of removable disks, a few drives and a mechanism for automatic
loading/unloading of disks available for storing large volumes of data
Sequential-access
Very high capacity (40-300TB tapes available)
Tape can be removed from the drive; storage costs are much cheaper than disk, but drives are expensive
Hundreds of terabytes (TB) (1TB = 10¹² bytes) to even multiple petabytes (PB) (1PB = 10¹⁵ bytes)
Storage hierarchy
Magnetic Disk
Magnetic Disk: Mechanism
Read-write head
To read/write a sector
Head-disk assemblies
Magnetic Disk: Disk Controller, Subsystems and Interfaces
Disk Controller → Interfaces between the computer system and the disk drive hardware
Initiates actions moving the disk arm to the right track, reading or writing the data
Computes and attaches checksums to each sector to verify that data is read back correctly
Disk Subsystem
Disk Interface Standards Families → ATA, SATA, SCSI, SAS and several variants
Storage Area Networks (SAN) → Connects disks by a high-speed network to a number of servers
Network Attached Storage (NAS) → Provides a file system interface using networked file system protocol
Seek Time → Time to reposition the arm over the correct track
Avg. seek time is 1/2 the worst-case seek time (1/3 if all the tracks have the same number of sectors)
Rotational Latency → Time for the sector to be accessed to appear under the head
Data-transfer rate → The rate at which data can be retrieved from or stored to the disk
Multiple disks may share a controller, so rate that controller can handle is also important
Mean Time to Failure (MTTF) → Avg. time the disk is expected to run continuously without any failure
Typically 3 to 5 years
Probability of failure of new disks is quite low, corresponding to a theoretical MTTF of 500,000 to 1,200,000 hours
for a new disk
For example → an MTTF of 1,200,000 hours for a new disk means that given 1000 relatively new disks, on average one will fail every 1200 hours
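The arithmetic in this example can be checked directly:

```python
# 1000 disks, each rated at an MTTF of 1,200,000 hours, give an expected
# failure rate of one disk every 1,200,000 / 1000 = 1200 hours across the
# population, i.e. roughly one failure every 50 days.
mttf_hours = 1_200_000
num_disks = 1000
hours_per_failure = mttf_hours / num_disks
assert hours_per_failure == 1200
assert round(hours_per_failure / 24) == 50   # about 50 days
```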
Magnetic Tapes
Hold large volumes of data and provide high transfer rates
Tape formats
Some formats (Accelis) provide faster seek (10's of seconds) at cost of lower capacity
Used mainly for backup, for storage of infrequently used information and as an offline medium for transferring
information from one system to another
Cloud Storage
Cloud storage is purchased from a 3rd party cloud vendor who owns and operates the data storage capacity and
delivers it over the internet in a pay-as-you-go model
The cloud storage vendors manage capacity, security and durability to make data accessible to applications all around
the world
Applications access cloud storage through traditional storage protocols or directly via an API
Many vendors offer complementary services designed to help collect, manage, secure and analyze data at a massive
scale
Google Drive
Amazon Drive
Microsoft OneDrive
Evernote
Dropbox
and so on ...
Other storage
Optical Disks
Compact disk-read only memory (CD-ROM)
Seek time about 100 msec (the optical read head is heavier and slower)
Higher latency (3000 RPM) and lower data-transfer rates (3 - 6MB/s) compared to magnetic disks
DVD-10 and DVD-18 are double sided formats with capacities of 9.4GB and 17GB
Flash drives
Flash drives are often referred to as pen drives, thumb drives or jump drives
Considering how large and inexpensive they have become, they have also replaced CDs and DVDs for data
storage purposes
USB flash drives are removable and re-writable storage devices that, as the name suggests, require a USB port for connection and utilize non-volatile flash memory technology
The storage space in a USB drive is quite large, with sizes ranging from 128MB to 2TB
The USB standard a flash drive is built around determines a number of things about its potential performance, including the maximum transfer rate
Due to their relatively small size, SD cards are widely used in mobile electronics, cameras, smart devices, video game
consoles and more
Source: https://integralmemory.com/faq/what-are-differences-between-fat16-fat32-and-exfat-file-systems
Flash storage
NOR Flash vs NAND Flash
NAND Flash
Used widely for storage, since it is much cheaper than NOR Flash
Solid State Disks → Use multiple flash storage devices to provide higher transfer rate of 200MB/s or higher
Remapping of logical page addresses to physical page addresses avoids waiting for erase
After 100,000 to 1,000,000 erases, erase block becomes unreliable and cannot be used
Wear leveling
SSDs speed up computers significantly due to their low read-access time and fast throughput
An SSD stores data persistently even when no power is supplied
An SSD is much faster than an HDD, as it reads/writes data at a much higher rate of input/output operations per second
Future of Storage
DNA Digital Storage
Oooooooh, we going Cyberpunk
DNA digital data storage is the process of encoding and decoding binary data to and from synthesized strands of DNA
While DNA as a storage medium has enormous potential because of its high storage density, its practical use is
currently severely limited because of its high cost and very slow read and write times
Digital storage systems encode the text, photos, videos or any kind of information as a series of 0s and 1s
This same information can be encoded in DNA using the 4 nucleotides that make up the genetic code: A, T, G
and C
DNA has several other features that makes it desirable as a storage medium; it is extremely stable and is fairly easy
(but expensive) to synthesize and sequence
Also, because of its high density — each nucleotide, equivalent to up to 2 bits, occupies about 1 cubic nanometre — an exabyte (10¹⁸ bytes) of data stored as DNA could fit in the palm of our hands
DNA synthesis → A DNA synthesizer machine builds synthetic DNA strands matching the sequence of digital code
Quantum Memory
Quantum Memory is the quantum-mechanical version of ordinary computer memory; whereas ordinary memory stores information as binary states (represented by 1s and 0s), quantum memory stores quantum states
Quantum memory is essential for the development of many devices in quantum information processing applications
such as quantum network, quantum repeater, linear optical quantum computation or long-distance quantum
communication
Unlike the classical memory of everyday computers, the states stored in quantum memory can be in a quantum
superposition, giving much more practical flexibility in quantum algorithms than classical information storage
Week 8 Lecture 5
Class BSCCS2001
Materials
Module # 40
Type Lecture
Week # 8
The database is stored as a collection of files
A file is a sequence of records
A record is a sequence of fields
One approach → Assume the record size is fixed, and each file has records of one particular type only
This case is easiest to implement; variable-length records are considered later
Fixed-Length Records
Simple approach
Store record i starting from byte n ∗ (i − 1), where n is the size of each record
Record access is simple but records may cross blocks
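The offset arithmetic can be sketched as follows (record and block sizes are assumptions for illustration):

```python
# Record i (1-based), of size n bytes, starts at byte n * (i - 1),
# so record access is pure arithmetic.
def record_offset(i, n):
    return n * (i - 1)

RECORD_SIZE = 100            # assumed record size in bytes
BLOCK_SIZE = 4096            # assumed block size in bytes

assert record_offset(1, RECORD_SIZE) == 0
assert record_offset(5, RECORD_SIZE) == 400

# Record 41 occupies bytes 4000..4099, so it crosses the 4096-byte block boundary
start = record_offset(41, RECORD_SIZE)
assert start // BLOCK_SIZE != (start + RECORD_SIZE - 1) // BLOCK_SIZE
```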
Deletion of record i → Alternatives:
Move record n to i
Do not move records, but link all the free records on a free list
Free Lists
Store the address of the first deleted record in the file header
Use this first record to store the address of the second deleted record, and so on ...
Consider these stored addresses as pointers since they point to the location of the record
More space efficient representation → Re-use space for normal attributes of free records to store pointers (No
pointers stored in in-use records)
Variable-Length Records
Variable-length records arise in DB systems in several ways:
Record types that allow variable lengths for one or more fields such as strings (varchar)
Record types that allow repeating fields (used in some older data models)
Variable length attributes represented by fixed size (offset, length) with actual data stored after all fixed length
attributes
Slotted page header contains:
The number of record entries
The end of free space in the block
The location and size of each record
Records can be moved around within a page to keep them contiguous with no empty space between them; entry in
the header must be updated
Pointers should not point directly to record — instead they should point to the entry for the record in the header
Sequential → Store records in sequential order, based on the value of the search key of each record
Hashing → A hash function computed on some attributes of each record; the result specifies in which block of the file
the record should be placed
In a multitable clustering file organization records of several different relations can be stored in the same file
Deletion → use pointer chains
Need to re-organize the file from time to time to restore sequential order
Multitable Clustering File Organization
Store several relations in one file using a multitable clustering file organization
Good for queries involving department ⋈ instructor and for queries involving one single department and its
instructors
Data dictionary → Stores metadata (data about data), such as:
Names of relations
Integrity constraints
Information about indices
Storage Access
A database file is partitioned into fixed-length storage units called blocks
Database system seeks to minimize the number of block transfers between the disk and the memory
We can reduce the number of disk accesses by keeping as many blocks as possible in the main memory
Buffer → Portion of the main memory available to store copies of disk blocks
Buffer Manager → Subsystem responsible for allocating buffer space in the main memory
Buffer Manager
Programs call on the buffer manager when they need a block from the disk
If the block is already in the buffer, buffer manager returns the address of the block in the main memory
If the block is not in the buffer, the buffer manager allocates space in the buffer for the block:
Replacing (throwing out) some other block, if required, to make space for the new block
Replaced block written back to disk only if it was modified since the most recent time that it was written to
/ fetched from the disk
Reads the block from the disk to the buffer, and returns the address of the block in the main memory to the
requester
Idea behind LRU — Use past pattern of block references as a predictor of future references
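A toy LRU buffer pool illustrating this idea (class and function names are illustrative, not a real DBMS API): a hit moves the block to the most-recently-used end, and a miss on a full buffer evicts the least recently used block first.

```python
from collections import OrderedDict

class LRUBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()               # block_id -> contents

    def get(self, block_id, read_from_disk):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)     # hit: mark as most recent
            return self.blocks[block_id]
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)       # miss on full buffer: evict LRU
        self.blocks[block_id] = read_from_disk(block_id)
        return self.blocks[block_id]

def fetch(block_id):                              # stand-in for a disk read
    return "data-" + str(block_id)

buf = LRUBuffer(capacity=2)
buf.get(1, fetch)
buf.get(2, fetch)
buf.get(1, fetch)        # touch block 1, so block 2 is now least recently used
buf.get(3, fetch)        # buffer full: block 2 is evicted
assert list(buf.blocks) == [1, 3]
```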
Queries have well-defined access patterns (such as sequential scans) and a database system can use the
information in a user's query to predict future references
LRU may be a bad strategy for certain access patterns involving repeated scans of data
For example → When computing the join of 2 relations r and s by a nested loop
Mixed strategy with hints on replacement strategy provided by the query optimizer is preferable
Pinned block → Memory block that is not allowed to be written back to the disk
Toss-immediate strategy → Frees the space occupied by a block as soon as the final tuple of that block has been
processed
Most recently used (MRU) strategy → System must pin the block currently being processed
After the final tuple of that block has been processed, the block is unpinned and it becomes the most recently
used block
Buffer manager can use statistical information regarding the probability that a request will reference a particular
relation
Buffer managers also support forced output of blocks for the purpose of recovery