Lectures 1-10 (7 Files Merged)
Lectures 1-10 (7 Files Merged)
Lectures 1-10 (7 Files Merged)
Lec. 1
Dr. Emad Nabil
Spring 2018
Grading Scheme
• Course work and MT (40%)
• Course work include (practical assignments, project)
• Cheating will be graded by –ve marks
• No late submission
• A final exam (60%)
“Python has been an important part of Google since the beginning, and remains so as the system grows and
evolves. Today dozens of Google engineers use Python, and we're looking for more people with skills in this
language”
Peter Norvig, director of search quality at Google
Purpose of the course
► Objective of Data Structures was to teach ways of efficiently organizing and
manipulating data in main memory.
► In this course, you will learn equivalent techniques for organization and
manipulation of data in secondary storage.
► In the first part of the course, you will learn about "low level" aspects of file
manipulation (basic file operations, secondary storage devices)
► In the second part of the course, you will learn the most important high-level
file structure tools (indexing, co-sequential processing, B trees, Hashing, etc…).
► You will apply these concepts in the design of C ++ programs for solving
various file management problems
Course outline
1. Fundamental File Processing Operations.
2. Sequential and direct access.
3. Secondary Storage, physical storage devices: disks, tapes and CDROM.
4. Reclaiming space in files.
5. Internal sorting, binary searching, keysorting.
6. Cosequential processing and external sorting
7. Indexing
8. Multilevel indexing and B trees
9. Hashing
Data structure VS File Structure
Both involve:
Representation of Data
+
Operations for accessing data
►Difference:
– Data Structures deal with data in main memory
– File Structures deal with data in secondary storage device (File).
Computer Architecture
Memory Hierarchy
Memory Hierarchy
►On systems with 32-bit addressing, only 232 bytes (4 Giga) can be directly referenced in main
memory.
►On systems with 64-bit addressing, 264 bytes can be directly referenced in main memory.
► This requires storage devices that retain information when the computer is restarted.
How Fast
Typical times for getting info (Access)
– Main memory: ~120 nanoseconds = 120×10−9
– Magnetic Disks: ~30 milliseconds = 30×10−3
►Secondary Storage
►Examples: anything that you can store in a disk hard drive, tape,
optical media, and any other medium which doesn’t lose the
information when the power is turned off.
Where do file structure fit in computer science?
Physical file and logical files
► Physical file:
* physically exists on secondary storage;
* known by the operating system;
* appears in its file directory
► Logical file,
* what your program actually uses,
* a ‘pipe’ though which information can be extracted, or sent.
►Operating system:
* get instruction from program or command line;
* link logical file with physical file or device
Physical file and logical files
Physical file and logical files
Opening file
Opening files
File contents
Demo
File Organization and Processing
Lec. 3
Dr. Emad Nabil
Spring 2018
Revision (read/write functions)
File streams include two member functions specifically designed to read and write
binary data sequentially: write and read. The first one (write) is a member function
of ostream (inherited by ofstream). And read is a member function of istream
(inherited by ifstream). Objects of class fstream have both. Their prototypes are:
The size parameter is an integer value that specifies the number of characters to
be read or written from/to the memory block.
Simple file
Output to File test.txt
--------------------------------------------------------
Jones This is how the file will
Smith look like
Willis
Davis
File seek operation
• A fstream has 2 file pointers:
• get pointer (for input)
• put pointer (for output)
• file1.seekg ( byte_offset, origin); //moves get pointer
• file1.seekp ( byte_offset, origin); //moves put pointer
Ex:
file1.seekg ( 373, ios::beg);
meaning moves get pointer 373 bytes from the beginning of file
File seek operation
Statement How it Affects the Read/Write Position
File.seekp(32L, ios::beg); Sets the write position to the 33rd byte (byte 32) from the beginning of the file.
file.seekp(-10L, ios::end); Sets the write position to the 11th byte (byte 10) from the end of the file.
file.seekp(120L, ios::cur); Sets the write position to the 121st byte (byte 120) from the current position.
file.seekg(2L, ios::beg); Sets the read position to the 3rd byte (byte 2) from the beginning of the file.
file.seekg(-100L, ios::end); Sets the read position to the 101st byte (byte 100) from the end of the file.
file.seekg(40L, ios::cur); Sets the read position to the 41st byte (byte 40) from the current position.
file.seekg(0L, ios::end); Sets the read position to the end of the file.
#include <iostream>
Ex-0
#include <fstream>
using namespace std;
int main( )
{ File contents:
fstream file("d:\\test.txt", ios::in);
char ch;
abcdefghijk
f
file.seekg(5L, ios::beg);
file.get(ch);
cout << "Byte 5 from beginning: " << ch << endl;
file.seekg(-5L, ios::end); g
file.get(ch);
cout << "Byte 10 from end: " << ch << endl;
file.seekg(3L, ios::cur); k
file.get(ch);
cout << "Byte 3 from current: " << ch << endl;
file.close();
}
The tellp and tellg Member Functions
• tellp returns a long integer that is the current byte number of the
file’s write position.
• tellg returns a long integer that is the current byte number of the
file’s read position.
Example: Write a C++ program, with no loops, to copy the contents of the file
d:\\source.txt to the file d:\\destination.txt, assume that destination file is empty, don’t
use any loops.
int main ()
{
ifstream infile ("source.txt",ios::binary); ex1
ofstream outfile ("destination.txt",ios::binary);
infile.seekg (0,infile.end);
long size = infile.tellg();
std::istream::getline
istream& getline (char* s, streamsize n );
istream& getline (char* s, streamsize n, char delim );
Examples:
char data[100];
cin.getline(data, 100); the default delimiter is ‘\n’
cin.getline(data ,100, ‘#’);
will continue reading till find ‘#’,
ifstream infile; or reach 100 char,
infile.getline(data, 100); \n will be read as a regular char.
infile.getline(data, 100 ,‘#’);
Function <string>
std::getline (string)
istream& getline (istream& is, string& str, char delim);
istream& getline (istream& is, string& str);
Get line from stream into string, Extracts characters from is and stores them into str until the delimitation
character delim is found (or the newline character, '\n', for (2)).
text
Length
void simple_read_example()
void simple_write_example()
{
{ ex3
ifstream infile;
ofstream outfile;
string str;
string str;
//default ios::in
//default ==> ios::out | ios::trunc
infile.open("d:\\simple.txt");
outfile.open("d:\\simple.txt");
size_t size;
size_t size; //size_t: Aliasof Unsigned integral type
infile.read((char*)&size, sizeof(size));
getline(cin, str);
str.resize(size);
size=str.size();
infile.read(&str[0], size);
outfile.write((char*)&size,sizeof(size));
cout<<str<<endl;
outfile.write(&str[0],size);
getline(cin, str);
infile.read((char*)&size, sizeof(size));
size=str.size();
str.resize(size);
outfile.write((char*)&size,sizeof(size));
infile.read(&str[0], size);
outfile.write((&str[0], size);
cout<<str<<endl;
outfile.close();
infile.close();
}
}
int main(){
simple_write_example(); str.resize(n): (shorten / extend) Resizes the string to a length of n characters.
simple_read_example();
return 0; size_t: Unsigned integral type, Alias of one of the fundamental unsigned integer types.
} is the type returned by the sizeof
void simple_read_and_write_example()
{
ex4
fstream file;
string line;
file.open("d:\\simple.txt", ios::in | ios::out|ios::trunc);
int main(){
simple_read_and_write_example();
return 0;
}
File Organization
Lec4
Dr Emad Nabil
Spring 2019
Output
--------------------
Mixing cin and (get or getline) Enter a number: 100
#include <iostream> 100
using namespace std;
Enter a character:
int main()
{ Thank You!
char ch;
int number;
Solution cin.ignore();
Mixing cin and (get or getline)
#include <iostream>
using namespace std;
int main()
{ Output
char ch; --------------------
int number; Enter a number: 100
cout << "Enter a number: "; 100
cin >> number; Enter a character: A
cout << number; A
cin.ignore(); // Skip the newline character Thank You!
cout << “\nEnter a character: ";
ch = cin.get();
cout << ch;
while (!cin.eof())
{
getline(cin, str); //Default delimiter is the newline character.
v1.push_back(str);
}
cout << "The following input was stored with newline delimiter:" << endl;
char data[100];
cin.getline(data,100, '#'); will continue
cout<<data;
reading till find
‘#’, or reach 100
char, \n will be
read as a regular
} char.
Get and put functions ex7
#include <iostream>
#include <fstream>
using namespace std;
void main(void)
{
fstream dataFile("d:\\sentence.txt", ios::out);
char ch;
cout << "Type a sentence and be sure to end it with a full stop .\n";
while (1)
{ Get and put recognize
cin.get(ch); the space and the
dataFile.put(ch);
if (ch == '.') new line
break;
}
dataFile.close();
system("pause");
}
File types
A file can be treated as
• a stream of bytes , ex: image file , sound files
• a collection of records with fields ..
Filed and Record organization
• Field: a data value, smallest unit of data with logical meaning
► Record: A group of fields that forms a logical unit
►Key: a subset of the fields in a record used to uniquely identify the record
Field Field
key
87358CARROLL ALICE IN WONDERLAND Record
03818FOLK FILE STRUCTURES Each line of the file is a record.
79733KNUTH THE ART OF COMPUTER PROGRAMMING ►Fields in each record:
– ISBN Number,
86683KNUTH SURREAL NUMBERS – Author Name,
– Book Title
18395TOLKIEN THE HOBITT
Primary and secondary keys
►Primary Key : A key that uniquely identifies a record.
►Secondary Key: Other keys that may be used for search
►Note that
• In general not every field is a key
• Keys correspond to fields, or combination of fields, that may be used
in a search
Method of field organization
1. Fixed length
2. Begin each field with its Length indicator
3. Delimiters to separate fields
4. “keyword=value” identifies each field and its content
Methods of field organization
1- Fixed length fields
• Like in our file of books (field lengths are 5,7, and 25).
spaces
• 87358CARROLL ALICE IN WONDERLAND---------
• 03818FOLK - - - FILE STRUCTURES------------------
• 86683KNUTH- - SURREAL NUMBERS---------------
• 18395TOLKIEN THE HOBITT--------------------------
Methods of field organization
2-Length indicator
Filed length
Record
• 058735807CARROLL19ALICE IN WONDERLAND
• 050381804FOLK15FILE STRUCTURES
• 058668305KNUTH15SURREAL NUMBERS
• 051839507TOLKIEN10THE HOBITT
Methods of field organization
3-using delimiters
• Add delimiter at the end of each field
• 87358|CARROLL|ALICE IN WONDERLAND|
• 03818|FOLK|FILE STRUCTURES|
• 86683|KNUTH|SURREAL NUMBERS|
• 18395|TOLKIEN|THE HOBITT|
Methods of field organization
4-using (key=value)
• ISBN=87358|AU=CARROLL|TI=ALICE IN WONDERLAND|
• ISBN=03818|AU=FOLK|TI=FILE STRUCTURES|
• ISBN=86683|AU=KNUTH|TI=SURREAL NUMBERS|
• ISBN=18395|AU=TOLKIEN|TI=THE HOBITT|
Field structures: advantages and disadvantages
Type Advantages Disadvantages
Fixed Easy to Read/Store Waste space with padding
• Delimited fields
33
Record Organization
4- Using Delimiters to separate records
* The common delimiter in the end of line, as it make the file readable
• 5-Index
Sequential access vs Direct Access
Sequential Search
– Look at records sequentially until matching record is found.
– Time is in O(n) for n records.
– Appropriate for Pattern matching, file with few records
Direct Access
– Being able to seek directly to the beginning of the record.
– Time is in O(1) for n records.
– Possible when we know the Relative Record Number (RRN):
– First record has RRN 0, the next has RRN 1, etc.
Direct access using RRN
Requires records of fixed length.
– RRN=30 (31st record)
– Record length = 101 bytes
– Byte offset = 30 × 101 = 3030
struct Inventory
{
int qty;
float price;
char desc[31];
};
void AddRecord(fstream & out)
{
Inventory record;
cin.ignore();
cout << "Enter Desc: "; cin.getline(record.desc, 31);
out.write((char*)&record, sizeof(record));
}
void readall(fstream & in)
{
Inventory record;
in.seekg(0, ios::beg);
in.read((char*)&record, sizeof(record));
do{
cout << "quantity: "<<record.qty << endl;
cout << "price: " << record.price << endl;
cout << "desc: " << record.desc << endl;
in.read((char*)&record, sizeof(record));
} while (!in.eof());
}
void UpdateRecord(fstream & io, int rrn)
{
Inventory record;
io.seekg(rrn*sizeof(record), ios::beg);
io.read((char *)&record, sizeof(record));
cout << "the data at RRR:" << rrn << " is \n";
cout << "quantity: " << record.qty << endl;
cout << "price: " << record.price << endl;
cout << "desc: " << record.desc << endl;
io.seekp(rrn*sizeof(record), ios::beg);
io.write((char*)&record, sizeof(record));
}
int main()
{
string path = "d:\\t1.txt";
fstream f(path, ios::out|ios::in | ios::trunc | ios::binary);
AddRecord(f);
AddRecord(f);
AddRecord(f);
readall(f);
UpdateRecord(f, 1);
readall(f);
f.close();
system("pause");
return 0;
}
Length indicator based record,
delimited fields
Length indicator based record, delimited fields
#include<iostream>
#include<fstream>
#include<strstream>
using namespace std; Buffer size
const int maxBuffersize = 200;
struct person
{
char last[11];//max 10 chars Data Record
char first[11]; //max 10 chars
char address[16]; //max 15 chars
char city[16]; //max 15 chars
char state[3]; //max two chars
char zip[10]; //max 9 chars
};
istream& operator>>(istream & in, person &p) Person c;
{ Cin>>c;
istrstream strbuf(buffer);
}
int main()
{
ofstream f;
f.open("d:\\t1.txt");
ifstream m;
m.open("d:\\t1.txt");
readperson(m, c);
m.close();
return 0;
}
input output
Enter last name: Hany Hany file
Enter first name: Sami Sami
Enter address: Cairo Cairo
Enter City: Nasr City Nasr City
The file in
Enter state: OK OK binary mode
Enter zip code: 55664 55664
Assignment 1
Write a c++ program to store the data of 10 books.
Book attributes are:
ISBN (5 chars), TITLE, AUTHOR NAME, PRICE, YEAR, NUM OF PAGES
- Store 10 books using delimited records, variable length fields
- assume any other information you need.
- Operations :
- Add book, delete book (by ISBN), update book (search by ISBN), print book(by title), print all
(print all books)
• Note that the space for the record is not released, but the program that
manipulates the file must include logic that checks if record is deleted or not.
• How to use the space of deleted records for storing records that are added later?
• When a record is deleted, it is marked as deleted and inserted into the AVAIL LIST.
2. Deleting Fixed-Length Records and Reclaiming Space Dynamically
– Use an AVAIL LIST as before, but take care of the variable-length difficulties
– RRN can not be used, but exact byte offset must be used
Offset of
next
deleted
record Delete record with name Mazen
hint :-1 means no other deleted records
3. Deleting Variable-Length Records and Reclaiming Space Dynamically
Sol#2 ►Return the unused space as a new available record to AVAIL LIST
– type of fragmentation: external
– suitable placement strategy: worst-fit
Ways of compacting external fragmentation
►compacting the Holes
– if two records in AVAIL LIST are adjacent, combine them into a
larger record
– Serial Devices
• Magnetic Tapes
Magnetic disks
Magnetic disks
Magnetic disks
Magnetic disks
►Magnetic disks support
direct access to a desired
location
– Tracks
– Platters
– Cylinder
– Sectors
– Disk heads
– Disk Controller
– Seek Time
– Rotational delay
Magnetic disks
►The platters spin (7200 rpm)
Disk characteristics
– Number of bytes per sector = 512 – Number of sectors per track = 63
– Number of tracks per cylinder = 16 – Number of cylinders = 4092
►The file manager is the part of the operating system responsible for
managing files
►The file manager maps the logical parts of the file into their physical
location
cluster
How to chose cluster size?
►Some OS allow the system administrator to choose the cluster size.
Binary search. Given value and sorted array a[], find index i
such that a[i] = value, or report that no such index exists.
6 13 14 25 33 43 51 53 64 72 84 93 95 96 97
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
lo hi
Binary Search
Binary search. Given value and sorted array a[], find index i
such that a[i] = value, or report that no such index exists.
6 13 14 25 33 43 51 53 64 72 84 93 95 96 97
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
lo mid hi
Binary Search
Binary search. Given value and sorted array a[], find index i
such that a[i] = value, or report that no such index exists.
6 13 14 25 33 43 51 53 64 72 84 93 95 96 97
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
lo hi
Binary Search
Binary search. Given value and sorted array a[], find index i
such that a[i] = value, or report that no such index exists.
6 13 14 25 33 43 51 53 64 72 84 93 95 96 97
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
lo mid hi
Binary Search
Binary search. Given value and sorted array a[], find index i
such that a[i] = value, or report that no such index exists.
6 13 14 25 33 43 51 53 64 72 84 93 95 96 97
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
lo hi
Binary Search
Binary search. Given value and sorted array a[], find index i
such that a[i] = value, or report that no such index exists.
6 13 14 25 33 43 51 53 64 72 84 93 95 96 97
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
lo mid hi
Binary Search
Binary search. Given value and sorted array a[], find index i
such that a[i] = value, or report that no such index exists.
6 13 14 25 33 43 51 53 64 72 84 93 95 96 97
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
lo
hi
Binary Search
Binary search. Given value and sorted array a[], find index i
such that a[i] = value, or report that no such index exists.
6 13 14 25 33 43 51 53 64 72 84 93 95 96 97
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
lo
hi
mid
Binary Search
Binary search. Given value and sorted array a[], find index i
such that a[i] = value, or report that no such index exists.
6 13 14 25 33 43 51 53 64 72 84 93 95 96 97
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
lo
hi
mid
Binary search
int binarySearch(int arr[], int value, int left, int right) {
#include<fstream>
#include<iostream>
using namespace std;
struct Student
{
char ID[5];
char Name[10];
char Address[10];
};
Binary Search Algorithm cont.
out.write((char*)&record, sizeof(record));
}
Binary Search Algorithm cont.
bool BinarySearch(fstream & file, Student &rec, char* key){
int low = 0, high = getFileLength(file) / sizeof(rec) - 1;
int guess;
while (low <= high)
{
guess = (high + low) / 2;
readRecord (file, rec, guess);
if (Equal(rec.ID, key)) //rec.ID == key
return true;
if (Greater(rec.ID, key)) //rec.ID >key
high = guess - 1;
else
low = guess + 1;
}
return false;
}
Binary Search Algorithm cont.
int main()
{
string path = "d:\\test.txt";
fstream file(path);// by default ==> ios::out|ios::in | ios::trunc
Student st;
for (int i = 0; i < 3; i++)
AddRecord(file);
►Idea
– Bring the keys to main memory plus corresponding RRN
– Do internal sorting of keys
– Rewrite the file in sorted order
Before sorting
Before sorting
3 ►Go through each record in random order (seek O(n) ) to write the file sorted in a new file.
After rewriting the sorted file
leave file unchanged, and search only in the <key, RRN> file
Self learning sites
www.quora.com
Quora - The best answer to any question
http://ocw.mit.edu/courses/index.htm
massive open online course (MOOC) provider
https://www.edx.org/
Primary Key index
Lec# 9
Presented by Dr. Emad Nabil
Spring 2018
Introduction to Indexing
►Simple indexes use simple arrays.
►An index lets us impose order on a file without rearranging the file.
indexes
Uses other fields
►If the program detects the is index is out-of-date it calls a procedure that
reconstruct the index from the data file
►This consists of
- inserting to the data file
- inserting a new record in the index.
Delete from :
– Shift all records with keys larger than the key of the deleted record to
the previous position (in main memory),O(n) or
Two solutions
– Identification Number
– Title
– Composer or Composers
– Artist or Artists
– Label (publisher)
A sample recording file contents
Secondary index Primary index
Bound to
primary key,
not offset
Search Insert
Delete Update
Secondary Key index
Lec# 10
Presented by Dr. Emad Nabil
Spring 2019
Introduction to Indexing
Uses primary key
Primary
index
Only one
indexes
Uses other fields
– Identification Number
– Title
– Composer or Composers
– Artist or Artists
– Label (publisher)
A sample recording file contents
Secondary index Primary index
Bound to
primary key,
not offset
Search Addition
Delete Update
1-Record Addition
►When adding a record, an entry must also
be added to the secondary key index.
►When accessing the file through a secondary key, the primary index file will be
checked and a deleted record can be identified.
►This results in a lot of saving when there are many secondary keys
►The deleted record still occupy space in the secondary key indexes.
►If a lot of deletions occur, we can periodically cleanup these deleted records
also from the secondary key indexes
3- Record updating
Full
records
Retrieving data using combination of secondary keys
►Use the matched list and primary key index to retrieve the two
records from the file.
Secondary keys indexes
Disadvantage :
Any
alternative
structure ?
Improving the secondary key structure: arrays of references
Ordered
by PK
Improving the secondary key structure: arrays of references
Advantages :
No redundancy !!
Any
alternative
structure ?
Improving the secondary key structure: Inverted list
Improving the secondary key structure: Inverted list
Organize the secondary key index as an index containing one entry for each key
and a pointer to a linked list of references
Improving the secondary key structure: Inverted list
Organize the secondary key index as an index containing one entry for each key
and a pointer to a linked list of references
-1
delete Mark
One # as free
binding
To delete COREA
1. Check label ID list file, there is only binding,
2. we can simply delete COREA by placing -1 in the secondary index file
3. marking the cell with index 2 as free in the label ID list file for future reuse
4. Periodically clean up the secondary index file.
5. Incase of new insertions in the label ID list file, check available free slot
(marked with #) before insertion in the end.
Improving the secondary key structure: Inverted list
Organize the secondary key index as an index containing one entry for each key
and a pointer to a linked list of references
delete 8 Delete
first one
Mark
4 # as free
bindings
To delete BEETHOVEN
1. Check label ID list file, there is 4 bindings, user should select which one to delete or select to
delete all.
2. Suppose that user selects to delete the first binding .. Mark this cell as free using ex: (#) marker
3. Change the refrence in the sec. index file to be 8.
4. Periodically clean up the sec. index file
5. Incase of new insertions in the label ID list file, check available free slot (marked with #) before
insertion in the end.
Advantages of inverted list
►Rearrangement of the secondary key index file is
only done when a new composer’s name is added
or an existing composer’s name is changed.
► Deleting all records by a composer can be done
by placing a “-1” in the reference field in the
secondary index file. Periodically cleanup label ID list file
► Deleting or adding records for a composer only
affects the LABEL ID List File.
►To improve this, keep the LABEL ID List File in main memory
Selective Indexes
►Selective Index: Index on a subset of records
►Selective index contains only some part of entire index
– provide a selective view
– useful when contents of a file fall into several categories
e.g. 20 < Age < 30 and $1000 < Salary
Binding
►In our example of indexes, when does the binding of the index to the
physical location of the record happen?
– For the primary index, binding is at the time the file is constructed.
– For the secondary index, it is at the time the secondary index is used.
Q: Why don’t we bind secondary index by physical location directly
instead of primary index?
Advantages of postponing binding
►We need small amount of reorganization when records are added or
deleted or updated.
►It is safer approach: important changes are done in one place rather
than in many places.
Disadvantages of postponing binding
►It results in slower access times
(Binary search in secondary index
+
Binary search in primary index)
Tight Binding vs. Bind-at-retrieval
1►Use Tight Binding
– When data file is nearly static
(little or no adding, deleting or updating of records)
– When rapid retrieval performance is essential.
Example: Data stored in CD-ROM should use tight binding
2►Use Bind-at-retrieval
– When record additions, deletions and updates occur many times
Simple Primary Key
Implementation
Lec11
Spting 2019
Dr Emad Nabil
Code specification
• In this lecture An implementation for primary index is presented.
struct student
struct IndexRecord
{
{
Data char ID[6];
char PK[6];
Index
char Name[51]; record
record int offset;
char Address[51];
};
};
• Data record structure is : delimited fields, length indicator records
• Deletion=> just mark the beginning of deleted record by *, then
delete the corresponding record in Index.
Ex: 20*9|Ali Ali|Giza,123.Str.|
• If the index is not up-to-date reconstruct index from the data file
• When the program starts, it should load the index into memory
student s;
cout << "\nEnter ID: "; cin.getline(s.ID, 5);
cout << "\nEnter Name: "; cin.getline(s.Name, 50); 1. Change state of index
cout << "\nEnter Address: "; cin.getline(s.Address, 50); 2. Read student data
3. Write student to datafile
char buffer[200]; 4. Add an entry to the index
strcpy_s(buffer, s.ID); strcat_s(buffer, "|"); 5. Sort index
strcat_s(buffer, s.Name); strcat_s(buffer, "|");
strcat_s(buffer, s.Address); strcat_s(buffer, "|");
short len = strlen(buffer);
sortIndex(); fout.close();
}
void SimpleIndex::deleteStudent()
{
ChangeIndexState(true);
char PK[6]; cout << "\nenter PK to delete \n"; cin.getline(PK,5);
1. Change state of index
fstream fout(dataFilePath, ios::binary | ios::out | ios::in);
2. Read student ID to be deleted
int address = IndexBinarySearch(PK); 3. Offset = Binary search on index
if (address == -1) 4. Delete student from the datafile
{ cout << "\nStudent not found"; return; } 5. Update index
next--;
fout.close();
}
void SimpleIndex::printStudent()
{
char PK[6]; cout << "\n enter PK you want to display \n"; cin.getline(PK,5);
cout << endl <<"ID: "<<s.ID <<" Name: " <<s.Name <<" Address: " <<s.Address;
fin.close(); delete []buffer;
}
void SimpleIndex::saveIndex()
{
ofstream fout(indexFilePath, ios::trunc);
for (int i = 0; i<next; i++)
{
IndexRecord temp = index[i];
fout.write((char*)&temp, sizeof(temp));
}
fout.close();
ChangeIndexState(false);// index is up to date
}
void SimpleIndex::loadIndex()
{
if(!exists(dataFilePath)) If datafile not exist create it and assign zero to
{ NotUpTodate
next = 0; ofstream fout;
fout.open(dataFilePath, ios::app | ios::out | ios::binary);
fout << 0; fout.close();
If datafile exists and NotUpTodate is false
} Load the index from the indexfile
else if (!getIndexState())
{
ifstream fin(indexFilePath); next = 0;
while(true)
{
IndexRecord temp;
fin.read((char*)&temp, sizeof(temp)); if (fin.eof()) break;
index[next] = temp; next++;
}
fin.close();
If datafile exists and NotUpTodate is true
}
else ReconstructIndex();
}
void SimpleIndex::ReconstructIndex()
{
ifstream fin(dataFilePath); next = 0;
fin.seekg(headerSize, ios::beg); //skip header
while(true)
{
IndexRecord temp;
int offset = fin.tellg();
short len; fin.read((char*)&len, sizeof(len)); if (fin.eof()) break;
char *buffer = new char [len];
fin.read(buffer, len);if (buffer[0] == '*') continue;
istringstream strbuf(buffer);
strbuf.getline(temp.PK, 5, '|');temp.offset = offset;
index[next] = temp;
next++;
}
fin.close(); delete [] buffer;
sortIndex();
}
void SimpleIndex::ChangeIndexState(bool state)
{
fstream fout(dataFilePath, ios::out| ios::in);
fout.seekp(0, ios::beg);
fout << state;
fout.close();
}
bool SimpleIndex::getIndexState()
{
bool state;
ifstream fout(dataFilePath);
fout.seekg(0, ios::beg);
fout >>state;
fout.close();
return state;
}
void main()
{
SimpleIndex i;
i.loadIndex();
i.deleteStudent();
i.saveIndex();
system("pause");
}
Execution snapshots
Reading from console
Enter ID: 100 enter PK you want to display: 97
Enter Name: Ahm Ahm ID: 97 Name: Moh Moh Address: enter PK to delete: 99
Enter Address: Giza,123.Str. Giza,123.Str. enter PK to delete: 98
Enter ID: 99 enter PK you want to display: 98 enter PK you want to display :98
Enter Name: Ali Ali ID: 98 Name: Omar Omar Address: this student not exists
Enter Address: Giza,123.Str. Giza,123.Str.
enter PK you want to display: 99
Enter ID: 98 enter PK you want to display: 101 this student not exists
Enter Name: Omar Omar ID: 101 Name: Hany Hany Address:
Enter Address: Giza,123.Str. Giza,123.Str. enter PK you want to display: 101
ID: 101 Name: Hany Hany Address: Giza,123.Str.
Enter ID: 101 enter PK you want to display: 102
Enter Name: Hany Hany this student not exists enter PK you want to display: 100
Enter Address: Giza,123.Str. ID: 100 Name: Ahm Ahm Address: Giza,123.Str.
enter PK you want to display: 100
Enter ID: 97 ID: 100 Name: Ahm Ahm Address: enter PK you want to display: 97
Enter Name: Moh Moh Giza,123.Str. ID: 97 Name: Moh Moh Address: Giza,123.Str.
Enter Address: Giza,123.Str.
Data File
Index File
First record
Offset (int = 4
bytes) Offset (115) Dec. =
(73) Hexa
PK[6] =6 char + 2 padding
How to calculate size of a struct ?
struct X
{
char a[3];
short int b;
long int c;
char d[3];
};
void main()
{
X i;
cout << sizeof(i);
}
16
Padding rules
The following typical alignments are valid for compilers from Microsoft (Visual C++), Borland/CodeGear (C++Builder), Digital Mars
(DMC), and GNU (GCC) when compiling for 32-bit x86:
•A char (one byte) will be 1-byte aligned.
•A short (two bytes) will be 2-byte aligned.
•An int (four bytes) will be 4-byte aligned.
•A long (four bytes) will be 4-byte aligned.
•A float (four bytes) will be 4-byte aligned.
•A double (eight bytes) will be 8-byte aligned on Windows and 4-byte aligned on Linux (8-byte with -malign-double compile time option).
•A long long (eight bytes) will be 4-byte aligned.
•A long double (ten bytes with C++Builder and DMC, eight bytes with Visual C++, twelve bytes with GCC) will be 8-byte aligned with C++Builder, 2-byte aligned with DMC, 8-byte aligned with
Visual C++, and 4-byte aligned with GCC.
•Any pointer (four bytes) will be 4-byte aligned. (e.g.: char*, int*)
The only notable differences in alignment for an 64-bit system when compared to a 32-bit system are:
•A long (eight bytes) will be 8-byte aligned.
•A double (eight bytes) will be 8-byte aligned.
•A long long (eight bytes) will be 8-byte aligned.
•A long double (eight bytes with Visual C++, sixteen bytes with GCC) will be 8-byte aligned with Visual C++ and 16-byte aligned with GCC.
•Any pointer (eight bytes) will be 8-byte aligned.
Rule: the structure should be a multiple of the largest alignment of any structure member
struct X {
char a[3];
Size 16
short int b;
4-bytes aligned
long int c;
char d[3];
};
After compilation the data structure will be supplemented with padding bytes to ensure a proper
alignment for each of its members
struct X {
char a[3]; char p1[1];
short int b; char p2[2];
long int c;
char d[3]; char p3[1];
};
struct ex { Size 32
Size 8 //the size of the
struct ex string s1;
4-bytes aligned class instance and
{ }; its data members
float x;
char n[1]; struct X
}; Size 3
{
1-byte aligned
char c[3];
};
• Introduction to Hashing
• Hash functions
• Distribution of records among addresses
• synonyms and collisions
• Collision resolution by progressive overflow or linear probing
Motivation
• Hashing is a useful searching technique which can be used for implementing
indexes.
• Below we show how the search time for Hashing compares to the one for
other methods
• using binary search O(log2 N)
• Hashing O(1)
What is Hashing ?
• The idea is to discover the location of a key by simply examining the
key.
• For that we need to design a hash function.
key f address
• An address space is chosen before hand.
• For example we may decide the file will have 1000 available addresses.
• LOWELL, LOCK, OLIVER, and any word with first two letters L and O will be
mapped to the same address
h(LOWELL) = h(LOCK) = h(OLIVER) = 4
• These keys are called synonyms
• The address "4" is said to be the home address of any of these keys.
2. Use extra memory, i.e. increase the size of the address space.
ex: reserve 5000 available addresses rather than 1000
8
1. Division Method
2. Multiplication Method
3. Extraction Method
4. Mid-Square Hashing
5. Folding Technique
6. Rotation
7. Universal Hashing
9
One of the required features of the hash
function is that the resultant index must
be within the table index range
In digit extraction, few digits are selected and extracted from the
key which are used as the address
13
The mid-square hashing suggests to take square of the key and extract the middle digits of the
squared key as address
The difficulty is when the key is large. As the entire key participates in the address calculation,
if the key is large, then it is very difficult to store the square of it as the square of key should not
exceed the storage limit
So mid-square is used when the key size is less than or equal to 4 digits
The difficulty of storing larger numbers square can be overcome if for squaring
we use few of digits of key instead of the whole key 14
We can select a portion of key if key is larger in size and then square the portion of it
Keys and addresses using extracting few digits, squaring them, and again extracting mid
15
5-
1
1
• Step 3. Divide by the size of the address space (preferably a prime number.)
• dividing by a number that has many small factors may result in lots of collisions.
When keys are serial, they vary in only last digit and this leads to the creation of synonyms
Rotating key would minimize this problem. This method is used along with other methods
Here, the key is rotated right by one digit and then use of folding would avoid synonym
For example, let the key be 120605, when it is rotated we get 512060
Then further the address is calculated using any other hash function
19
Some Other Hashing Methods
• Radix Transformation
• Transform the number into another base and then divide
by the maximum address.
If Hash(Key1) = Hash(Key2)
then
Key1 and Key2
are
synonyms
and collision happens
Consider the hash value is the RRN and
we working on fixed length records 21
Distribution of Records among Addresses
23
1.1. open addressing
Progressive Overflow or linear probing
H = F(key)
is the home address. If it is available we store the record, otherwise, we increase H by k,
H = (H + k) mod tableSize, (k ≥1)
Collision Resolution: Progressive Overflow
Any
suggestion !!
Collision Resolution: Progressive Overflow
• Advantage
• Simplicity
• Disadvantage
• If there are lots of collisions, clusters of records can form as in the previous
example.
1.2. Quadratic Probing
H = F(key)
H= (H + i2 )% tablesize, i≥ 𝟏
6+4 = 10
1.3. Double Hashing
Double Hashing uses nonlinear probing by computing different probe increments for different keys.
It uses two functions.
The first function computes the original address, if the slot is available (or the record is found) we stop there,
otherwise, we apply the second hashing function to compute the step value.
1. Open addressing
The first collision resolution method, open addressing, resolves collisions in the home area. When a collision
occurs, the home area addresses are searched for an open or unoccupied element where the new data can be
placed. Examples of Open Addressing Methods:
1.1. Linear probing or progressive overflow
1.2. Quadratic probing
1.3. Double hashing
34
Hashing
part2
Lec13- spring 2018
PRESENTER BY: DR EMAD NABIL
Lecture Overview
• Introduction to Hashing
• Hash functions
• Distribution of records among addresses
• synonyms and collisions
• Collision resolution by progressive overflow or linear probing
1. Open addressing
The first collision resolution method, open addressing, resolves collisions in the home area. When a collision occurs,
the home area addresses are searched for an open or unoccupied element where the new data can be placed.
Examples of Open Addressing Methods:
1.1. Linear probing
1.2. Quadratic probing
1.3. Double hashing
3
Address space
Hashing Example 0 999
Offset of Lowell =
4*Record_Size
2.Hashing with buckets
###
• In every bucket, there should be a counter that count the number of used
slots.
• Be careful of infinite loop if the whole table is full (buckets with progressive overflow)
1. Shifting records that follow a tombstone to its home address
2. Complete reorganization (Rehashing )
3. Use different collision resolution technique
3. Chained progressive overflow
Data Next
20
21
22
23
24
25
..
Data Next
20
21
22
23
24
25
..
Data Next
20
21
22
23
24
25
..
Data Next
20
21
22
23
24
25
..
Data Next
20
21
22
23
24
25
..
Data Next
20
21
22
23
24
25
..
The key behind enhancement:
grouping similar keys in one cluster
20
21
22
23
24
25
..
Data Next
20
21
22
23
24
25
..
Data Next
20
21
22
23
24
25
..
Data Next
20
21
22
23
24
25
..
Data Next
20
21
22
23
24
25
..
Two
clusters
The problem is Overlapped
that the home
address of dean,
which is the head
of a cluster, is not
free
4. Chaining with separate overflow area
Bucket size=1, may be
more than one slot in
bucket
5. Hashed index 5
Applicable
Hash value Record offset inside file to variable
length
records
Patterns of records access
1. Open addressing
The first collision resolution method, open addressing, resolves collisions in the home area. When a collision occurs,
the home area addresses are searched for an open or unoccupied element where the new data can be placed.
Examples of Open Addressing Methods:
1.1. Linear probing
1.2. Quadratic probing
1.3. Double hashing
38
Introduction to Multilevel Indexing
Lec. 14
Dr. Emad Nabil
Spring 2019
Lecture Overview
• In AVL tree: the heights of the two child subtrees of any node differ by at most
one
offset
offset
offset
offset
offset
offset
Insert record with key
smaller than all keys
Shift all
While multi-record multi-level indexes really help reduce the number of disk accesses
and their overhead space costs are minimal, inserting a new key or deleting an old one
is very costly.
The problem of indexes in insertion and deletion (shifting)
Solution 4
B-Trees: An Overview
m/2 m/2
… … Level= 2 2*m/2
… …
m/2 m/2 m/2 m/2
… …. … … …. … Level= 3
… … … … 2*m/2*m/2= 2*(m/2)2
…
…
..
Level= d
= 2*(m/2)d-1
The worst case depth occurs when every node contains has minimum number of children
A tree with N element where every node has minimum number of children max number of levels (d) =
B Tree
Lec 15
Dr Emad Nabil
Spring 2018
Formal Definition of B-Tree Properties
• In a B-Tree of order m,
• Every page has a maximum of m descendants
• Every page, except for the root and leaves, has at least 𝒎/𝟐
descendants.
• The root has at least two descendants (unless it is a leaf).
• All the leaves appear on the same level.
• The leaf level forms a complete, ordered index of the associated
data file.
• B-Tree is self-balancing
Insertion in B-Tree
• To insert a new element, search for the leaf node where the new element
should be added.
• If the node contains fewer elements than the maximum capacity. Insert this
element into it in order.
o A single median is chosen from among the leaf's elements and the new element.
o Values less than or equal the median are put in the new left node and values greater
than the median are put in the new right node.
o The median value is inserted in the node's parent, which may cause it to be split, and
so on. If the node has no parent (i.e., the node was the root), create a new root
above this node (increasing the height of the tree).
Deletion in B-Tree
Deletion from a node
Search for the value to delete
If the value is in a leaf node, simply delete it from the node.
If underflow happens, re-balance the tree as described in "Re-balancing after
deletion".
Rebalance after deletion
o Let's call the node containing the deleted element as “target node”
o If the left sibling of the target node has more than the minimum number of
elements, move the largest element from the left sibling to the target node. Update
the parent if needed. The tree is now balanced.
o Else if the right sibling of the target node has more than the minimum number of
elements, do the above step with the right sibling. The tree is now balanced.
o If both left and right siblings contain exactly the minimum number of elements.
If there is a left sibling, move all remaining elements from the target node to the left sibling.
The target node now is empty and the left sibling is full.
Else, do the above step with the right sibling.
Notes
• If order in a B-tree is m then
• Min number of nodes in root = 2
• Min number of elements in any other node = 𝑚/2
• In case of overflow the you should propagate median to parent node, in his
case Median = (𝑚 + 1)/2
Example2: Insert the following
sequence into B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4
3 4 19 20
Insert 1
1 3 4 19 20
split
4 20
1 3 4 19 20
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11 4 20
Order m =4
1 3 4 19 20
4 20
Insert 13
1 3 4 13 19 20
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11 4 20
Order m =4
1 3 4 13 19 20
Insert 16
4 20
1 3 4 13 16 19 20
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4 4 20
1 3 4 13 16 19 20
Insert 9
4 16 20
1 3 4
9 13 16 19 20
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11 4 16 20
Order m =4
1 3 4
9 13 16 19 20
Insert 2
4 16 20
1 2 3 4
9 13 16 19 20
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4 4 16 20
1 2 3 4
Insert 23 19 20
9 13 16
4 16 23
1 2 3 4 Parent is
updated
9 13 16 19 20 23
4 16 23
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4
1 2 3 4
9 13 16 19 20 23
Insert 14
4 16 23
1 2 3 4
9 13 14 16 19 20 23
Insert the following numbers in B-tree 4 16 23
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4
1 2 3 4 19 20 23
9 13 14 16
Insert 7
4 13 16 23
1 2 3 4
19 20 23
division
7 9 13 14 16
7 9 13 14 16
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4
4 13 16 23
19 20 23
1 2 3 4
7 9 13 14 16
Insert 21
4 13 16 23
1 2 3 4
19 20 21 23
7 9 13 14 16
Insert the following numbers in 4 13 16 23
B-tree: 3,19, 4, 20, 1, 13, 16, 9,
2, 23, 14, 7, 21, 18, 11
Order m =4
1 2 3 4
19 20 21 23
7 9 13
14 16
Insert 18
4 13 16 23
20 18 19 20 21 23
1 2 3 4
division
18 19 20 21 23
7 9 13 14 16
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4
16 23
4 13 16 20 23
division
4 13 16 20 23
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4
16 23
4 13 16 20 23
1 2 3 4
18 19 20 21 23
7 9 13 14 16
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4
16 23
Insert 11 4 13 16 20 23
1 2 3 4
18 19 20 21 23
7 9 11 13 14 16
Deleting a node from B tree
16 23
Delete 21 4 13 16 20 23
1 2 3 4
18 19 20 21 23
7 9 11 13 14 16
16 23
Delete 21 4 13 16 20 23
1 2 3 4
18 19 20 21 23
7 9 11 13 14 16
Underflow,
borrow
element from
left sibling
Update
parent
16 23
Delete 21 4 13 16 19 23
1 2 3 4
18 19 20 23
7 9 11 13 14 16
DONE
16 23
Delete 20 4 13 16 19 23
1 2 3 4
18 19 20 23
Underflow, can’t
7 9 11 13 14 16 borrow element from
left sibling, there is no
right sibling, solution=>
merge with left sibling
Underflow ,
borrow from
left sibling
16 23
Delete 20 4 13 16 23
1 2 3 4
18 19 23
7 9 11 13 14 16
13 23
Delete 20 4 13 16 23
1 2 3 4
18 19 23
7 9 11 13 14 16
DONE
Parent is
updated Parent is
updated
23
13
19
Delete 23 4 13 23
16
19
1 2 3 4
18 19 23
7 9 11 13 14 16
Try this @ home
Insert these numbers into B-tree
Oder = 3
Min number of nodes = 3/2 = 2
Median in case of overflow = 4/2 = 2
1–
2–file 3– I/O 4– Disk
Application
manager Processor Controller
Program
A journey of a byte
A journey of a byte
1►Application program
– Requests the I/O operation
• I/O Processor asks the disk controller if the disk drive is available for
writing
4► Disk Controller: A separate chip; instructs the drive to move R/W head
Disk Controller instructs the disk drive to move its read/write head to the
right track and sector.
• FM ask the I/O Processor to sore the data in the I/O buffer
• I/O Processor asks the disk controller if the disk drive is available for writing
• Disk Controller instructs the disk drive to move its read/write head to the right
track and sector.
Suppose in our program
we wrote: TEXT<< c;
will
A journey
of a byte
Buffer Management
► Buffering means working with large chunks of data in main memory so the
number of accesses to secondary storage is reduced.
• A new trip to get a sector of infile would only be done after all the
bytes in the previous sector had been read.
Buffering Strategies Suppose program needs to write some data on HD,
program fills buffer1 buff1 .written to HD
while writing buffer1 to HD,
1► Double Buffering program fills buffer2 buff1 .written to HD and so on.
Written to
Written to
Written to
Written to
Written to
Buffering Strategies
2►Buffer pooling:
– There is a pool of buffers.
– When a request for a sector is received, O.S. first looks to see that sector is in some
buffer.
– If not there, it brings the sector to some free buffer. If no free buffer exists, it must
choose an occupied buffer, free it, then reuse. usually LRU (least recently used )strategy
is used(replaces the page that hasn’t been referenced for the longest time). In other words, choose the buffer
that least used, free it, then reuse.
Buffering Strategies
Move Mode (using both system buffer & program buffer)
Situation in which data must be always moved from system buffer to
program buffer and vice versa, so that the OS can manipulate the
program data.
Application OS
Buffering Strategies
Locate Mode (using system buffer only or program buffer only to handle application I/O operations )
Ex: perform I/O directly between secondary storage and Example pf
using only
program buffer or system buffer system
buffers
Application OS
program buffer