Lectures 1-10 (7 Files Merged)


File Organization and Processing

Lec. 1
Dr. Emad Nabil
Spring 2018
Grading Scheme
• Course work and MT (40%)
• Course work includes practical assignments and a project
• Cheating will be penalized with negative marks
• No late submissions
• A final exam (60%)

• Course Page: http://www.acadox.com/class/56663#resources


Overview
• All programs should be written in C++
• Apply OOP concepts in all of your program
• Modular design
Course References

Lecture notes by Prof Lucia Moura


What languages does Google use internally?
• Java, JavaScript, C++, Python, Go, Sawzall (for processing log files), and probably a
few other languages are supported.
• Indexing and Search are mostly based on C++ and some Python.
• Ads is a Java/SQL stack.
• Google Apps (Gmail, Maps, Google+, …) is a JavaScript/Java/C++ stack.
• Go (programming language), for highly concurrent systems
Most (but not all) low-level server-side software is written in C++. Performance is of
crucial importance when you operate hundreds of thousands of servers - a 10% code
optimization can result in saving millions of dollars. In extremely rare cases, some
critical parts of the code are even written in assembly.

Java is used extensively for higher-level server-side code: consumer-facing UI, for
example. As anywhere, if you're working on UI at Google, JavaScript is a must.

“Python has been an important part of Google since the beginning, and remains so as the system grows and
evolves. Today dozens of Google engineers use Python, and we're looking for more people with skills in this
language”
Peter Norvig, director of search quality at Google
Purpose of the course
► Objective of Data Structures was to teach ways of efficiently organizing and
manipulating data in main memory.

► In this course, you will learn equivalent techniques for organization and
manipulation of data in secondary storage.

► In the first part of the course, you will learn about "low level" aspects of file
manipulation (basic file operations, secondary storage devices)

► In the second part of the course, you will learn the most important high-level
file structure tools (indexing, co-sequential processing, B trees, Hashing, etc…).

► You will apply these concepts in the design of C ++ programs for solving
various file management problems
Course outline
1. Fundamental File Processing Operations.
2. Sequential and direct access.
3. Secondary Storage, physical storage devices: disks, tapes and CDROM.
4. Reclaiming space in files.
5. Internal sorting, binary searching, keysorting.
6. Cosequential processing and external sorting
7. Indexing
8. Multilevel indexing and B trees
9. Hashing
Data structure VS File Structure
Both involve:
Representation of Data
+
Operations for accessing data

►Difference:
– Data Structures deal with data in main memory
– File Structures deal with data in secondary storage device (File).
Computer Architecture
Memory Hierarchy
Memory Hierarchy
►On systems with 32-bit addressing, only 2^32 bytes (4 Giga) can be directly referenced in main
memory.

►On systems with 64-bit addressing, 2^64 bytes can be directly referenced in main memory.

►The number of data objects may exceed this number!

► This requires storage devices that retain information when the computer is restarted.
How Fast
Typical times for getting info (access):
– Main memory: ~120 nanoseconds = 120×10^-9 s
– Magnetic disks: ~30 milliseconds = 30×10^-3 s

►An analogy keeping same time proportion as above


– Looking at the index of a book: 20 seconds versus
– Going to the library: 58 days
Comparison
►Main Memory

– Fast (since electronic)


– Small (since expensive)
– Volatile (information is lost when power failure occurs)

►Secondary Storage

– Slow (since electronic and mechanical)


– Large (since cheap)
– Stable, persistent (information is preserved longer)
Goal of the course
► Enhancing the search on secondary storage.

► Minimize number of trips to the disk in order to get desired


information. Ideally, get what we need in one disk access, or with as
few disk accesses as possible.

►Grouping related information so that we are likely to get everything


we need with only one trip to the disk (e.g. name, address, phone
number, account balance).
History of file structure
1. In the beginning… it was the tape
– Sequential access
– Access cost proportional to size of file [Analogy to sequential access to array data structure]

2. Disks became more common


– Direct access [Analogy to access to position in array]

– Indexes were invented


• list of keys and pointers stored in a small file
• allows direct access to a large primary file

Great if the index fits into main memory.


As the file grows, we have the same problem we had with a large primary file
History of file structure
3. Tree structures emerged for main memory (1960s)
– Binary search trees (BSTs)
– Balanced, self-adjusting BSTs: e.g. AVL trees (1963)

4. A tree structure suitable for files was invented:


B-trees (1972) and B+ trees: good for accessing millions of records with
3 or 4 disk accesses.

5. What about getting info with a single request?


– Hash tables (theory developed over the 60's and 70's, but still a
research topic): good when files do not change too much over time.
File Organization and Processing
Lec. 2
Dr. Emad Nabil
Spring 2019
Contents
►Sample programs for file manipulation
►Physical files and logical files
►Opening and closing files
►Reading from files and writing into files
►How these operations are done in C and C++
►Standard input/output and redirection
What is a file?
A file is :

►A collection of data placed under permanent or non-volatile storage

►Examples: anything that you can store in a disk hard drive, tape,
optical media, and any other medium which doesn’t lose the
information when the power is turned off.
Where do file structures fit in computer science?
Physical file and logical files
► Physical file:
* physically exists on secondary storage;
* known by the operating system;
* appears in its file directory

► Logical file:
* what your program actually uses;
* a 'pipe' through which information can be extracted or sent.

►Operating system:
* get instruction from program or command line;
* link logical file with physical file or device
Physical file and logical files
Physical file and logical files
Opening file
Opening files

Opening file (modes)


Opening files
Closing file
Closing file
Closing file
Reading files in C++
Writing to files in C++
Detecting End of file
Logical file names associated to standard I/O
devices
#include <iostream>
#include <fstream>
using namespace std;

int main()
{
	char ch;
	fstream file; // declare fstream unattached
	char filename[20];
	cout << "Enter the name of the file: ";
	cin >> filename;              // Step 2
	file.open(filename, ios::in); // Step 3
	file.unsetf(ios::skipws);     // include white space in read
	                              // (without this line, the output has no spaces)
	while (1) // while(!file.fail())
	{
		file >> ch; // Step 4a
		if (file.fail()) break;
		cout << ch; // Step 4b
	}
	file.close(); // Step 5
	system("pause");
}
Demo
Example of reading and writing

#include <iostream>
#include <fstream>
#include <cstring>
using namespace std;

class Student
{
private:
	int number;
	char name[50];
	float gpa;
public:
	Student(int n = 0, const char *s = "", float g = 0);
	void save(ofstream& of);
	void load(ifstream& inf);
	void print();
};

Student::Student(int n, const char *s, float g)
{
	number = n;
	strcpy_s(name, s);
	gpa = g;
}

void Student::save(ofstream& of)
{
	of.write((char*)&number, sizeof(number)); // of << number;
	of.write(name, sizeof(name));
	of.write((char*)&gpa, sizeof(gpa));       // of << gpa;
}

void Student::load(ifstream& inf)
{
	inf.read((char*)&number, sizeof(number)); // inf >> number;
	inf.read(name, sizeof(name));
	inf.read((char*)&gpa, sizeof(gpa));       // inf >> gpa;
}

void Student::print()
{
	cout << "\n" << number << ", " << name << ", " << gpa;
}
int main()
{
Student me(11321, "Myself", 4.3);
ofstream outfile;
outfile.open("d:\\test.dat", ios::binary | ios::out);
me.save(outfile);
outfile.close();
Student me2; //calling for default constructor
ifstream infile;
infile.open("d:\\test.dat", ios::binary | ios::in);
me2.load(infile);
me2.print();
infile.close();
system("pause");
return 0;
}
Student me(11321, "Myself", 4.3);

void Student::save(ofstream& of)
{
	of.write((char*)&number, sizeof(number)); // int number
	of.write(name, sizeof(name));             // char name[50]
	of.write((char*)&gpa, sizeof(gpa));       // float gpa
}

File contents
Demo
File Organization and Processing
Lec. 3
Dr. Emad Nabil
Spring 2018
Revision (read/write functions)
File streams include two member functions specifically designed to read and write
binary data sequentially: write and read. The first one (write) is a member function
of ostream (inherited by ofstream). And read is a member function of istream
(inherited by ifstream). Objects of class fstream have both. Their prototypes are:

ostream& write (const char* s, streamsize n); //write ( memory_block, size );


istream& read (char* s, streamsize n); //read ( memory_block, size );

Where memory_block is of type char* (pointer to char) and represents the
address of an array of bytes where the read data elements are stored, or from
where the data elements are to be written.

The size parameter is an integer value that specifies the number of characters to
be read or written from/to the memory block.
Simple file
Output to file test.txt (this is how the file will look):
--------------------------------------------------------
Jones
Smith
Willis
Davis
File seek operation
• A fstream has 2 file pointers:
• get pointer (for input)
• put pointer (for output)
• file1.seekg ( byte_offset, origin); //moves get pointer
• file1.seekp ( byte_offset, origin); //moves put pointer

origin can be:
    ios::beg (beginning of file)
    ios::cur (current position)
    ios::end (end of file)
File seek operation

Ex:
file1.seekg ( 373, ios::beg);
meaning  moves get pointer 373 bytes from the beginning of file
File seek operation
Statement How it Affects the Read/Write Position
File.seekp(32L, ios::beg); Sets the write position to the 33rd byte (byte 32) from the beginning of the file.

file.seekp(-10L, ios::end); Sets the write position to the 11th byte (byte 10) from the end of the file.

file.seekp(120L, ios::cur); Sets the write position to the 121st byte (byte 120) from the current position.

file.seekg(2L, ios::beg); Sets the read position to the 3rd byte (byte 2) from the beginning of the file.

file.seekg(-100L, ios::end); Sets the read position to the 101st byte (byte 100) from the end of the file.

file.seekg(40L, ios::cur); Sets the read position to the 41st byte (byte 40) from the current position.

file.seekg(0L, ios::end); Sets the read position to the end of the file.
Ex-0

File contents: abcdefghijk

#include <iostream>
#include <fstream>
using namespace std;

int main()
{
	fstream file("d:\\test.txt", ios::in);
	char ch;

	file.seekg(5L, ios::beg);
	file.get(ch);
	cout << "Byte 5 from beginning: " << ch << endl; // prints: f

	file.seekg(-5L, ios::end);
	file.get(ch);
	cout << "Byte 5 from end: " << ch << endl;       // prints: g

	file.seekg(3L, ios::cur);
	file.get(ch);
	cout << "Byte 3 from current: " << ch << endl;   // prints: k

	file.close();
}
The tellp and tellg Member Functions
• tellp returns a long integer that is the current byte number of the
file’s write position.

• tellg returns a long integer that is the current byte number of the
file’s read position.
Example: Write a C++ program, with no loops, to copy the contents of the file
d:\\source.txt to the file d:\\destination.txt, assume that destination file is empty, don’t
use any loops.
int main ()
{
ifstream infile ("source.txt",ios::binary); ex1
ofstream outfile ("destination.txt",ios::binary);
infile.seekg (0,infile.end);
long size = infile.tellg();

infile.seekg (0);//by default seeks from beginning


char* buffer = new char[size];
infile.read (buffer,size);
outfile.write (buffer,size);

delete[] buffer; outfile.close(); infile.close();


return 0;
}
getline function
public member function
<istream> <iostream>

std::istream::getline
istream& getline (char* s, streamsize n );
istream& getline (char* s, streamsize n, char delim );

A null character ('\0') is automatically appended to the written sequence if n is
greater than zero, even if an empty string is extracted. The default delimiter is '\n'.

Examples:

char data[100];
cin.getline(data, 100);      // the default delimiter is '\n'
cin.getline(data, 100, '#'); // continues reading until '#' is found or 100
                             // chars are read; '\n' is read as a regular char

ifstream infile;
infile.getline(data, 100);
infile.getline(data, 100, '#');
Function <string>

std::getline (string)
istream& getline (istream& is, string& str, char delim);
istream& getline (istream& is, string& str);

Get line from stream into string, Extracts characters from is and stores them into str until the delimitation
character delim is found (or the newline character, '\n', for (2)).

the default delimiter is \n

fstream file;
string line;
file.open("d:\\simple.txt", ios::in);
getline(file, line); // the default delimiter is '\n'
file.close();
ex2

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

void simple_write_example()
{
	ofstream file;
	string str;
	//default ==> ios::out | ios::trunc
	file.open("d:\\simple.txt");

	getline(cin, str);
	str += '\n';
	file.write(str.c_str(), str.size());

	getline(cin, str);
	str += '\n';
	file.write(str.c_str(), str.size());

	getline(cin, str);
	str += '\n';
	file.write(str.c_str(), str.size());

	file.close();
}

void simple_read_example()
{
	ifstream file;
	string str;
	file.open("d:\\simple.txt"); //default ios::in

	/* getline has an optional third argument (a character) that ends
	   getline's input; the default value is '\n' */
	getline(file, str);
	cout << str << "!\n";

	getline(file, str);
	cout << str << "!\n";

	getline(file, str);
	cout << str << "!\n";

	file.close();
}

int main()
{
	simple_write_example();
	simple_read_example();
	return 0;
}
Example 3
D:\\simple.txt layout: each record is stored as [Length][text].

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

void simple_write_example()
{
	ofstream outfile;
	string str;
	//default ==> ios::out | ios::trunc
	outfile.open("d:\\simple.txt");
	size_t size;

	getline(cin, str);
	size = str.size();
	outfile.write((char*)&size, sizeof(size));
	outfile.write(&str[0], size);

	getline(cin, str);
	size = str.size();
	outfile.write((char*)&size, sizeof(size));
	outfile.write(&str[0], size);

	outfile.close();
}

void simple_read_example()
{
	ifstream infile;
	string str;
	infile.open("d:\\simple.txt"); //default ios::in
	size_t size; //size_t: alias of an unsigned integral type

	infile.read((char*)&size, sizeof(size));
	str.resize(size);
	infile.read(&str[0], size);
	cout << str << endl;

	infile.read((char*)&size, sizeof(size));
	str.resize(size);
	infile.read(&str[0], size);
	cout << str << endl;

	infile.close();
}

int main()
{
	simple_write_example();
	simple_read_example();
	return 0;
}

str.resize(n): (shorten / extend) resizes the string to a length of n characters.
size_t: unsigned integral type, alias of one of the fundamental unsigned integer
types; it is the type returned by sizeof.
ex4

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

void simple_read_and_write_example()
{
	fstream file;
	string line;
	file.open("d:\\simple.txt", ios::in | ios::out | ios::trunc);

	file << "This is a line 1" << endl;
	file << "This is a line 2" << endl;

	file.seekg(0, ios::beg);

	while (!file.eof())
	{
		getline(file, line);
		cout << line << endl;
	}

	file.close();
}

int main()
{
	simple_read_and_write_example();
	return 0;
}
File Organization
Lec4
Dr Emad Nabil
Spring 2019
Mixing cin and (get or getline)

#include <iostream>
using namespace std;

int main()
{
	char ch;
	int number;

	cout << "Enter a number: ";
	cin >> number;
	cout << number;

	cout << "\nEnter a character: ";
	ch = cin.get();
	cout << ch;

	cout << "Thank You!\n";

	system("pause");
	return 0;
}

Output
--------------------
Enter a number: 100
100
Enter a character:
Thank You!
Mixing cin and getline

The cin.get function begins reading the keyboard buffer where the previous input
operation stopped. That means that cin.get reads the leftover newline character,
without giving the user a chance to enter any more input.

Solution: cin.ignore();
Mixing cin and (get or getline)

#include <iostream>
using namespace std;

int main()
{
	char ch;
	int number;
	cout << "Enter a number: ";
	cin >> number;
	cout << number;
	cin.ignore(); // Skip the newline character
	cout << "\nEnter a character: ";
	ch = cin.get();
	cout << ch;

	cout << "\nThank You!\n";

	system("pause");
	return 0;
}

Output
--------------------
Enter a number: 100
100
Enter a character: A
A
Thank You!
ex6 (Ctrl+Z is interpreted as EOF, end-of-file)

#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main()
{
	string str;
	vector<string> v1;
	cout << "Enter a sentence, press ENTER between sentences. (Ctrl-Z to stop): " << endl;

	while (!cin.eof())
	{
		getline(cin, str); // Default delimiter is the newline character.
		v1.push_back(str);
	}
	cout << "The following input was stored with newline delimiter:" << endl;

	for (const string p : v1)
		cout << p << endl;

	cin.clear(); // clear the eof flag set by Ctrl+Z

	char data[100];
	cin.getline(data, 100, '#'); // continues reading until '#' is found or 100
	                             // chars are read; '\n' is read as a regular char
	cout << data;
}
Get and put functions (ex7)

#include <iostream>
#include <fstream>
using namespace std;

int main()
{
	fstream dataFile("d:\\sentence.txt", ios::out);
	char ch;
	cout << "Type a sentence and be sure to end it with a full stop .\n";
	while (1)
	{
		cin.get(ch);      // get and put recognize spaces
		dataFile.put(ch); // and newlines
		if (ch == '.')
			break;
	}
	dataFile.close();
	system("pause");
}
File types
A file can be treated as
• a stream of bytes, e.g. image files, sound files
• a collection of records with fields
Field and Record Organization
► Field: a data value; the smallest unit of data with logical meaning.
► Record: a group of fields that forms a logical unit.
► Key: a subset of the fields in a record used to uniquely identify the record.
Each line of the file is a record. Fields in each record: ISBN Number (the key),
Author Name, Book Title.

87358CARROLL ALICE IN WONDERLAND
03818FOLK    FILE STRUCTURES
79733KNUTH   THE ART OF COMPUTER PROGRAMMING
86683KNUTH   SURREAL NUMBERS
18395TOLKIEN THE HOBITT
Primary and secondary keys
►Primary Key : A key that uniquely identifies a record.
►Secondary Key: Other keys that may be used for search

►Note that
• In general not every field is a key
• Keys correspond to fields, or combination of fields, that may be used
in a search
Methods of field organization
1. Fixed length
2. Begin each field with its length indicator
3. Delimiters to separate fields
4. "keyword=value" identifies each field and its content
Methods of field organization
1- Fixed length fields
• Like in our file of books (field lengths are 5, 7, and 25); the dashes stand for padding spaces.
• 87358CARROLL ALICE IN WONDERLAND---------
• 03818FOLK - - - FILE STRUCTURES------------------
• 86683KNUTH- - SURREAL NUMBERS---------------
• 18395TOLKIEN THE HOBITT--------------------------
Methods of field organization
2-Length indicator
(the digits before each field give the field length; each line is one record)
• 058735807CARROLL19ALICE IN WONDERLAND
• 050381804FOLK15FILE STRUCTURES
• 058668305KNUTH15SURREAL NUMBERS
• 051839507TOLKIEN10THE HOBITT
Methods of field organization
3-using delimiters
• Add delimiter at the end of each field

• 87358|CARROLL|ALICE IN WONDERLAND|
• 03818|FOLK|FILE STRUCTURES|
• 86683|KNUTH|SURREAL NUMBERS|
• 18395|TOLKIEN|THE HOBITT|
Methods of field organization
4-using (key=value)

• ISBN=87358|AU=CARROLL|TI=ALICE IN WONDERLAND|
• ISBN=03818|AU=FOLK|TI=FILE STRUCTURES|
• ISBN=86683|AU=KNUTH|TI=SURREAL NUMBERS|
• ISBN=18395|AU=TOLKIEN|TI=THE HOBITT|
Field structures: advantages and disadvantages

Type             | Advantages                      | Disadvantages
-----------------|---------------------------------|--------------------------------
Fixed width      | Easy to read/store              | Wastes space with padding
Length indicator | Easy to jump ahead to the end   | Long fields require more than
                 | of the field                    | 1 byte to store the length
                 |                                 | (max for one byte is 255)
Delimited fields | May waste less space than with  | Have to check every byte of the
                 | length-based fields             | field against the delimiter
Keyword          | Fields are self-describing;     | Wastes space with keywords
                 | allows for missing fields       |
Record Organization
1. Fixed-length record (gives fixed-length records)
2. Records with a fixed number of fields
3. Begin each record with its length indicator (variable-length records)
4. Use delimiters to separate records (variable-length records)
5. Use an index to keep record addresses
Record Organization
1-Fixed length record
• Internal organization may be
• Fixed length field

• Delimited fields

• Using length indicator


Record Organization
2-Fixed number of fields
• Internal organization may be
• Delimited fields

• Using length indicator


Record Organization
3-Begin record with its Length indicator
• Internal organization may be
• Delimited fields

• Using length indicator

Record Organization
4- Using Delimiters to separate records

* The most common delimiter is the end-of-line character, as it makes the file readable.

* Any other character can be used as a delimiter.


Record Organization
5- Using an index to keep track of records.

The keys in the index are sorted.


Fixed length vs. variable length records
File Organization and Processing
Lec. 5
Dr. Emad Nabil
Spring 2018
Methods of field organization (recap)

Fixed length:
03818FOLK - - - FILE STRUCTURES------------------

Length indicator:
050381804FOLK15FILE STRUCTURES

Delimited fields:
03818|FOLK|FILE STRUCTURES|

Keyword=value:
ISBN=03818|AU=FOLK|TI=FILE STRUCTURES|
Methods of Record Organization
• 1-Fixed length record

• 2-Fixed number of fields

• 3-Begin record field with its Length indicator

• 4- Using Delimiters to separate records

• 5-Index
Sequential access vs Direct Access
Sequential Search
– Look at records sequentially until the matching record is found.
– Time is O(n) for n records.
– Appropriate for pattern matching, or for files with few records.
Direct Access
– Being able to seek directly to the beginning of the record.
– Time is O(1).
– Possible when we know the Relative Record Number (RRN):
– The first record has RRN 0, the next has RRN 1, etc.
Direct access using RRN
Requires records of fixed length.
– RRN=30 (31st record)
– Record length = 101 bytes
– Byte offset = 30 × 101 = 3030

►Now, how do we go directly to byte 3030 in the file?

– By seeking: f.seekg(3030, ios::beg)
Fixed length record, fixed length field
#include <iostream>
#include <fstream>
#include<string>
using namespace std;

struct Inventory
{

int qty;
float price;
char desc[31];
};
void AddRecord(fstream & out)
{

Inventory record;

cout << "Enter Quantity: "; cin >> record.qty;


cout << "Enter Price: "; cin >> record.price;

cin.ignore();
cout << "Enter Desc: "; cin.getline(record.desc, 31);

out.write((char*)&record, sizeof(record));

}
void readall(fstream & in)
{
Inventory record;
in.seekg(0, ios::beg);
in.read((char*)&record, sizeof(record));

do{
cout << "quantity: "<<record.qty << endl;
cout << "price: " << record.price << endl;
cout << "desc: " << record.desc << endl;
in.read((char*)&record, sizeof(record));
} while (!in.eof());

in.clear(); //to be able to read again.

}
void UpdateRecord(fstream & io, int rrn)
{
Inventory record;
io.seekg(rrn*sizeof(record), ios::beg);
io.read((char *)&record, sizeof(record));

cout << "the data at RRN:" << rrn << " is \n";
cout << "quantity: " << record.qty << endl;
cout << "price: " << record.price << endl;
cout << "desc: " << record.desc << endl;

cout << "enter new data\n";


cout << "Enter Quantity: "; cin >> record.qty;
cout << "Enter Price: "; cin >> record.price;
cin.ignore();
cout << "Enter Desc: "; cin.getline(record.desc, 31);

io.seekp(rrn*sizeof(record), ios::beg);
io.write((char*)&record, sizeof(record));

}
int main()
{
string path = "d:\\t1.txt";
fstream f(path, ios::out|ios::in | ios::trunc | ios::binary);

AddRecord(f);
AddRecord(f);
AddRecord(f);

readall(f);

UpdateRecord(f, 1);
readall(f);

f.close();
system("pause");
return 0;
}
Length indicator based record, delimited fields

#include <iostream>
#include <fstream>
#include <strstream>
using namespace std;

const int maxBuffersize = 200; // buffer size

struct person // data record
{
	char last[11];    // max 10 chars
	char first[11];   // max 10 chars
	char address[16]; // max 15 chars
	char city[16];    // max 15 chars
	char state[3];    // max two chars
	char zip[10];     // max 9 chars
};
// usage: person c; cin >> c;
istream& operator>>(istream & in, person &p)
{
	/* A null character ('\0') is automatically appended to
	   every attribute of the person, so the max input length should be
	   less than the array length */

	cout << "\n Enter last name: ";  in.getline(p.last, 11);
	cout << "\n Enter first name: "; in.getline(p.first, 11);
	cout << "\n Enter address: ";    in.getline(p.address, 16);
	cout << "\n Enter City: ";       in.getline(p.city, 16);
	cout << "\n Enter state: ";      in.getline(p.state, 3);
	cout << "\n Enter zip code: ";   in.getline(p.zip, 10);
	return in;
}
// writing a record to the file
int writeperson(ofstream & stream, person & p)
{
	char buffer[maxBuffersize];
	strcpy(buffer, p.last);    strcat(buffer, "|");
	strcat(buffer, p.first);   strcat(buffer, "|");
	strcat(buffer, p.address); strcat(buffer, "|");
	strcat(buffer, p.city);    strcat(buffer, "|");
	strcat(buffer, p.state);   strcat(buffer, "|");
	strcat(buffer, p.zip);     strcat(buffer, "|");

	short length = strlen(buffer);
	stream.write((char*)&length, sizeof(length));
	stream.write(buffer, length);
	return length;
}
// reading a record from the file
void readperson(ifstream & stream, person & p)
{
	short length;
	stream.read((char *)&length, sizeof(length));
	char* buffer = new char[length];
	stream.read(buffer, length);

	istrstream strbuf(buffer);
	strbuf.getline(p.last, 11, '|');    strbuf.getline(p.first, 11, '|');
	strbuf.getline(p.address, 16, '|'); strbuf.getline(p.city, 16, '|');
	strbuf.getline(p.state, 3, '|');    strbuf.getline(p.zip, 10, '|');

	cout << p;
	delete [] buffer;
}

// usage: person c; cout << c;
ostream& operator<<(ostream & os, person &p)
{
os << p.last << endl;
os << p.first << endl;
os << p.address << endl;
os << p.city << endl;
os << p.state << endl;
os << p.zip << endl;
return os;

}
int main()
{
ofstream f;
f.open("d:\\t1.txt");

person c; cin >> c;


writeperson(f, c);
f.close();

ifstream m;
m.open("d:\\t1.txt");
readperson(m, c);
m.close();

return 0;
}
input                      output file (in binary mode)
Enter last name: Hany      Hany
Enter first name: Sami     Sami
Enter address: Cairo       Cairo
Enter City: Nasr City      Nasr City
Enter state: OK            OK
Enter zip code: 55664      55664
Assignment 1
Write a C++ program to store the data of 10 books.
Book attributes are:
ISBN (5 chars), TITLE, AUTHOR NAME, PRICE, YEAR, NUM OF PAGES
- Store 10 books using delimited records, variable-length fields.
- Assume any other information you need.
- Operations:
  - Add book, delete book (by ISBN), update book (search by ISBN), print book (by title),
    print all books.

- Deadline: Friday 10:30 PM, 1 March 2019
- Submission on Acadox only
File Organization and Processing
lec#6 :Reclaiming Spaces in Files
Dr. Emad Nabil
Spring 2018
Reclaiming Spaces in Files
• What happens if records need to be deleted?
• How to delete records and reuse the unused space?
Strategies for record deletion

1. Record Deletion and Storage Compaction

2. Deleting Fixed-Length Records and Reclaiming Space


Dynamically

3. Deleting Variable-Length Records and Reclaiming Space


Dynamically
Strategies for record deletion
1. Record Deletion and Storage Compaction

• Deletion can be done by marking a record as deleted.

• Note that the space for the record is not released, but the program that
manipulates the file must include logic that checks whether a record is deleted or not.

• After a lot of records have been deleted, a special program is used to
squeeze (compress) the file; this is called Storage Compaction.
Strategies for record deletion
2. Deleting Fixed-Length Records and Reclaiming Space Dynamically

• How to use the space of deleted records for storing records that are added later?

• Use an “AVAIL LIST”, a linked list of available records.

• A header record stores the beginning of the AVAIL LIST

• When a record is deleted, it is marked as deleted and inserted into the AVAIL LIST.
2. Deleting Fixed-Length Records and Reclaiming Space Dynamically

List Header RRN = -1

Eslam Mazen Ahmed Cramer Omar Ali


0 1 2 3 4 5

RRN The file with no deletion


2. Deleting Fixed-Length Records and Reclaiming Space Dynamically

List Header RRN = 2

Eslam Mazen *-1 Cramer Omar Ali


0 1 2 3 4 5

After deleting the record with RRN = 2


2. Deleting Fixed-Length Records and Reclaiming Space Dynamically

List Header RRN = 4

Eslam Mazen * -1 Cramer *2 Ali


0 1 2 3 4 5

After deleting the record with RRN = 4


2. Deleting Fixed-Length Records and Reclaiming Space Dynamically

List Header RRN = 2

Eslam Mazen * -1 Cramer Hany Ali


0 1 2 3 4 5

After adding a new record  Hany


2. Deleting Fixed-Length Records and Reclaiming Space Dynamically

List Header RRN = -1

Eslam Mazen shaker Cramer Hany Ali


0 1 2 3 4 5

After adding a new record  Shaker


Strategies for record deletion
3. Deleting Variable-Length Records and Reclaiming Space Dynamically

– Use an AVAIL LIST as before, but take care of the variable-length difficulties

– The records in the AVAIL LIST must store their size as a field.

– RRN cannot be used; the exact byte offset must be used.

– Addition of records must find a large enough record in AVAIL LIST.


3. Deleting Variable-Length Records and Reclaiming Space Dynamically

List Header =-1 Record


length
8 8 7 9 7 6
Eslam|m| Mazen|m| Amal|f| Cramer|m| Eman|f| Ali|m|
Byte
0 8 16 23 32 39 offset

File with no deletion


3. Deleting Variable-Length Records and Reclaiming Space Dynamically
Size of the
deleted
record
List Header= 8 Record
length
8 8 7 9 7 6
Eslam|m| *-1|8| Amal|f| Cramer|m| Eman|f| Ali|m|
Byte
0 8 16 23 32 39 offset

Offset of next deleted record.
Delete the record with name Mazen.
Hint: -1 means there are no other deleted records.
3. Deleting Variable-Length Records and Reclaiming Space Dynamically

List Header= 23 Record


length
8 8 7 9 7 6
Eslam|m| *-1|8| Amal|f| *8|9| Eman|f| Ali|m|
Byte
0 8 16 23 32 39 offset

Delete record with name Cramer


3. Deleting Variable-Length Records and Reclaiming Space Dynamically

Add record with name L with gender male :option one


List Header= 8 Record
length
8 8 7 9 7 6
Eslam|m| *-1|8| Amal|f| L|m|- - - - Eman|f| Ali|m|
Byte
0 8 16 23 32 39 offset

The new record has length 4, the old one has 9, so there are 5 unused
bytes → internal fragmentation.
3. Deleting Variable-Length Records and Reclaiming Space Dynamically

Add record with name L with gender male :option two


List Header= 8 Record
length
8 8 7 9 7 6
Eslam|m| *-1|8| Amal|f| L|m| ----- Eman|f| Ali|m|
Byte
0 8 16 23 32 39 offset

• The new record has length 4, the old one has 9, so there are 5 unused bytes.
• Return the 5 bytes to the avail list.
3. Deleting Variable-Length Records and Reclaiming Space Dynamically

Add record with name L with gender male :option two


List Header= 27 Record
length
8 8 7 9 7 6
Eslam|m| *-1|8| Amal|f| L|m|*8|5| Eman|f| Ali|m|
Byte
offset
0 8 16 23 27 32 39

• If the leftover space is too small to store any new record, it is
called external fragmentation.
Types of fragmentation
We must consider two types of fragmentation within a file:
►Internal Fragmentation
– wasted space within a record.
►External Fragmentation
– space is available at AVAIL LIST, but it is so small that cannot be
reused.
3. Deleting Variable-Length Records and Reclaiming Space Dynamically

Placement strategies for new records


There are several strategies for selecting a record from AVAIL LIST
when adding a new record:
1. First-Fit Strategy
– AVAIL LIST is not sorted by size.
– First record large enough to hold new record is chosen.
► Example:
– AVAIL LIST: size=10,size=50,size=22,size=60
– record to be added: size=20
– Which record from AVAIL LIST is used for the new record?
* size=50
3. Deleting Variable-Length Records and Reclaiming Space Dynamically

Placement strategies for new records


2. Best-Fit Strategy
– AVAIL LIST is sorted by size (Ascending )
– Smallest record large enough to hold new record is chosen.
► Example:
– AVAIL LIST: size=10,size=22,size=50,size=60
– record to be added: size=20
– Which record from AVAIL LIST is used for the new record?
* size=22
3. Deleting Variable-Length Records and Reclaiming Space Dynamically

Placement strategies for new records


3. Worst-Fit Strategy
– AVAIL LIST is sorted by decreasing order of size. (descending )
– Largest record is used for holding new record;
– unused space is placed again in AVAIL LIST.
► Example:
– AVAIL LIST: size=60,size=50,size=22,size=10
– record to be added: size=20
– Which record from AVAIL LIST is used for the new record?
* size=60,
*the rest 40 bytes are returned back to avail list.
Example
►What are the possible solutions when the added record is smaller than the item taken from the AVAIL LIST by a few bytes?

Sol#1 ►Leave the space unused within the record
– type of fragmentation: internal
– suitable placement strategy: best-fit

Sol#2 ►Return the unused space as a new available record to the AVAIL LIST
– type of fragmentation: external
– suitable placement strategy: worst-fit
Ways of compacting external fragmentation
►Compacting the holes
– if two records in the AVAIL LIST are adjacent, combine them into a larger record

►Minimize fragmentation by using one of the previously mentioned placement strategies
– for example: the worst-fit strategy is better than the best-fit strategy in terms of external fragmentation when unused space is returned to the AVAIL LIST
File organization
Secondary Storage Devices
Lec#7
Presenter: Dr Emad Nabil
Spring 2019
Contents
►Secondary storage devices
►Organization of disks
►Organizing tracks by sector
►The cost of a disk access
►Disk as a bottleneck
Secondary Storage Devices
Two major types of storage devices

– Direct Access Storage Devices (DASDs)


• Magnetic Disks : Hard Disks (high capacity, low cost per bit)
• Optical Disks : CD-ROM,DVD-ROM :(Read-only/write-once, holds a lot of data, cheap)

– Serial Devices
• Magnetic Tapes
Magnetic disks
Magnetic disks
Magnetic disks
Magnetic disks
►Magnetic disks support
direct access to a desired
location

– Tracks
– Platters
– Cylinder
– Sectors
– Disk heads
– Disk Controller
– Seek Time
– Rotational delay
Magnetic disks
►The platters spin (7200 rpm)

►The arm assembly is moved in or out to


position a head on a desired track. Tracks under
heads make a cylinder (imaginary!).

►Only one head reads/writes at any one time


Cylinder
• Cylinder: the set of tracks on a disk that are
directly above/below each other

• All the information on a cylinder can be


accessed without moving the read/write
arm (seeking)
Disk capacity

►Track capacity = number of sectors per track × bytes per sector
►Cylinder capacity = number of surfaces × track capacity
►Drive capacity = number of cylinders × cylinder capacity

Hint: number of cylinders = number of tracks on a surface
Disk capacity
►Disk characteristics (specifications) of a floppy disk (FD):
► 80 tracks per surface, 2 surfaces
► 100 sectors per track
► Sector size = 512 bytes
►A. Compute track, cylinder, surface, and disk capacities.
► Track capacity = # of sectors/track × sector size = 100 × 512 bytes = 51,200 B = 50 KB
► Cylinder capacity = # of tracks/cylinder × track capacity = 2 × 50 = 100 KB
► Surface capacity = # of tracks/surface × track capacity = 80 × 50 = 4,000 KB
► Disk capacity = # of cylinders × cylinder capacity = 80 × 100 = 8,000 KB
Disk capacity
File characteristics:
– fixed-length records
– number of records = 50,000
– record size = 256 bytes

Disk characteristics:
– bytes per sector = 512
– sectors per track = 63
– tracks per cylinder = 16
– number of cylinders = 4092

►How many cylinders are needed?
Cylinder capacity = number of surfaces × track capacity = 16 × 63 × 512 = 516,096 bytes
Number of required cylinders = file size / cylinder capacity = (50,000 × 256) / 516,096 ≈ 24.8 cylinders
Organizing tracks (sectors, clusters, extents, fragmentation)

►The file manager is the part of the operating system responsible for
managing files

►The file manager maps the logical parts of the file into their physical
location

►A cluster is a fixed number of contiguous sectors


Organizing tracks (sectors, clusters, extents, fragmentation)

►The file manager allocates an integer number of


clusters to a file.

An example: Sector size: 512 bytes, Cluster size: 2 sectors

– If a file contains 10 bytes, a cluster is allocated (1024 bytes).


– There may be unused space in the last cluster of a file.
– This unused space contributes to internal fragmentation
Organizing tracks (sectors, clusters, extents, fragmentation)
►Clusters are good since they improve sequential access: reading bytes sequentially from a cluster can be done in one revolution, seeking only once.
►The file manager maintains a file allocation table (FAT) that records, for each cluster of a file, its location on disk.

►An extent is a group of contiguous clusters.


►If a file is stored in a single extent, then seeking is done only once.
►If there are not enough contiguous clusters to hold a file, the file is divided into 2 or more extents.
– If the file size is not a multiple of the cluster size, the last cluster will be partially used (a source of internal fragmentation).
Illustration: a cluster (a fixed number of contiguous sectors)
How to choose the cluster size?
►Some OS allow the system administrator to choose the cluster size.

►When to use large cluster size?


– When disks contain large files likely to be processed sequentially.
– Example: Updates in a master file of bank accounts (in batch mode)

►What about small cluster size?


– When disks contain small files and/or files likely to be accessed randomly
– Example : online updates for airline reservation
The bottleneck
►When a program reads a byte from the disk, the operating system locates the surface, the track, and the sector containing that byte, and reads the entire sector into a special area in main memory called a buffer.

►The bottleneck of a disk access is moving the read/write arm.


So, it makes sense to store a file in tracks that are below/above each other
in different surfaces (cylinder), rather than in several tracks in the same
surface.
Disk Access cost
►Time to access (read/write) a disk block
– seek time (moving arms to position disk head on track)
– rotational delay (waiting for block to rotate under head)
– transfer time (actually moving data to/from disk surface)

►Seek time and rotational delay dominate:
– seek time varies from 1 to 20 msec
– rotational delay varies from 1 to 10 msec
– transfer time is about 1 msec for a 4KB page

►Key to lower I/O cost: reduce seek/rotation delays:


Hardware vs. Software solutions?
Disk as a bottleneck
► Processes are often disk-bound
► network and CPU have to wait a long time for the disk to transmit data
► Various techniques to solve this problem examples are :
1. Multiprocessing: (CPU works on other jobs while waiting for the disk)
2. Disk Striping:
► Putting different blocks of the file in different drives.
► Independent processes accessing the same file may not interfere with each other (parallelism)
3. Cache memory :
► Large block of memory configured to contain pages of data from a disk.
► When data is requested from disk, first the cache is checked.
► If data is not there (miss) the disk is accessed
Binary Searching,
Key Sorting & Indexing
Lec#8
Presented by Dr. Emad Nabil
Spring 2019
Contents
►Binary Searching
►Key sorting
►Introduction to Indexing
Binary Search
►Let us consider fixed-length records that must be searched by a key
value
►If we knew the RRN of the record identified by this key value, we could
jump directly to the record (by using seek function)
►In practice, we do not have this information and we must search for the
record containing this key value
►If the file is not sorted by the key value we may have to look at every
possible record before we find the desired record
►An alternative to this is to maintain the file sorted by key value and use
binary searching
Binary Search

Binary search. Given a value and a sorted array a[], find index i such that a[i] = value, or report that no such index exists.

Invariant. The algorithm maintains a[lo] ≤ value ≤ a[hi].

Ex. Binary search for 33.

 6 13 14 25 33 43 51 53 64 72 84 93 95 96 97
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
 lo                                        hi
Binary search
int binarySearch(int arr[], int value, int left, int right) {
    while (left <= right) {
        int middle = left + (right - left) / 2;  // avoids overflow of left + right
        if (arr[middle] == value)
            return middle;
        else if (arr[middle] > value)
            right = middle - 1;
        else
            left = middle + 1;
    }
    return -1;  // not found
}

• Worst case performance: O(log n)
• Best case performance: O(1)
• Average case performance: O(log n)
Binary Search Algorithm

#include<fstream>
#include<iostream>
#include<cstring>   // for strcmp (used in Equal/Greater below)
#include<string>    // for std::string (used in main)
using namespace std;

struct Student
{
char ID[5];
char Name[10];
char Address[10];
};
Binary Search Algorithm cont.

void readRecord(fstream & file, Student &rec, int rrn)   // rrn = relative record number
{
    file.seekg(rrn * sizeof(rec), ios::beg);
    file.read((char*)&rec, sizeof(rec));
}

bool Equal(char *key1, char *key2)


{
return (strcmp(key1, key2) == 0);
}

bool Greater(char *key1, char *key2)


{
return (strcmp(key1, key2) > 0);
}
Binary Search Algorithm cont.
int getFileLength(fstream & file)
{
file.seekg(0, ios::end);
int size = file.tellg();
return size;
}

void AddRecord(fstream & out)


{
Student record;
cout << "Enter ID: "; cin.getline(record.ID,5);
cout << "Enter Name: "; cin.getline(record.Name, 10);
cout << "Enter Addess: "; cin.getline(record.Address, 10);

out.write((char*)&record, sizeof(record));
}
Binary Search Algorithm cont.
bool BinarySearch(fstream & file, Student &rec, char* key){
int low = 0, high = getFileLength(file) / sizeof(rec) - 1;
int guess;
while (low <= high)
{
guess = (high + low) / 2;
readRecord (file, rec, guess);
if (Equal(rec.ID, key)) //rec.ID == key
return true;
if (Greater(rec.ID, key)) //rec.ID >key
high = guess - 1;
else
low = guess + 1;
}
return false;
}
Binary Search Algorithm cont.
int main()
{
string path = "d:\\test.txt";
fstream file(path); // default open mode is ios::out | ios::in (no ios::trunc)

Student st;
for (int i = 0; i < 3; i++)
AddRecord(file);

char k[5]; cin.getline(k,5);


if (BinarySearch(file, st, k))
{
cout << "The data of student with ID: " << k << endl;
cout << st.Name << endl;
cout << st.Address << endl;
}
else cout << "student not found" << endl;
file.close();
return 0;
}
Sample Run
(the file after entering 3 students' data)

The data of student with ID: 2222
omar
alex
Binary Search time complexity
►Sequential search: O(n)
►Binary search: O(log₂ n)
►If the file size is doubled, sequential search time is doubled, while binary search time increases by only 1 comparison.
Key sorting
►Suppose a file needs to be sorted, but it is too big to fit into main
memory.

►To sort the file, we only need the keys.

►Suppose that all the keys fit into main memory

►Idea
– Bring the keys to main memory plus corresponding RRN
– Do internal sorting of keys
– Rewrite the file in sorted order
Before sorting:
Step 1: read the file sequentially once (to read the keys into memory).
Step 2: sort the keynodes array.
After sorting:
Step 3: go through the records in key order (one seek per record, O(n) seeks) and write them, in sorted order, to a new file.
Step 4: write each record once (sequentially) to replace the original file.
Finally the file is sorted.
Disk cost of the previous operations
1► Read the file sequentially once (to read the keys into memory), then sort the array in memory.
2► Go through each record in random order (seeking) to write the file, sorted, into a new file.
3► Write each record once (sequentially) to replace the original one.

What is your suggestion to enhance the performance of the previous operations?
Indexing
Leave the file unchanged, and search only in the <key, RRN> file.
Self learning sites

www.quora.com
Quora: the best answer to any question.

Massive open online course (MOOC) providers:
http://ocw.mit.edu/courses/index.htm
MIT OpenCourseWare, from the Massachusetts Institute of Technology.

https://www.coursera.org/
Take the world's best courses, online.

https://www.edx.org/
Primary Key index
Lec# 9
Presented by Dr. Emad Nabil
Spring 2018
Introduction to Indexing
►Simple indexes use simple arrays.

►An index lets us impose order on a file without rearranging the file.

►Indexes provide multiple access paths to a file ─ multiple indexes (like a library catalog providing search by author, by book, and by title)

►An index can provide keyed access to variable-length record files


Introduction to Indexing
Types of indexes:
– Primary index: uses the primary key; there is only one.
– Secondary index: uses other fields; there may be more than one; useful in searching.
A sample recording file contents
Index of the recording file (variable-length records)
key = LABEL + ID; the index file references records in the data file
Introduction to Indexing
►Index is sorted (main memory)

►Records appear in file in the order they entered

►How to search for a recording with given LABEL ID?


– Binary search (in main memory) in the index: find LABEL ID, which leads us to the
referenced field
– Seek for record in position given by the reference field
Important issues
►How to make a persistent index
– write the index to a file

►How to guarantee that the index is an accurate reflection of the contents of the file after lots of additions, deletions and updates
– update the index file
Operations in order to maintain an Indexed File

New data file and new index file


1. Create an empty data file and empty index file.
2. Perform the operation(add/delete/update) on data file, then on index file
3. Save both the data and index files to secondary storage.

Reading data file and index file


1. Load the index file into memory before using it.
2. Perform the operation(add/delete/update) on data file, then on index file
3. Save both the data and index files to secondary storage.
Rewrite index from main memory
►When the data file is closed, the index in memory needs to be written
to the index file.

►An important issue to consider is what happens if the rewriting does not take place (e.g. power failures).
Rewrite index from main memory
►Keep a status flag stored in the header of the index file:
– if the status flag is "on", the index file is NOT up-to-date;
– if the status flag is "off", the index file is up-to-date.
►When changes are performed on the index residing in main memory, the status flag in the file is turned on.
►Whenever the index is written back from main memory, the status flag is turned off.
►If the program detects that the index is out-of-date, it calls a procedure that reconstructs the index from the data file.
1- Record addition

►This consists of
- inserting to the data file
- inserting a new record in the index.

►The rearrangement of the index consists of "sliding down" the records with keys larger than the inserted key and then placing the new record in the opened space: O(n).

►Note that this rearrangement is done in main memory (a fast process).
2- Record deletion

Delete from :

1-The Data file: using techniques for reclaiming space in files

2-The index file:

– Shift all records with keys larger than the key of the deleted record to
the previous position (in main memory),O(n) or

– Mark the index entry as deleted. O(1)


3- Record updating

Update key:
– update the data file
– update the index, then sort it; no change of offset in the index

Update fields other than the key:
– new record size = old record size:
  – update the data file
– new record size != old record size:
  – delete and insert (data file)
  – update the offset of the record in the index (no need to re-sort it)
4- Record searching

• Binary search in index file (array)

• Seek in data file, then read record.


Indexes too large to fit in main memory
►The indexes we have considered so far could fit into main memory.
►If this is not the case, we have the following problems:
– binary searching of the index file is done on disk, involving several "fseek()" calls
– index rearrangement (addition or deletion) requires shifting on disk

Solutions: tree-structured indexes (B-trees), hashed indexes, multilevel indexing.
Secondary indexes
Suppose that you are looking at a collection of recordings with the
following information about each of them:

– Identification Number
– Title
– Composer or Composers
– Artist or Artists
– Label (publisher)
A sample recording file contents
Secondary index vs. primary index:
– the secondary index is bound to the primary key, not to the byte offset;
– this means that the primary key index must also be searched.
4 operations on secondary indexes: search, insert, delete, update.
Secondary Key index
Lec# 10
Presented by Dr. Emad Nabil
Spring 2019
Introduction to Indexing
Types of indexes:
– Primary index: uses the primary key; there is only one.
– Secondary index: uses other fields; there may be more than one; useful in searching.
Secondary indexes
Suppose that you are looking at a collection of recordings with the
following information about each of them:

– Identification Number
– Title
– Composer or Composers
– Artist or Artists
– Label (publisher)
A sample recording file contents
Secondary index vs. primary index:
– the secondary index is bound to the primary key, not to the byte offset;
– this means that the primary key index must also be searched.
4 operations on secondary indexes: search, addition, delete, update.
1- Record Addition
►When adding a record, an entry must also be added to the secondary key index.
►Store the field in canonical (standard) form.
►There may be duplicates in secondary keys.
►Keep duplicates in sorted order of the primary key.
2- Record deletion
►Deleting a record implies removing all the references to the record in the primary index and in all the secondary indexes.
►This is too much rearrangement, especially if the indexes cannot fit into main memory.
Alternative way for record deletion
►Delete the record from the data file and the primary index file reference to it.

► Do not modify the secondary index files.

►When accessing the file through a secondary key, the primary index file will be
checked and a deleted record can be identified.

►This results in a lot of saving when there are many secondary keys

►The deleted records still occupy space in the secondary key indexes.
►If a lot of deletions occur, we can periodically clean up these deleted records from the secondary key indexes as well.
3- Record updating

Update PK key (new key size = old key size):
– update the data file
– update the PK index, then sort it; no change of offset in the index
– update the SK index

Update SK key:
– new key size = old key size:
  – update the data file
  – update the SK index, then sort it; no change of reference to the PK index
– new key size != old key size:
  – delete and insert (data file)
  – update the offset in the PK index
  – update the SK index and keep it sorted; no change of reference to the PK index

Update fields other than the key:
– new record size = old record size:
  – update the data file
– new record size != old record size:
  – delete and insert (data file)
  – update the offset of the record in the PK index (no need to re-sort it)
  – no change in the SK index
4-Retrieving data using combination of secondary keys
►Secondary key indexes are useful in allowing the following kinds of queries:
– Find all records with composer “BEETHOVEN” and title “Symphony No.9”
– Find all records with composer “BEETHOVEN”  secondary index #1
– Find all records with the title “Violin Concerto”  secondary index #2
Retrieving data using combination of secondary keys
►Use the matched list and the primary key index to retrieve the matching records (the full records) from the file.
Secondary key indexes
Disadvantage: a lot of redundancy, wasting space.
Any alternative structure?
Improving the secondary key structure: arrays of references
(each key holds an array of references, ordered by PK)

Advantage: no redundancy.
Any alternative structure?
Improving the secondary key structure: inverted lists
Organize the secondary key index as an index containing one entry for each key and a pointer to a linked list of references.

To delete COREA:
1. Check the label ID list file; there is only one binding.
2. We can simply delete COREA by placing -1 in the reference field of the secondary index file.
3. Mark the cell with index 2 as free (e.g. with #) in the label ID list file for future reuse.
4. Periodically clean up the secondary index file.
5. In case of new insertions in the label ID list file, check for an available free slot (marked with #) before inserting at the end.
Improving the secondary key structure: Inverted list
Organize the secondary key index as an index containing one entry for each key
and a pointer to a linked list of references

To delete BEETHOVEN:
1. Check the label ID list file; there are 4 bindings. The user should select which one to delete, or choose to delete all.
2. Suppose the user selects to delete the first binding: mark that cell as free using, e.g., a (#) marker.
3. Change the reference in the secondary index file to be 8.
4. Periodically clean up the secondary index file.
5. In case of new insertions in the label ID list file, check for an available free slot (marked with #) before inserting at the end.
Advantages of inverted lists
►Rearrangement of the secondary key index file is only done when a new composer's name is added or an existing composer's name is changed.
►Deleting all records by a composer can be done by placing a "-1" in the reference field in the secondary index file (periodically clean up the label ID list file).
►Deleting or adding records for a composer only affects the LABEL ID list file.
►Rearrangement of the secondary index file is quicker since it is smaller.
►The LABEL ID list file never needs to be sorted since it is entry-sequenced.
►We can easily reuse space from deleted records in the LABEL ID list file since its records have fixed length.
Disadvantages of inverted lists
►Loss of "locality": labels of recordings with the same secondary key are not contiguous in the LABEL ID list file (seeking).
►To improve this, keep the LABEL ID list file in main memory.
Selective Indexes
►Selective index: an index on a subset of the records.
►A selective index contains only part of the entire index:
– provides a selective view
– useful when the contents of a file fall into several categories,
  e.g. 20 < Age < 30 and $1000 < Salary
Binding
►In our example of indexes, when does the binding of the index to the
physical location of the record happen?

– For the primary index, binding is at the time the file is constructed.
– For the secondary index, it is at the time the secondary index is used.

Q: Why don't we bind the secondary index to the physical location directly, instead of to the primary index?
Advantages of postponing binding
►Only a small amount of reorganization is needed when records are added, deleted or updated.
►It is a safer approach: important changes are done in one place rather than in many places.
Disadvantages of postponing binding
►It results in slower access times
(Binary search in secondary index
+
Binary search in primary index)
Tight Binding vs. Bind-at-retrieval
1►Use Tight Binding
– When data file is nearly static
(little or no adding, deleting or updating of records)
– When rapid retrieval performance is essential.
Example: Data stored in CD-ROM should use tight binding

2►Use Bind-at-retrieval
– When record additions, deletions and updates occur many times
Simple Primary Key
Implementation
Lec11
Spring 2019
Dr Emad Nabil
Code specification
• In this lecture, an implementation of a primary index is presented.
// Data record
struct student
{
    char ID[6];
    char Name[51];
    char Address[51];
};

// Index record
struct IndexRecord
{
    char PK[6];
    int offset;
};
• The data record structure: delimited fields, length-indicator records.
• Deletion => just mark the beginning of the deleted record with *, then delete the corresponding record in the index.
  Ex: 20*9|Ali Ali|Giza,123.Str.|
• Addition => always append at the end.

Code specification
• The index max size = 100.
• The data file has a header containing an index status byte:
  • false => the index is up-to-date
  • true  => the index is not up-to-date
• If the index is not up-to-date => reconstruct the index from the data file.
• When the program starts, it should load the index into memory.
• When the program terminates normally, it should save the index to the index file and mark the status byte as false.
• In case of abnormal termination => the status byte is left marked true.
Data File (not sorted)
ID    Name        Address
100   Ahm Ahm     Giza,123.Str.
99    Ali Ali     Giza,123.Str.
98    Omar Omar   Giza,123.Str.
101   Hany Hany   Giza,123.Str.
97    Moh Moh     Giza,123.Str.

Index File (sorted)
ID    Offset
97    0
98    --
99    --
100   ---
101   ----
Hint:
1. The code included in the PowerPoint works only on Visual Studio.
2. The attached source code works on standard C++ (if you don't use Visual Studio, use the code attached to this lecture).
class SimpleIndex    // This code is tested using MS Visual Studio 2015
{
    int headerSize;
    int next;        // next free slot in index
    struct student
    { char ID[6]; char Name[51]; char Address[51]; };
    struct IndexRecord
    { char PK[6]; int offset; };
    const static int indexSize = 100;
    IndexRecord index[indexSize];
    char dataFilePath[20] = "D:\\dataFile.txt";
    char indexFilePath[20] = "D:\\indexFile.txt";
public:
    SimpleIndex();
    int IndexBinarySearch(char key[]);
    void printStudent();
    void insertStudent();
    void deleteStudent();
    void saveIndex();
    void loadIndex();
    void ReconstructIndex();
    void sortIndex();
    void ChangeIndexState(bool state);
    bool getIndexState();
    bool exists(char name[20]);
};
SimpleIndex::SimpleIndex()
{
next = 0;
headerSize = 1;
}
bool SimpleIndex::exists(char name[20] )
{
ifstream f(name);
if (f.fail())
{
return false;
}
else
{
f.close();
return true;
}
}
void SimpleIndex::sortIndex() //bubble sort
{
int len = next - 1;
IndexRecord temp;
for (int i = 0; i<len; i++)
for (int j = 0; j<len - i; j++)
{
if (atoi(index[j].PK)>atoi(index[j + 1].PK))
{
temp = index[j];
index[j] = index[j + 1];
index[j + 1] = temp;
}
}
}
int SimpleIndex::IndexBinarySearch(char key[])
{

int size = next;


int low = 0, high = size - 1;
while (low <= high)
{
int middle = (low + high) / 2;
if (strcmp(index[middle].PK, key) == 0)
return middle;
else if (atoi(index[middle].PK)<atoi(key))
low = middle + 1;
else
high = middle - 1;
}
return -1;
}
// Steps: 1. Change state of index  2. Read student data
//        3. Write student to data file  4. Add an entry to the index
//        5. Sort index
void SimpleIndex::insertStudent()
{
    ChangeIndexState(true);
    ofstream fout; fout.open(dataFilePath, ios::app | ios::out | ios::binary);

    student s;
    cout << "\nEnter ID: ";      cin.getline(s.ID, 5);
    cout << "\nEnter Name: ";    cin.getline(s.Name, 50);
    cout << "\nEnter Address: "; cin.getline(s.Address, 50);

    char buffer[200];
    strcpy_s(buffer, s.ID);      strcat_s(buffer, "|");
    strcat_s(buffer, s.Name);    strcat_s(buffer, "|");
    strcat_s(buffer, s.Address); strcat_s(buffer, "|");
    short len = strlen(buffer);

    fout.seekp(0, ios::end); int address = fout.tellp();
    fout.write((char*)&len, sizeof(len)); fout.write(buffer, len);

    IndexRecord temp; strcpy_s(temp.PK, s.ID); temp.offset = address;
    index[next] = temp; next++;

    sortIndex(); fout.close();
}
// Steps: 1. Change state of index  2. Read student ID to be deleted
//        3. Offset = binary search on index  4. Delete student from the data file
//        5. Update index
void SimpleIndex::deleteStudent()
{
    ChangeIndexState(true);
    char PK[6]; cout << "\nenter PK to delete \n"; cin.getline(PK, 5);

    fstream fout(dataFilePath, ios::binary | ios::out | ios::in);

    int address = IndexBinarySearch(PK);
    if (address == -1)
    { cout << "\nStudent not found"; return; }

    fout.seekp(index[address].offset + 2, ios::beg); fout << '*';

    for (int i = address; i < next - 1; i++)  // shift up to delete from index
        index[i] = index[i + 1];
    next--;

    fout.close();
}
// Steps: 1. Read PK  2. Offset = binary search on index
//        3. Read student from the data file
void SimpleIndex::printStudent()
{
    char PK[6]; cout << "\n enter PK you want to display \n"; cin.getline(PK, 5);

    ifstream fin; fin.open(dataFilePath, ios::binary | ios::in);

    int address = IndexBinarySearch(PK);
    if (address == -1)
    { cout << "\nthis student not exists"; return; }

    short len; student s;
    fin.seekg(index[address].offset, ios::beg); fin.read((char*)&len, sizeof(len));

    char *buffer = new char[len];
    fin.read(buffer, len); istringstream stream(buffer);

    stream.getline(s.ID, 5, '|'); stream.getline(s.Name, 50, '|');
    stream.getline(s.Address, 50, '|');

    cout << endl << "ID: " << s.ID << " Name: " << s.Name << " Address: " << s.Address;
    fin.close(); delete[] buffer;
}
void SimpleIndex::saveIndex()
{
ofstream fout(indexFilePath, ios::trunc);
for (int i = 0; i<next; i++)
{
IndexRecord temp = index[i];
fout.write((char*)&temp, sizeof(temp));
}
fout.close();
ChangeIndexState(false);// index is up to date
}
void SimpleIndex::loadIndex()
{
    if (!exists(dataFilePath))
    {   // data file does not exist => create it, write 0 (up-to-date) in the header
        next = 0; ofstream fout;
        fout.open(dataFilePath, ios::app | ios::out | ios::binary);
        fout << 0; fout.close();
    }
    else if (!getIndexState())
    {   // data file exists and the index is up-to-date => load it from the index file
        ifstream fin(indexFilePath); next = 0;
        while (true)
        {
            IndexRecord temp;
            fin.read((char*)&temp, sizeof(temp)); if (fin.eof()) break;
            index[next] = temp; next++;
        }
        fin.close();
    }
    else ReconstructIndex();   // data file exists but the index is out-of-date
}
void SimpleIndex::ReconstructIndex()
{
    ifstream fin(dataFilePath, ios::binary); next = 0;
    fin.seekg(headerSize, ios::beg); // skip header
    while (true)
    {
        IndexRecord temp;
        int offset = fin.tellg();
        short len; fin.read((char*)&len, sizeof(len)); if (fin.eof()) break;
        char *buffer = new char[len + 1];
        fin.read(buffer, len); buffer[len] = '\0';
        if (buffer[0] == '*') { delete [] buffer; continue; } // skip deleted records
        istringstream strbuf(buffer);
        strbuf.getline(temp.PK, 5, '|'); temp.offset = offset;
        index[next] = temp;
        next++;
        delete [] buffer; // the buffer only lives for this iteration
    }
    fin.close();
    sortIndex();
}
void SimpleIndex::ChangeIndexState(bool state)
{
fstream fout(dataFilePath, ios::out| ios::in);
fout.seekp(0, ios::beg);
fout << state;
fout.close();
}
bool SimpleIndex::getIndexState()
{
    bool state;
    ifstream fin(dataFilePath);
    fin.seekg(0, ios::beg);
    fin >> state;
    fin.close();
    return state;
}
int main()
{
    SimpleIndex i;
    i.loadIndex();
    for (int n = 0; n < 5; n++)
        i.insertStudent();
    for (int n = 0; n < 5; n++)
        i.printStudent();
    i.deleteStudent();
    for (int n = 0; n < 5; n++)
        i.printStudent();
    i.saveIndex();
    system("pause");
    return 0;
}
Execution snapshots (console sessions)
• Insertion: five students are entered (IDs 100, 99, 98, 101, 97), each with a name
  and the address "Giza,123.Str.".
• Printing: entering an existing PK (97, 98, 100, 101) displays its
  "ID: ... Name: ... Address: ..." line; a PK that was never inserted (102) prints
  "this student not exists".
• Deletion: after PKs 99 and 98 are deleted, printing them reports
  "this student not exists", while the remaining students still display correctly.
Data File and Index File (hex-dump screenshots)
First index record: PK[6] = 6 chars + 2 padding bytes, followed by the offset
(int = 4 bytes); e.g. offset 115 decimal = 0x73.
How to calculate the size of a struct?
struct X
{
    char a[3];
    short int b;
    long int c;
    char d[3];
};
int main()
{
    X i;
    cout << sizeof(i); // prints 16
}
Padding rules
The following typical alignments are valid for compilers from Microsoft (Visual C++), Borland/CodeGear (C++Builder), Digital Mars
(DMC), and GNU (GCC) when compiling for 32-bit x86:
•A char (one byte) will be 1-byte aligned.
•A short (two bytes) will be 2-byte aligned.
•An int (four bytes) will be 4-byte aligned.
•A long (four bytes) will be 4-byte aligned.
•A float (four bytes) will be 4-byte aligned.
•A double (eight bytes) will be 8-byte aligned on Windows and 4-byte aligned on Linux (8-byte with -malign-double compile time option).
•A long long (eight bytes) will be 4-byte aligned.
•A long double (ten bytes with C++Builder and DMC, eight bytes with Visual C++, twelve bytes with GCC) will be 8-byte aligned with C++Builder, 2-byte aligned with DMC, 8-byte aligned with
Visual C++, and 4-byte aligned with GCC.
•Any pointer (four bytes) will be 4-byte aligned. (e.g.: char*, int*)

The only notable differences in alignment for a 64-bit system compared to a 32-bit system are:
•A long (eight bytes) will be 8-byte aligned.
•A double (eight bytes) will be 8-byte aligned.
•A long long (eight bytes) will be 8-byte aligned.
•A long double (eight bytes with Visual C++, sixteen bytes with GCC) will be 8-byte aligned with Visual C++ and 16-byte aligned with GCC.
•Any pointer (eight bytes) will be 8-byte aligned.

Rule: the size of the structure should be a multiple of the largest alignment of any structure member.
struct X {       // size 16, 4-byte aligned
    char a[3];
    short int b;
    long int c;
    char d[3];
};

After compilation the data structure will be supplemented with padding bytes to ensure a proper
alignment for each of its members.

Padding is performed using char arrays:

struct X {
char a[3]; char p1[1];
short int b; char p2[2];
long int c;
char d[3]; char p3[1];
};
struct ex        // size 8, 4-byte aligned
{
    float x;
    char n[1];
};

struct ex        // size 32: the size of the class instance and its data members
{
    string s1;
};

struct X         // size 3, 1-byte aligned
{
    char c[3];
};

struct ex        // size 6, 2-byte aligned
{
    short s;
    char n[3];
};

struct MixedData // size 12, 4-byte aligned
{
    char Data1;
    short Data2;
    int Data3;
    char Data4;
};
struct ex        // size 8, 4-byte aligned
{
    int b;
    char c[4];
};

struct X         // size 16, 8-byte aligned (a pointer is 8 bytes on 64-bit)
{
    char *c;
    float d;
    int x;
};
Hashing
Lec12- spring 2019
PRESENTED BY: DR EMAD NABIL
Lecture Overview

• Introduction to Hashing
• Hash functions
• Distribution of records among addresses
• synonyms and collisions
• Collision resolution by progressive overflow or linear probing
Motivation
• Hashing is a useful searching technique which can be used for implementing
indexes.

• The main motivation for Hashing is improving searching time.

• Below we show how the search time for hashing compares to that of other methods:
  • binary search: O(log2 N)
  • hashing: O(1)
What is Hashing ?
• The idea is to discover the location of a key by simply examining the
key.
• For that we need to design a hash function.

• A Hash Function is:
  • a function h(k) that transforms a key into an address:

        key --> h --> address

• An address space is chosen beforehand.
• For example we may decide the file will have 1000 available addresses.
• If U is the set of all possible keys, the hash function maps U to
  {0, 1, ..., 999}:

        Address space: 0 .. 999
Hashing Example
Collision

• LOWELL, LOCK, OLIVER, and any word with first two letters L and O will be
mapped to the same address
h(LOWELL) = h(LOCK) = h(OLIVER) = 4
• These keys are called synonyms
• The address "4" is said to be the home address of any of these keys.

• Two different keys may be sent to the same address generating a


Collision
Collision resolution
• Avoiding collisions is extremely difficult
• Ways of reducing collisions
1. Spread out the records by choosing a good hash function

2. Use extra memory, i.e. increase the size of the address space.
ex: reserve 5000 available addresses rather than 1000

3. Put more than one record at a single address use of buckets


A good hash function should ensure that:
 Addresses generated from the keys are uniformly and randomly distributed.
 The hashing function minimizes collisions.
1. Division Method
2. Multiplication Method
3. Extraction Method
4. Mid-Square Hashing
5. Folding Technique
6. Rotation
7. Universal Hashing

 One of the required features of the hash function is that the resultant index
  must be within the table index range.

 One simple choice for a hash function is to use modulus division, indicated as
  MOD (the operator % in C/C++).

 The function returns an integer.

 If any parameter is NULL, the result is NULL.

 Hash(Key) = Key % m        (e.g. m = 307)
  m is the hash table size
 The multiplication method works as:
1. Multiply the key k by a constant A in the range 0 < A < 1.
2. Extract the fractional part of k*A.
3. Multiply this value by m and take the floor of the result:

   h(k) = floor(m * (k*A mod 1))

where (k*A mod 1) denotes the fractional part of k*A, i.e. k*A − floor(k*A).

The optimal choice of A depends on the characteristics of the data being hashed.
Knuth recommends A = (√5 − 1)/2 ≈ 0.618.

m is the hash table size.
2. Multiplication Method Example
 When a portion of the key is used for the address calculation, the technique is
  called the extraction method.

 In digit extraction, a few digits are selected and extracted from the key and
  used as the address (below, digits 1, 3, and 5 of each key):

  Key       Hashed Address
  345678    357
  234137    243
  952671    927
 The mid-square hashing suggests to take square of the key and extract the middle digits of the
squared key as address
 The difficulty is when the key is large. As the entire key participates in the address calculation,
if the key is large, then it is very difficult to store the square of it as the square of key should not
exceed the storage limit
 So mid-square is used when the key size is less than or equal to 4 digits

Key Square Hashed Address


2341 5480281 802
1671 2792241 922

The difficulty of storing the square of a larger key can be overcome if we square
only a few digits of the key instead of the whole key.
We can select a portion of the key if the key is large, and then square that portion.

Keys and addresses obtained by extracting a few digits, squaring them, and again
extracting the middle digits:

  Key       Square                 Hashed Address
  234137    234 x 234 = 054756     475
  567187    567 x 567 = 321489     148
5. Folding Technique

 The key is divided into subparts that are then combined (folded) and added;
  the size of the subparts of the key can be the same as that of the address.
5. Folding Technique
To compute this hash function apply 3 steps:
• Step 1. Transform the key into a number.
• Step 2. Fold and add, and take the mod by a prime number.
• Step 3. Divide by the size of the address space (preferably a prime number);
  dividing by a number that has many small factors may result in lots of collisions.
6. Rotation
 When keys are serial, they vary only in the last digit, and this leads to the
  creation of synonyms.
 Rotating the key minimizes this problem. This method is used along with other
  methods.
 Here, the key is rotated right by one digit, and then the use of folding avoids
  synonyms.

 For example, let the key be 120605; when it is rotated we get 512060.
 Then the address is calculated using any other hash function.
Some Other Hashing Methods

• Radix Transformation
• Transform the number into another base and then divide
by the maximum address.
If Hash(Key1) = Hash(Key2), then Key1 and Key2 are synonyms and a collision
happens.

Consider that the hash value is the RRN and that we are working on fixed-length
records.
Distribution of Records among Addresses

• Uniform distributions are extremely rare.


• Random distributions are acceptable and more easily obtainable.
1. Open addressing
The first collision resolution method, open addressing, resolves collisions in the home area. When a collision
occurs, the home area addresses are searched for an open or unoccupied element where the new data can be
placed. Examples of Open Addressing Methods:
1.1. Linear probing or progressive overflow
1.2. Quadratic probing
1.3. Double hashing

2. Bucket hashing (defers collision but does not prevent it)


3. Separate chaining
4. Separate chaining with overflow area

1.1. open addressing
Progressive Overflow or linear probing

H = F(key)
is the home address. If it is available we store the record, otherwise, we increase H by k,
H = (H + k) mod tableSize, (k ≥1)
Collision Resolution: Progressive Overflow
• Advantage
• Simplicity
• Disadvantage
• If there are lots of collisions, clusters of records can form as in the previous
example.
1.2. Quadratic Probing
H = F(key)
H = (H + i²) % tablesize, i ≥ 1

• If there is a collision at hash address H, this method probes/explores the table
  at locations H+1, H+4, H+9, ..., that is, at locations (H + i²) mod tablesize
  for i = 1, 2, ....
• That is, the increment function is i².
• Quadratic probing substantially reduces clustering, but it will not necessarily
  probe/explore all locations in the table.
1.3. Double Hashing

H = F1(key)      → computes the home address
Step = F2(key)   → computes the probe step
Table size = M
H = (H + Step) % M; repeat this until we find a place or we reach the starting
point again.

Double hashing represents an improvement over linear or quadratic probing.

Double hashing uses nonlinear probing by computing different probe increments for
different keys. It uses two functions.

The first function computes the original address; if the slot is available (or the
record is found) we stop there. Otherwise, we apply the second hashing function to
compute the step value.
Hashing
part2
Lec13- spring 2018
PRESENTED BY: DR EMAD NABIL
Lecture Overview

• Introduction to Hashing
• Hash functions
• Distribution of records among addresses
• synonyms and collisions
• Collision resolution by progressive overflow or linear probing
1. Open addressing
The first collision resolution method, open addressing, resolves collisions in the home area. When a collision occurs,
the home area addresses are searched for an open or unoccupied element where the new data can be placed.
Examples of Open Addressing Methods:
1.1. Linear probing
1.2. Quadratic probing
1.3. Double hashing

2. Bucket hashing (defers collision but does not prevent it)


3. Chained progressive overflow
4. Chaining with separate overflow area
5. Hashed index (scatter table)

Hashing Example (address space 0 .. 999)

Offset of LOWELL = 4 * Record_Size
2. Hashing with buckets

► This is a variation of hashed files in which more than one record/key is stored
  per hash address.

► Bucket = block of records corresponding to one address in the hash table.

► The hash function gives the Bucket Address.
2. Hashing with buckets

Collision resolution with buckets and the linear probing (progressive overflow)
method.
Note: A2=B2=C2
Implementation issues
• Mark free slots using a certain marker:  /// /// ///
• Mark deleted slots using a tombstone for future usage:  ###
• In every bucket, there should be a counter that counts the number of used slots.
• Be careful of an infinite loop if the whole table is full (buckets with
  progressive overflow).

Dealing with tombstones:
1. Shifting records that follow a tombstone to their home addresses.
2. Complete reorganization (rehashing).
3. Using a different collision resolution technique.
3. Chained progressive overflow

(Slide figures: a table with Data and Next columns over home addresses 20-25,
filled in step by step as records are inserted and chained to their synonyms.)
The key behind the enhancement: grouping keys with the same home address into one
cluster (chain).

Cluster 1 (home address 20): Adams, Coles, Flint
Cluster 2 (home address 23): Bates, Dean
A problem in the previous technique

(Slide figures: the same Data/Next table over addresses 20-25, filled with the
insertion sequence (20)Adams, (20)Coles, (22)Dean, (20)Flint.)

Did you notice the problem?? The two clusters overlap: the home address of Dean
(22), which should be the head of its own cluster, is not free; it is already
occupied by an overflow record of the cluster that starts at address 20.
4. Chaining with separate overflow area

Bucket size = 1 here, but there may be more than one slot per bucket.

5. Hashed index (scatter table)

The index maps each hash value to the record's offset inside the file.
Applicable to variable-length records.
Patterns of record access
Introduction to Multilevel Indexing
Lec. 14
Dr. Emad Nabil
Spring 2019
Lecture Overview

• Introduction to multilevel indexing and B-trees


• Insertions in B-trees
Motivations
Problems with simple indexes that are kept on disk
(in case the index is too large to be loaded into memory):

• Seeking the index is still slow (binary searching):
  we don't want more than 3 or 4 seeks for a search,
  so log2(N+1) seeks is still slow.

• Insertions and deletions should be as fast as searches:
  in simple indexes, insertion or deletion takes O(N) disk accesses
  (since the index should be kept sorted) because of shifting.
Solution 1

A sorted list can be expressed in a Binary Search Tree (BST) representation.

Binary search tree:
• we no longer have to sort the index
• Worst case search with a balanced binary tree is log2(N + 1) compares.
• Worst case search with an unbalanced binary tree is close to O(N) compares.
→ Too many seeks.

Operation   Average    Worst case
Search      O(log n)   O(n)
Insert      O(log n)   O(n)
Delete      O(log n)   O(n)

(Figures: an unbalanced BST vs. a balanced BST.)
Solution 2
AVL Trees (Georgy Adelson-Velsky and Evgenii Landis' tree, named after the inventors)

• The AVL tree is a self-balancing binary search tree.
• In an AVL tree, the heights of the two child subtrees of any node differ by at
  most one.
→ Still too many seeks.

Operation   Average    Worst case
Search      O(log n)   O(log n)
Insert      O(log n)   O(log n)
Delete      O(log n)   O(log n)
Solution 3
Multilevel Indexing

• Build an index of an index file


• How
• Build a simple index for the file.
• Build an index for this index
• Build another index for the previous index, and so on.
Multilevel Indexing

(Slide figure: each index level stores keys with offsets into the level below;
the lowest level's offsets point into the data file.)

Insert a record with a key smaller than all existing keys → all entries must be
shifted.

While multi-record multi-level indexes really help reduce the number of disk
accesses and their overhead space costs are minimal, inserting a new key or
deleting an old one is very costly.
The problem of indexes in insertion and deletion is shifting.
Solution 4
B-Trees: An Overview

• B-trees do not need re-balancing (self-balancing tree)

• B-Trees are multi-level indexes that solve the


problem of linear cost of insertion and deletion.

• B-Trees are now the standard way to represent


indexes.
Formal Definition of B-Tree Properties (very important!)
• In a B-Tree of order m,
  • Every page has a maximum of m descendants.
  • Every page, except for the root and leaves, has at least ⌈m/2⌉ descendants.
  • The root has at least two descendants (unless it is a leaf).
  • All the leaves appear on the same level.
  • The leaf level forms a complete, ordered index of the associated data file.
• A B-tree is a self-balancing tree.
Example of B tree of order (5)
Tree Definitions
1. Each node is reachable from the root by a unique sequence of arcs called a path.
2. The number of arcs in a path is the length of the path.
3. The level of a node is the length of the path from the root to the node + 1.
4. The height of a non-empty tree is the maximum depth of a node in the tree.
5. The height of an empty tree is zero.

(Figure annotations: Level 1 = depth 0, Level 2 = depth 1, Level 3 = depth 2,
Level 4 = depth 3; height = max #edges = 3; the longest path length = 3.)
The worst-case depth of a B-tree

Minimum number of descendants emanating from each level, when every node is
minimally full:
Level 1 (the root): at least 2 descendants
Level 2: at least 2·⌈m/2⌉ descendants
Level 3: at least 2·⌈m/2⌉² descendants
...
Level d: at least 2·⌈m/2⌉^(d-1) descendants

The worst-case depth occurs when every node contains the minimum number of
children. For a tree with N elements where every node has the minimum number of
children, the maximum number of levels is d ≤ 1 + log⌈m/2⌉((N + 1)/2).
B Tree
Lec 15
Dr Emad Nabil
Spring 2018
Formal Definition of B-Tree Properties
• In a B-Tree of order m,
• Every page has a maximum of m descendants
• Every page, except for the root and leaves, has at least ⌈m/2⌉
descendants.
• The root has at least two descendants (unless it is a leaf).
• All the leaves appear on the same level.
• The leaf level forms a complete, ordered index of the associated
data file.
• B-Tree is self-balancing
Insertion in B-Tree
• To insert a new element, search for the leaf node where the new element
should be added.

• If the node contains fewer elements than the maximum capacity. Insert this
element into it in order.

• If the node is full, split it into two nodes such that

o A single median is chosen from among the leaf's elements and the new element.

o Values less than or equal the median are put in the new left node and values greater
than the median are put in the new right node.

o The median value is inserted in the node's parent, which may cause it to be split, and
so on. If the node has no parent (i.e., the node was the root), create a new root
above this node (increasing the height of the tree).
Deletion in B-Tree
Deletion from a node
Search for the value to delete
If the value is in a leaf node, simply delete it from the node.
If underflow happens, re-balance the tree as described in "Re-balancing after
deletion".
Rebalance after deletion
o Let's call the node containing the deleted element the "target node".
o If the left sibling of the target node has more than the minimum number of
elements, move the largest element from the left sibling to the target node. Update
the parent if needed. The tree is now balanced.
o Else if the right sibling of the target node has more than the minimum number of
elements, do the above step with the right sibling. The tree is now balanced.

o If both left and right siblings contain exactly the minimum number of elements.
 If there is a left sibling, move all remaining elements from the target node to the left sibling.
The target node now is empty and the left sibling is full.
 Else, do the above step with the right sibling.
Notes
• If order in a B-tree is m then
• Min number of nodes in root = 2
• Min number of elements in any other node = 𝑚/2
• In case of overflow you should propagate the median to the parent node; in this
case Median = (m + 1)/2.
Example2: Insert the following
sequence into B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4

3 4 19 20

Insert 1

1 3 4 19 20

split

4 20

1 3 4 19 20
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11 4 20
Order m =4

1 3 4 19 20

4 20
Insert 13

1 3 4 13 19 20
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11 4 20
Order m =4

1 3 4 13 19 20

Insert 16

4 20

1 3 4 13 16 19 20
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4 4 20

1 3 4 13 16 19 20

Insert 9

4 16 20

1 3 4

9 13 16 19 20
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11 4 16 20

Order m =4

1 3 4

9 13 16 19 20
Insert 2

4 16 20

1 2 3 4

9 13 16 19 20
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4 4 16 20

1 2 3 4

Insert 23 19 20
9 13 16

4 16 23

1 2 3 4 Parent is
updated

9 13 16 19 20 23
4 16 23
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4
1 2 3 4

9 13 16 19 20 23
Insert 14

4 16 23

1 2 3 4

9 13 14 16 19 20 23
Insert the following numbers in B-tree 4 16 23
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4
1 2 3 4 19 20 23

9 13 14 16

Insert 7

4 13 16 23

1 2 3 4

19 20 23
division
7 9 13 14 16

7 9 13 14 16
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4

4 13 16 23

19 20 23
1 2 3 4

7 9 13 14 16

Insert 21

4 13 16 23

1 2 3 4
19 20 21 23

7 9 13 14 16
Insert the following numbers in 4 13 16 23
B-tree: 3,19, 4, 20, 1, 13, 16, 9,
2, 23, 14, 7, 21, 18, 11
Order m =4
1 2 3 4
19 20 21 23
7 9 13
14 16

Insert 18

4 13 16 23

20 18 19 20 21 23
1 2 3 4
division

18 19 20 21 23

7 9 13 14 16
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4

16 23

4 13 16 20 23

division

4 13 16 20 23
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4

16 23

4 13 16 20 23

1 2 3 4

18 19 20 21 23

7 9 13 14 16
Insert the following numbers in B-tree
3,19, 4, 20, 1, 13, 16, 9, 2, 23, 14, 7, 21, 18, 11
Order m =4

16 23

Insert 11 4 13 16 20 23

1 2 3 4

18 19 20 21 23

7 9 11 13 14 16
Deleting a node from B tree
16 23

Delete 21 4 13 16 20 23

1 2 3 4

18 19 20 21 23

7 9 11 13 14 16
16 23

Delete 21 4 13 16 20 23

1 2 3 4

18 19 20 21 23

7 9 11 13 14 16
Underflow,
borrow
element from
left sibling
Update
parent
16 23

Delete 21 4 13 16 19 23

1 2 3 4

18 19 20 23

7 9 11 13 14 16

DONE
16 23

Delete 20 4 13 16 19 23

1 2 3 4

18 19 20 23

Underflow, can’t
7 9 11 13 14 16 borrow element from
left sibling, there is no
right sibling, solution=>
merge with left sibling
Underflow ,
borrow from
left sibling
16 23

Delete 20 4 13 16 23

1 2 3 4

18 19 23

7 9 11 13 14 16
13 23

Delete 20 4 13 16 23

1 2 3 4

18 19 23

7 9 11 13 14 16

DONE
Parent is
updated Parent is
updated

23
13
19

Delete 23 4 13 23
16
19

1 2 3 4

18 19 23

7 9 11 13 14 16
Try this @ home
Insert these numbers into B-tree

3, 12, 5, 6, 17, 20, 21, 2, 14, 23, 8, 7

Order m = 3
Min number of nodes = 3/2 = 2
Median in case of overflow = 4/2 = 2

After completion of insertion delete nodes 23, 17, 14,3


Final solution after insertion

Final solution after deletion


Implementation issues

The complete index is found in the last level.
Advice from Thomas Cormen,
co-author of the Introduction to Algorithms book
• Do not fool yourself into thinking you understand material that you do not
really and truly fully understand. That was my biggest mistake in my first
math course. Deep down you know whether you understand something. It
is easy to read something and tell yourself that, yeah, you understand
it...when you don’t.

• You were probably expecting me to say something about which courses to


take. OK, take the introductory CS courses that you need, some math
(especially linear algebra), and writing. You're not going to write only code
for the rest of your life. Learn how to communicate with yourself and
others.
Buffer Management
Lec#16
Presented by Dr. Emad Nabil
Spring 2018
Contents
• A journey of a byte
• Buffer Management
A journey of a byte
► Suppose in our program we wrote: outfile << c;
► This causes a call to the file manager
(a part of the O.S. responsible for I/O operations)
► The O/S (File manager) makes sure that the byte is written to the disk.
► Pieces of software/hardware involved in I/O:
1– Application Program
2– Operating System/ file manager
3– I/O Processor
4– Disk Controller
A journey of a byte
► Suppose in our program we wrote: outfile << c;

1–
2–file 3– I/O 4– Disk
Application
manager Processor Controller
Program
A journey of a byte
A journey of a byte
1►Application program
– Requests the I/O operation

2► File manager (Part of the OS that deals with I/O)


• Checks whether the operation is permitted
• Locates the physical location where the byte will be stored
(Drive, Cylinder, Track & Sector)
• Finds out whether the sector to locate the ‘P’ is already in memory
(if not, load it into memory I/O buffer)
• Puts ‘P’ in the I/O Buffer
• Keep the sector in memory to see if more bytes will be going to the
same sector in the file.
A journey of a byte
3► I/O Processor: a separate chip; runs independently of CPU
• Wait for an external data path to become available for data transfer
(CPU is faster than data-paths ==> Delays)

• I/O Processor asks the disk controller if the disk drive is available for
writing

4► Disk Controller: A separate chip; instructs the drive to move R/W head
Disk Controller instructs the disk drive to move its read/write head to the
right track and sector.

• Disk spins(rotates) to right location and byte is written


A journey of a byte-1

* File manager Locates the physical location


* File manager Puts ‘P’ in the I/O system’s output Buffer
A journey of a byte-2

• The FM asks the I/O Processor to store the data in the I/O buffer
• I/O Processor asks the disk controller if the disk drive is available for writing
• Disk Controller instructs the disk drive to move its read/write head to the right
track and sector.
A journey of a byte

Suppose in our program we wrote: TEXT << c;

(Slide figure: the byte's path from the program's data area, through the system
I/O buffer, to the disk.)
Buffer Management
► Buffering means working with large chunks of data in main memory so the
number of accesses to secondary storage is reduced.

► The OS has system I/O buffers, managed by the OS.

► The application program has a program I/O buffer, managed by the application.


Buffer Management
Buffer Management while (1) {
infile >> ch; // 1
if (infile.fail()) break; // 2
outfile << ch; //3
}
• Suppose that the next character to be read from infile is physically stored in sector X
of the disk.
• Suppose that the place to write the next character to outfile is sector Y

With a single buffer


• When line 1 is executed sector X is brought to the buffer and ch receives the correct
character.
• When line 3 is executed sector Y must be brought to the buffer, and ch is deposited
to the right position.
• Now line 1 is executed again. Suppose we have not yet reached the end of sector
  X; then sector X must be brought into the buffer again, so the current content
  of the buffer must be written to sector Y before this is done. And so on.
Buffer Management
• This could be solved if there were more buffers available.

• Most operating systems have an input buffer and an output buffer.

• A new trip to get a sector of infile would only be done after all the
bytes in the previous sector had been read.
Buffering Strategies

1► Double Buffering
• Suppose a program needs to write some data to the HD:
  the program fills buffer 1 → buffer 1 is written to the HD;
  while buffer 1 is being written to the HD, the program fills buffer 2 →
  buffer 2 is written to the HD; and so on.
• Two buffers can be used to allow processing and I/O to overlap.
• When both tasks are finished, the roles of the buffers can be exchanged.
Buffering Strategies

2► Buffer pooling:
– There is a pool of buffers.
– When a request for a sector is received, the O.S. first looks to see whether
  that sector is already in some buffer.
– If it is not there, the O.S. brings the sector into some free buffer. If no free
  buffer exists, it must choose an occupied buffer, free it, then reuse it.
  Usually the LRU (least recently used) strategy is used: replace the buffer that
  has not been referenced for the longest time.
Buffering Strategies
Move Mode (using both the system buffer and the program buffer)
A situation in which data must always be moved from the system buffer to the
program buffer and vice versa, so that the OS can manipulate the program data.

(Figure: Application buffer ↔ OS buffer)
Buffering Strategies
Locate Mode (using the system buffer only, or the program buffer only, to handle
application I/O operations)
Ex: perform I/O directly between secondary storage and the program buffer, or let
the application work directly in the system buffers.

(Figure: an example of using only the system buffers; Application ↔ OS)
