Unit 4 - 5: I/O Parallelism
Definition:
I/O parallelism reduces the time needed to retrieve relations from disk by
distributing (partitioning) the data across multiple disks.
Common Form:
Horizontal Partitioning: This divides the tuples of a relation among several disks,
so that each tuple resides entirely on one disk.
Partitioning Techniques
Assuming there are n disks (D0, D1, ..., D(n−1)), three basic data-partitioning
strategies are available (a short code sketch follows the three descriptions below):
Round-Robin Partitioning:
Process: Tuples are distributed to disks in a cyclic manner.
Method: The i-th tuple goes to disk D(i mod n).
Outcome: Ensures an even distribution of tuples across all disks, with
each disk having approximately the same number of tuples.
Hash Partitioning:
Process: Uses one or more attributes of the relation as partitioning attributes.
Method: A hash function maps each tuple based on these attributes to a disk. If the
hash function returns i, the tuple is placed on disk Di.
Outcome: Distributes tuples roughly evenly across the disks, provided the hash function spreads the partitioning-attribute values uniformly.
Range Partitioning:
Process: Distributes tuples based on contiguous ranges of attribute values.
Method: A partitioning attribute A is chosen, along with a partitioning vector, e.g.,
[v0, v1, ..., v(n−2)].
If t[A] < v0, the tuple goes to disk D0.
If t[A] ≥ v(n−2), it goes to disk D(n−1).
If vi ≤ t[A] < v(i+1), it goes to disk D(i+1).
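All three strategies reduce to a simple tuple-to-disk mapping rule. Below is a minimal Python sketch of those rules; the four-disk setup, the numeric partitioning vector, and the use of Python's built-in hash() as a stand-in for the system's hash function are illustrative assumptions, not part of the original text.

```python
import bisect

N_DISKS = 4  # n disks D0 .. D(n-1); 4 is an arbitrary choice for illustration

def round_robin_disk(i: int) -> int:
    """Round-robin: the i-th tuple of the relation goes to disk D(i mod n)."""
    return i % N_DISKS

def hash_disk(partitioning_value) -> int:
    """Hash partitioning: a hash of the partitioning attribute chooses the disk.
    Python's built-in hash() stands in for the database's hash function."""
    return hash(partitioning_value) % N_DISKS

# Range partitioning: partitioning vector [v0, v1, ..., v(n-2)] on attribute A.
PARTITION_VECTOR = [10, 20, 30]  # example boundaries for a numeric attribute

def range_disk(value) -> int:
    """t[A] < v0 -> D0; vi <= t[A] < v(i+1) -> D(i+1); t[A] >= v(n-2) -> D(n-1)."""
    return bisect.bisect_right(PARTITION_VECTOR, value)

# Example: tuple number 7 goes to D3 under round-robin; a tuple with A = 25 goes
# to D2 under range partitioning; hash_disk sends equal values to the same disk.
```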
Comparison of Partitioning Techniques
Efficiency of Partitioning Techniques:
Round-Robin:
Best for sequentially reading the entire relation.
Poorly suited to point and range queries, since all disks must be searched.
Hash Partitioning:
Ideal for point queries based on the partitioning attribute (e.g., searching by telephone
number).
Allows quick access to the relevant disk, reducing the overhead of searching multiple disks.
Suitable for sequential scans if the hash function distributes tuples evenly.
Not effective for point queries on non-partitioning attributes or for range queries, as it
typically requires scanning all disks.
Range Partitioning:
Well-suited for both point and range queries on the partitioning attribute.
Quickly locates the relevant disk(s) using the partitioning vector, leading to higher
throughput and better response times.
Can encounter I/O bottlenecks (hot spots) if many tuples are in the queried range, causing
delays on a few disks.
Hash and round-robin partitioning can handle such queries faster as they distribute the load
across all disks.
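As a rough illustration of this comparison for range queries, the hypothetical sketch below (reusing the partitioning vector from the earlier sketch) finds the contiguous band of disks a range query must touch under range partitioning, whereas round-robin and hash partitioning must scan every disk.

```python
import bisect

PARTITION_VECTOR = [10, 20, 30]          # v0 .. v(n-2) for n = 4 disks
N_DISKS = len(PARTITION_VECTOR) + 1

def range_query_disks_range_partitioned(lo, hi):
    """Disks that can hold tuples with lo <= t[A] <= hi under range partitioning."""
    first = bisect.bisect_right(PARTITION_VECTOR, lo)   # disk holding the low end
    last = bisect.bisect_right(PARTITION_VECTOR, hi)    # disk holding the high end
    return list(range(first, last + 1))

def range_query_disks_round_robin_or_hash(lo, hi):
    """Neither scheme clusters nearby attribute values, so all disks are scanned."""
    return list(range(N_DISKS))

# The query 12 <= A <= 22 touches only disks D1 and D2 under range partitioning,
# but all four disks under round-robin or hash partitioning.
print(range_query_disks_range_partitioned(12, 22))      # [1, 2]
print(range_query_disks_round_robin_or_hash(12, 22))    # [0, 1, 2, 3]
```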
Impact on Other Operations:
The choice of partitioning technique also influences
relational operations like joins, making hash or range
partitioning generally preferable to round-robin
partitioning.
Partitioning Strategy Based on Relation Size:
For relations with few tuples, assign them to a single
disk.
For larger relations, distribute them across all available
disks.
If a relation has m disk blocks and n disks, allocate it
across min(m, n) disks for optimal performance.
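For example, a relation occupying 5 disk blocks on a system with 12 disks would be
allocated across min(5, 12) = 5 disks, since spreading it further would leave some
disks holding no blocks of the relation.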
Handling of Skew in Partitioning
Definition of Skew:
Skew occurs when tuples are unevenly distributed across partitions, leading to
some partitions containing many tuples and others containing few.
Types of Skew:
Attribute-value Skew: Occurs when certain values of the partitioning attribute
are common, causing all tuples with that value to cluster in a single partition.
Partition Skew: Refers to load imbalance in partitioning that can occur even
without attribute-value skew.
Impact of Skew:
Even a small amount of skew can significantly reduce performance, especially as
the degree of parallelism increases.
For example, if a relation with 1000 tuples is divided into 10 partitions and one
partition has 200 tuples, completion time is governed by that largest partition, so the
speedup is 1000/200 = 5 instead of 10. With 100 partitions, if one partition has 40
tuples, the speedup drops to 1000/40 = 25 instead of 100.
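A minimal sketch of the speedup arithmetic above, assuming completion time is proportional to the size of the largest partition; the partition sizes are illustrative.

```python
def skewed_speedup(partition_sizes):
    """Speedup over a single processor when partitions are processed in parallel
    and the largest partition dictates completion time."""
    return sum(partition_sizes) / max(partition_sizes)

# 1000 tuples in 10 partitions, one of which holds 200 tuples:
print(skewed_speedup([200] + [800 / 9] * 9))     # ~5 instead of the ideal 10
# 1000 tuples in 100 partitions, one of which holds 40 tuples:
print(skewed_speedup([40] + [960 / 99] * 99))    # ~25 instead of the ideal 100
```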
Example: Attribute-value Skew
Consider a database of customer orders where the data is partitioned based on the "City"
attribute, so that each partition stores orders from a specific city. Here’s how attribute-value
skew might occur:
Data Distribution:
Let's say the database has customers from various cities, but a large proportion of customers come from
New York.
As a result, a high number of orders have City = 'New York'.
Impact of Skew:
When data is partitioned by the City attribute, the New York partition will hold an overwhelming
number of tuples (order records), whereas other partitions (for cities like Los Angeles or Chicago)
might hold significantly fewer tuples.
This creates an imbalance, with one partition (New York) being heavily loaded while others are
underutilized.
Consequences:
Performance bottleneck: The New York partition becomes slower to access and process because it has to
handle many more tuples than other partitions.
Storage inefficiency: This partition may require more storage space or processing resources, affecting
overall database efficiency.
This issue can be especially problematic in distributed databases or parallel processing systems,
where even distribution is crucial for optimal performance. To mitigate attribute-value skew,
one might hash- or range-partition on a different attribute, add a secondary attribute (such as
the order ID) to the partitioning key, or apply the skew-handling techniques described below.
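To make the imbalance concrete, the hypothetical sketch below tallies tuples per partition when partitioning on City; the per-city order counts are invented for illustration.

```python
from collections import Counter

# Invented order counts per city; the workload is heavily skewed toward New York.
orders_per_city = Counter({"New York": 7000, "Los Angeles": 1500,
                           "Chicago": 1000, "Houston": 500})

# Partitioning on City makes each city a partition, so partition sizes equal city counts.
total = sum(orders_per_city.values())
largest = max(orders_per_city.values())
print(largest / total)   # the New York partition holds 70% of all tuples
print(total / largest)   # parallel speedup of only ~1.4 on 4 partitions, not 4
```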
Handling of Skew in Partitioning
Constructing a Balanced Range-partitioning Vector:
A balanced partitioning vector can be created by sorting the relation on the
partitioning attributes and then placing a partition boundary after every 1/n of the tuples.
This method incurs extra I/O overhead due to the initial sort.
Using Frequency Tables (Histograms):
To minimize I/O overhead, a histogram can be constructed that records the frequency of
attribute values.
The histogram occupies little space and can be stored in the system catalog, and a balanced
range-partitioning function can be derived from it (see the sketch after this list).
If no histogram is stored, one can be constructed by sampling the relation.
Virtual Processors Approach:
Use a number of virtual (logical) processors that is several times larger than the number of
real physical processors.
Any of the partitioning and query-evaluation techniques can be used, mapping tuples to virtual
processors rather than directly to real ones.
Virtual processors are then allocated to real processors in round-robin fashion.
This method helps distribute the workload evenly across processors, mitigating the impact of
skew.
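A hypothetical Python sketch of the last two ideas: deriving a balanced range-partitioning vector from a small histogram (avoiding the full sort), and mapping virtual processors onto real processors round-robin. The histogram values and partition counts are invented for illustration.

```python
def vector_from_histogram(histogram, n_partitions):
    """Build a range-partitioning vector [v0, ..., v(n-2)] from (value, frequency)
    pairs so that each range receives roughly the same number of tuples.
    A boundary value itself belongs to the next partition, matching the rule
    vi <= t[A] < v(i+1) -> D(i+1)."""
    total = sum(freq for _, freq in histogram)
    target = total / n_partitions                 # ideal tuples per partition
    vector, seen = [], 0
    for value, freq in sorted(histogram):         # the histogram itself is tiny
        # place a boundary before this value once earlier values fill a partition
        while len(vector) < n_partitions - 1 and seen >= target * (len(vector) + 1):
            vector.append(value)
        seen += freq
    return vector

def virtual_to_real(virtual_partition: int, n_real: int) -> int:
    """Round-robin allocation of virtual processors to real processors."""
    return virtual_partition % n_real

# A skewed frequency table for a numeric attribute, split into 4 balanced ranges:
hist = [(10, 300), (20, 200), (30, 150), (40, 150), (50, 100), (60, 100)]
print(vector_from_histogram(hist, 4))              # [20, 30, 50]
# 16 virtual partitions spread over 4 real processors:
print([virtual_to_real(v, 4) for v in range(16)])  # 0, 1, 2, 3, 0, 1, 2, 3, ...
```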