Brent Theorem

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 2

Brent's theorem

T = t + (m-t)/p
Brent's theorem specifies for a sequential algorithm with t time steps, and a total of m operations, that a run time T is definitely possible on a shared memory machine (PRAM) with p processors. There may be an algorithm that solves this problem faster, or it may be possible to implement this algorithm faster (by scheduling instruction differently to minimise idle processors, for instance), but it is definitely possible to implement this algorithm in this time given p processors. Key to understanding Brent's theorem is understanding time steps. In a single time step every instruction that has no dependencies is executed, and therefore t is equal to the length of the longest chain of instructions that depend on the results of other instructions (as any shorter chains will be finished executing by (or before) the time the longest chain has). Say we are summing an array. We could step through the array, adding each value in turn to an accumulator. In pseudocode:
for (i=0;i < length(a); i++) { sum = sum + a(i); }

Using this algorithm, each add operation depends on the result of the previous one, forming a chain of length n, thus t=n. There are n operations, so m = n. T = n + 0/p. (m-t) = 0, so no matter how many processors are available, this algorithm will take time n. Now consider the adding the array recursively: sum(a) = ( (A0 + A1) + (A2 + A3) ) + ( (A4 + A5) + (A6 + A7) ) etc... We can add A2 to A3 without needing to know what A0 + A1 is. To calculate the sum from 0-3, we only need the results of A0 + A1, and A2 + A3. Instead of one chain of length 4, we have many chains of length 2. For an array of length n, the longest chain(s) will be of length log(n). t = log(n). m = n (as before). T = log(n) + (n - log(n))/p. This tells us many useful things about the algorithm:

No matter how many processors used, there can be no implementation of this algorithm that runs faster than Olog(n). If we have n processors, the algorithm can be implemented in log(n) time

If we have log(n) processors, the algorithm can be implemented in 2log(n) time (asymptotically this is the same as log(n) time) If we have one processor, the algorithm can be implemented in n time.

If we consider the amount of work done in each case, with one processor, we do n work, with log(n) processors we do n work, but with n processors we do nlog(n) work. The implementations with 1 or log(n) processors, therefore are cost optimal, while the implementation with n processors is not. It is important to remember that Brent's theorem does not tell us how to implement any of these algorithms in parallel; it merely tells us what is possible. The Brent's theorem implementation may be hideously ugly compared to the naive implementation.
Brent's theorem and work efficiency Brent's theorem: any depth-d, size-N combinational circuit with bounded fan-in can be simulated by a p processor CREW algorithm in O(N/p + d) time Brent's theorem can be extended to EREW simulations when a combinational circuit has O(1) fan-out. unbounded fan-in can sometimes be handled by a CRCW simulation work efficient: if a p-processor PRAM algorithm A runs in time t, then for any p' < p, there is an p'processor PRAM algorithm A' for the same problem that runs in time O(pt/p') for p' = o(p), guarantees that there is a work-efficient algorithm; for p' = (p), no work-efficient algorithms exist Work-efficient parallel prefix computation A work-efficient randomized EREW parallel prefix algorithm uses (N/lg N) processors and runs in O(lg N) time with high probability

Definition
Assume a parallel computer where each processor can perform an arithmetic operation in unit time. Further, assume that the computer has exactly enough processors to exploit the maximum concurrency in an algorithm with N operations, such that T time steps suffice. Brent's Theorem says that a similar computer with fewer processors, P, can perform the algorithm in time

where P is less than or equal to the number of processors needed to exploit the maximum concurrency in the algorithm.

You might also like