
Matrix multiplication algorithm

Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making
matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many
fields including scientific computing and pattern recognition and in seemingly unrelated problems such as counting the paths through
a graph.[1] Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel
and distributed systems, where the computational work is spread over multiple processors (perhaps over a network).

Directly applying the mathematical definition of matrix multiplication gives an algorithm that takes time on the order of n^3 field
operations to multiply two n × n matrices over that field (Θ(n^3) in big O notation). Better asymptotic bounds on the time required to
multiply matrices have been known since Strassen's algorithm in the 1960s, but the optimal time (that is, the computational
complexity of matrix multiplication) remains unknown. As of April 2024, the best announced bound on the asymptotic complexity of
a matrix multiplication algorithm is O(n^2.371552) time, given by Williams, Xu, Xu, and Zhou.[2][3] This improves on the bound of
O(n^2.3728596) time, given by Alman and Williams.[4][5] However, this algorithm is a galactic algorithm because of the large
constants and cannot be realized practically.

Iterative algorithm
The definition of matrix multiplication is that if C = AB for an n × m matrix A and an m × p matrix B, then C is an n × p matrix
with entries

    c_ij = Σ_{k=1}^{m} a_ik b_kj.

From this, a simple algorithm can be constructed which loops over the indices i from 1 through n and j from 1 through p, computing
the above using a nested loop:

Input: matrices A and B
Let C be a new matrix of the appropriate size
For i from 1 to n:
    For j from 1 to p:
        Let sum = 0
        For k from 1 to m:
            Set sum ← sum + A_ik × B_kj
        Set C_ij ← sum
Return C

This algorithm takes time Θ(nmp) (in asymptotic notation).[1] A common simplification for the purpose of algorithm analysis is to
assume that the inputs are all square matrices of size n × n, in which case the running time is Θ(n^3), i.e., cubic in n.[6]
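As a concrete illustration, the pseudocode above translates line for line into the following Python sketch; the function name matmul_iterative and the use of nested lists of rows for matrices are assumptions made here for readability, not part of the sources.

def matmul_iterative(A, B):
    """Multiply an n x m matrix A by an m x p matrix B straight from the definition.

    Matrices are nested lists of rows; the triple loop runs in Theta(nmp) time.
    """
    n, m, p = len(A), len(A[0]), len(B[0])
    if len(B) != m:
        raise ValueError("inner dimensions must agree")
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0
            for k in range(m):
                total += A[i][k] * B[k][j]
            C[i][j] = total
    return C

# Example: a 2 x 3 matrix times a 3 x 2 matrix.
print(matmul_iterative([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
# [[58, 64], [139, 154]]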

Cache behavior
The three loops in iterative matrix multiplication can be arbitrarily swapped with each other without an effect on correctness or
asymptotic running time. However, the order can have a considerable impact on practical performance due to the memory access
patterns and cache use of the algorithm;[1] which order is best also depends on whether the matrices are stored in row-major order,
column-major order, or a mix of both.
In particular, in the idealized case of a fully associative cache consisting of M bytes and b bytes per cache line (i.e. M/b cache
lines), the above algorithm is sub-optimal for A and B stored in row-major order. When n > M/b, every iteration of the inner loop (a
simultaneous sweep through a row of A and a column of B) incurs a cache miss when accessing an element of B. This means that the
algorithm incurs Θ(n^3) cache misses in the worst case. As of 2010, the speed of memories compared to that of processors is such
that the cache misses, rather than the actual calculations, dominate the running time for sizable matrices.[7]

[Figure: Illustration of row- and column-major order]

The optimal variant of the iterative algorithm for A and B in row-major layout is a tiled version, where the matrix is implicitly
divided into square tiles of size √M by √M:[7][8]
Input: matrices A and B
Let C be a new matrix of the appropriate size
Pick a tile size T = Θ(√M)
For I from 1 to n in steps of T:
    For J from 1 to p in steps of T:
        For K from 1 to m in steps of T:
            Multiply A_{I:I+T, K:K+T} and B_{K:K+T, J:J+T} into C_{I:I+T, J:J+T}, that is:
            For i from I to min(I + T, n):
                For j from J to min(J + T, p):
                    Let sum = 0
                    For k from K to min(K + T, m):
                        Set sum ← sum + A_ik × B_kj
                    Set C_ij ← C_ij + sum
Return C

In the idealized cache model, this algorithm incurs only Θ(n^3/(b√M)) cache misses; the divisor b√M amounts to several orders of
magnitude on modern machines, so that the actual calculations dominate the running time, rather than the cache misses.[7]
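The same idea can be written as a hedged Python sketch; the tile size T is a tuning parameter (ideally on the order of √M, so that the working tiles fit in fast memory), and the default value of 32 is only an illustrative guess, not a recommendation from the sources.

def matmul_tiled(A, B, T=32):
    """Blocked (tiled) multiply of an n x m matrix A by an m x p matrix B.

    Visits the index space tile by tile; the arithmetic is identical to the
    plain triple loop, only the order of the memory accesses changes.
    """
    n, m, p = len(A), len(A[0]), len(B[0])
    C = [[0] * p for _ in range(n)]
    for I in range(0, n, T):
        for J in range(0, p, T):
            for K in range(0, m, T):
                # Multiply the (I, K) tile of A by the (K, J) tile of B and
                # accumulate the result into the (I, J) tile of C.
                for i in range(I, min(I + T, n)):
                    for j in range(J, min(J + T, p)):
                        s = 0
                        for k in range(K, min(K + T, m)):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C

Only the loop structure changes relative to the plain triple loop; in compiled code this blocked loop nest is the pattern that optimized BLAS-style kernels build on.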

Divide-and-conquer algorithm
An alternative to the iterative algorithm is the divide-and-conquer algorithm for matrix multiplication. This relies on the block
partitioning

    A = [[A11, A12], [A21, A22]],   B = [[B11, B12], [B21, B22]],   C = [[C11, C12], [C21, C22]],

which works for all square matrices whose dimensions are powers of two, i.e., the shapes are 2^n × 2^n for some n. The matrix product
is now

    C11 = A11B11 + A12B21        C12 = A11B12 + A12B22
    C21 = A21B11 + A22B21        C22 = A21B12 + A22B22,

which consists of eight multiplications of pairs of submatrices, followed by an addition step. The divide-and-conquer algorithm
computes the smaller multiplications recursively, using the scalar multiplication c11 = a11b11 as its base case.

The complexity of this algorithm as a function of n is given by the recurrence[6]

    T(1) = Θ(1),
    T(n) = 8T(n/2) + Θ(n^2),

accounting for the eight recursive calls on matrices of size n/2 and Θ(n^2) to sum the four pairs of resulting matrices element-wise.
Application of the master theorem for divide-and-conquer recurrences shows this recursion to have the solution Θ(n^3), the same as
the iterative algorithm.[6]
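A minimal Python sketch of this recursion follows, assuming square inputs whose dimension is a power of two; the split and add helpers are written out inline for self-containment and are illustrative rather than taken from the cited sources.

def matmul_dc(A, B):
    """Divide-and-conquer multiply for square matrices whose size is a power of two."""
    n = len(A)
    if n == 1:  # base case: scalar multiplication
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def split(M):
        # Return the four h x h blocks M11, M12, M21, M22.
        return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
                [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Eight recursive products, combined into the four blocks of C.
    C11 = add(matmul_dc(A11, B11), matmul_dc(A12, B21))
    C12 = add(matmul_dc(A11, B12), matmul_dc(A12, B22))
    C21 = add(matmul_dc(A21, B11), matmul_dc(A22, B21))
    C22 = add(matmul_dc(A21, B12), matmul_dc(A22, B22))
    # Reassemble C from its four blocks.
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + [r1 + r2 for r1, r2 in zip(C21, C22)]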

Non-square matrices
A variant of this algorithm that works for matrices of arbitrary shapes and is faster in practice[7] splits matrices in two instead of four
submatrices, as follows.[9] Splitting a matrix now means dividing it into two parts of equal size, or as close to equal sizes as possible
in the case of odd dimensions.

Inputs: matrices A of size n × m, B of size m × p.

Base case: if max(n, m, p) is below some threshold, use an unrolled version of the iterative algorithm.
Recursive cases (a runnable sketch in Python follows this list):

If max(n, m, p) = n, split A horizontally into an upper half A1 and a lower half A2; then AB is obtained by stacking A1B on top of A2B.

Else, if max(n, m, p) = p, split B vertically into a left half B1 and a right half B2; then AB is obtained by placing AB1 and AB2 side by side.

Otherwise, max(n, m, p) = m. Split A vertically into A1 and A2 and B horizontally into B1 and B2; then AB = A1B1 + A2B2.
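The following Python sketch of the three cases reuses the matmul_iterative routine from the earlier sketch as its base case; the threshold of 64 is an illustrative choice, and a real implementation would tune it (and unroll the base case) for the target machine.

def matmul_split(A, B, threshold=64):
    """Recursive multiply for arbitrary n x m and m x p matrices,
    splitting the largest dimension in half at each step."""
    n, m, p = len(A), len(A[0]), len(B[0])
    if max(n, m, p) <= threshold:
        return matmul_iterative(A, B)  # base case: plain triple loop
    if n >= m and n >= p:
        # Split A horizontally: the two halves of A give the two row blocks of C.
        top = matmul_split(A[:n // 2], B, threshold)
        bottom = matmul_split(A[n // 2:], B, threshold)
        return top + bottom
    if p >= m:
        # Split B vertically: each half of B gives the matching columns of C.
        left = matmul_split(A, [row[:p // 2] for row in B], threshold)
        right = matmul_split(A, [row[p // 2:] for row in B], threshold)
        return [l + r for l, r in zip(left, right)]
    # Otherwise split A vertically and B horizontally and add the two products.
    C1 = matmul_split([row[:m // 2] for row in A], B[:m // 2], threshold)
    C2 = matmul_split([row[m // 2:] for row in A], B[m // 2:], threshold)
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(C1, C2)]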

Cache behavior
The cache miss rate of recursive matrix multiplication is the same as that of a tiled iterative version, but unlike that algorithm, the
recursive algorithm is cache-oblivious:[9] there is no tuning parameter required to get optimal cache performance, and it behaves well
in a multiprogramming environment where cache sizes are effectively dynamic due to other processes taking up cache space.[7] (The
simple iterative algorithm is cache-oblivious as well, but much slower in practice if the matrix layout is not adapted to the algorithm.)

The number of cache misses incurred by this algorithm, on a machine with M lines of ideal cache, each of size b bytes, is bounded
by[9]: 13

    Θ(m + n + p + (mn + np + mp)/b + mnp/(b√M)).

Sub-cubic algorithms
Algorithms exist that provide better running times than the straightforward ones. The first to be discovered was Strassen's
algorithm, devised by Volker Strassen in 1969 and often referred to as "fast matrix multiplication". It is based on a way of
multiplying two 2 × 2 matrices which requires only 7 multiplications (instead of the usual 8), at the expense of several additional
addition and subtraction operations. Applying this recursively gives an algorithm with a multiplicative cost of O(n^log₂7) ≈
O(n^2.807). Strassen's algorithm is more complex, and the numerical stability is reduced compared to the naïve algorithm,[10] but it
is faster in cases where n > 100 or so[1] and appears in several libraries, such as BLAS.[11] It is very useful for large matrices over
exact domains such as finite fields, where numerical stability is not an issue.

[Figure: Improvement of estimates of the exponent ω over time for the computational complexity of matrix multiplication.]
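For concreteness, the seven products of the 2 × 2 block scheme can be written out as a Python sketch, assuming square matrices whose dimension is a power of two and using the conventional names M1 through M7 for the intermediate products; practical implementations switch to the iterative algorithm below some cutoff size rather than recursing down to 1 × 1 blocks.

def strassen(A, B):
    """Strassen multiply for square matrices whose size is a power of two."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def split(M):
        return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
                [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y):
        return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven recursive products instead of eight.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Recombine the products into the four blocks of C.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + [r1 + r2 for r1, r2 in zip(C21, C22)]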
Since Strassen's algorithm is actually used in practical numerical software and computer algebra systems, improving on the constants
hidden in the big O notation has its merits. A table that compares key aspects of the improved versions based on recursive
multiplication of 2×2-block matrices via 7 block matrix multiplications follows. As usual, n gives the dimensions of the matrix and
M designates the memory size.
For comparison, the naïve algorithm uses n^3 multiplications and 2n^3 − n^2 arithmetic operations in total; the crossover point at
which the Winograd variant becomes cheaper is around n ≈ 290.

Progress for Strassen-like recursive 2×2-block matrix multiplication:

Year | Reference              | Matrix multiplications per step | Matrix additions per step
1969 | Strassen[12]           | 7                               | 18
1971 | Winograd[13]           | 7                               | 15
2017 | Karstadt, Schwartz[14] | 7                               | 12
2023 | Schwartz, Vaknin[15]   | 7                               | 12

It is known that a Strassen-like algorithm with a 2×2-block matrix step requires at least 7 block matrix multiplications. In 1976
Probert[16] showed that such an algorithm requires at least 15 additions (including subtractions); however, a hidden assumption was
that the blocks and the 2×2-block matrix are represented in the same basis. Karstadt and Schwartz computed in different bases and
traded 3 additions for less expensive basis transformations. They also proved that one cannot go below 12 additions per step using
different bases. In subsequent work, Beniamini et al.[17] applied this base change trick to more general decompositions than 2×2-block
matrices and improved the leading constant for their run times.

It is an open question in theoretical computer science how well Strassen's algorithm can be improved in terms of asymptotic
complexity. The matrix multiplication exponent, usually denoted ω, is the smallest real number for which any two n × n matrices over a
field can be multiplied together using n^(ω + o(1)) field operations. The current best bound on ω is ω < 2.371552, by Williams, Xu, Xu,
and Zhou.[2][4] This algorithm, like all other recent algorithms in this line of research, is a generalization of the Coppersmith–
Winograd algorithm, which was given by Don Coppersmith and Shmuel Winograd in 1990.[18] The conceptual idea of these
algorithms is similar to Strassen's algorithm: a way is devised for multiplying two k × k-matrices with fewer than k^3 multiplications,
and this technique is applied recursively. However, the constant coefficient hidden by the Big O notation is so large that these
algorithms are only worthwhile for matrices that are too large to handle on present-day computers.[19][20]

Freivalds' algorithm is a simple Monte Carlo algorithm that, given matrices A, B and C, verifies in Θ(n^2) time whether AB = C.
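A minimal sketch of Freivalds' check in Python: multiply both sides by a random 0/1 vector and compare, repeating to shrink the error probability; the function name and the default of 10 rounds are illustrative choices, not prescribed by the sources.

import random

def freivalds(A, B, C, rounds=10):
    """Probabilistically verify that AB == C for n x n matrices.

    Each round multiplies by a random 0/1 vector in Theta(n^2) time; when
    AB != C a single round detects the mismatch with probability at least
    1/2, so the error probability after `rounds` rounds is at most 2**-rounds.
    """
    n = len(A)
    def matvec(M, v):
        return [sum(M[i][k] * v[k] for k in range(n)) for i in range(n)]
    for _ in range(rounds):
        r = [random.randint(0, 1) for _ in range(n)]
        # Compare A(Br) with Cr; both sides take Theta(n^2) operations.
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False  # definitely AB != C
    return True  # probably AB == C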

AlphaTensor
In 2022, DeepMind introduced AlphaTensor, a neural network that used a single-player game analogy to invent thousands of matrix
multiplication algorithms, including some previously discovered by humans and some that were not.[21] Operations were restricted to
the non-commutative ground field (normal arithmetic) and the finite field Z/2Z (mod 2 arithmetic). The best "practical" (explicit low-
rank decomposition of a matrix multiplication tensor) algorithm found ran in O(n^2.778).[22] Finding low-rank decompositions of such
tensors (and beyond) is NP-hard; optimal multiplication even for 3×3 matrices remains unknown, even over a commutative field.[22] On
4×4 matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required
with Strassen's algorithm of 1969, albeit restricted to mod 2 arithmetic. Similarly, AlphaTensor solved 5×5 matrices with 96 rather
than Strassen's 98 steps. Based on the surprising discovery that such improvements exist, other researchers were quickly able to find a
similar independent 4×4 algorithm, and separately tweaked DeepMind's 96-step 5×5 algorithm down to 95 steps in mod 2 arithmetic
and to 97[23] in normal arithmetic.[24] Some algorithms were completely new: for example, (4, 5, 5) was improved to 76 steps from a
baseline of 80 in both normal and mod 2 arithmetic.

Parallel and distributed algorithms

Shared-memory parallelism
The divide-and-conquer algorithm sketched earlier can be parallelized in two ways for shared-memory multiprocessors. These are
based on the fact that the eight recursive matrix multiplications in the block decomposition of C = AB given above can be performed
independently of each other, as can the four summations (although the algorithm needs to "join" the multiplications before doing the
summations). Exploiting the full parallelism of the problem, one obtains an algorithm that can be expressed in fork–join style
pseudocode:[25]

Procedure multiply(C, A, B):

    Base case: if n = 1, set c11 ← a11 × b11 (or multiply a small block matrix).
    Otherwise, allocate space for a new matrix T of shape n × n, then:
        Partition A into A11, A12, A21, A22.
        Partition B into B11, B12, B21, B22.
        Partition C into C11, C12, C21, C22.
        Partition T into T11, T12, T21, T22.
        Parallel execution:
            Fork multiply(C11, A11, B11).
            Fork multiply(C12, A11, B12).
            Fork multiply(C21, A21, B11).
            Fork multiply(C22, A21, B12).
            Fork multiply(T11, A12, B21).
            Fork multiply(T12, A12, B22).
            Fork multiply(T21, A22, B21).
            Fork multiply(T22, A22, B22).
        Join (wait for parallel forks to complete).
        add(C, T).
        Deallocate T.

Procedure add(C, T) adds T into C, element-wise:

    Base case: if n = 1, set c11 ← c11 + t11 (or do a short loop, perhaps unrolled).
    Otherwise:
        Partition C into C11, C12, C21, C22.
        Partition T into T11, T12, T21, T22.
        In parallel:
            Fork add(C11, T11).
            Fork add(C12, T12).
            Fork add(C21, T21).
            Fork add(C22, T22).
        Join.

Here, fork is a keyword that signals a computation may be run in parallel with the rest of the function call, while join waits for all
previously "forked" computations to complete. partition achieves its goal by pointer manipulation only.
This algorithm has a critical path length of Θ(log² n) steps, meaning it takes that much time on an ideal machine with an infinite
number of processors; therefore, it has a maximum possible speedup of Θ(n^3/log² n) on any real computer. The algorithm isn't
practical due to the communication cost inherent in moving data to and from the temporary matrix T, but a more practical variant
achieves Θ(n^2) speedup, without using a temporary matrix.[25]
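The sketch below illustrates only the fork–join structure in Python, not the in-place formulation with the temporary matrix T from the pseudocode above: it forks a single level of the recursion (the eight independent block products, each computed with the serial matmul_dc sketch from earlier) and joins before the block additions. Thread pools and CPython's global interpreter lock make this a structural illustration rather than a source of real speedup; production code would use a work-stealing runtime such as Cilk.

from concurrent.futures import ThreadPoolExecutor

def parallel_multiply(A, B, workers=8):
    """One level of fork-join parallelism for square power-of-two matrices (n >= 2).

    The eight block products are independent, so they are submitted as
    parallel tasks ("fork") and collected ("join") before the block additions.
    """
    n = len(A)
    h = n // 2
    def split(M):
        return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
                [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    pairs = [(A11, B11), (A12, B21), (A11, B12), (A12, B22),
             (A21, B11), (A22, B21), (A21, B12), (A22, B22)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fork the eight independent products, then join.
        P = list(pool.map(lambda ab: matmul_dc(*ab), pairs))
    C11, C12 = add(P[0], P[1]), add(P[2], P[3])
    C21, C22 = add(P[4], P[5]), add(P[6], P[7])
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + [r1 + r2 for r1, r2 in zip(C21, C22)]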

Communication-avoiding and distributed algorithms


On modern architectures with hierarchical memory, the cost of loading and storing input
matrix elements tends to dominate the cost of arithmetic. On a single machine this is the
amount of data transferred between RAM and cache, while on a distributed memory multi-
node machine it is the amount transferred between nodes; in either case it is called the
communication bandwidth. The naïve algorithm using three nested loops uses Ω(n^3)
communication bandwidth.

Cannon's algorithm, also known as the 2D algorithm, is a communication-avoiding algorithm that partitions each input matrix into a
block matrix whose elements are submatrices of size √(M/3) by √(M/3), where M is the size of fast memory.[26] The naïve algorithm
is then used over the block matrices, computing products of submatrices entirely in fast memory. This reduces communication
bandwidth to O(n^3/√M), which is asymptotically optimal (for algorithms performing Ω(n^3) computation).[27][28]

In a distributed setting with p processors arranged in a √p by √p 2D mesh, one submatrix of the result can be assigned to each
processor, and the product can be computed with each processor transmitting O(n^2/√p) words, which is asymptotically optimal
assuming that each node stores the minimum O(n^2/p) elements.[28] This can be improved by the 3D algorithm, which arranges the
processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are
then generated by performing a reduction over each row.[29] This algorithm transmits O(n^2/p^(2/3)) words per processor, which is
asymptotically optimal.[28] However, this requires replicating each input matrix element p^(1/3) times, and so requires a factor of
p^(1/3) more memory than is needed to store the inputs. This algorithm can be combined with Strassen to further reduce runtime.[29]
"2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth.[30] On modern distributed
computing environments such as MapReduce, specialized multiplication algorithms have been developed.[31]

[Figure: Block matrix multiplication. In the 2D algorithm, each processor is responsible for one submatrix of C. In the 3D algorithm,
every pair of submatrices from A and B that is multiplied is assigned to one processor.]

Algorithms for meshes


There are a variety of algorithms for multiplication on meshes. For multiplication of two n×n matrices on a standard two-dimensional
mesh using the 2D Cannon's algorithm, one can complete the multiplication in 3n−2 steps, although this is reduced to half this
number for repeated computations.[32] The standard array is inefficient because the data from the two matrices does not arrive
simultaneously and it must be padded with zeroes.

The result is even faster on a two-layered cross-wired mesh, where only 2n-1 steps are
needed.[33] The performance improves further for repeated computations leading to 100%
efficiency.[34] The cross-wired mesh array may be seen as a special case of a non-planar
(i.e. multilayered) processing structure.[35]

[Figure: Matrix multiplication completed in 2n−1 steps for two n×n matrices on a cross-wired mesh.]

See also
Computational complexity of mathematical operations
Computational complexity of matrix multiplication
CYK algorithm § Valiant's algorithm
Matrix chain multiplication
Method of Four Russians
Multiplication algorithm
Sparse matrix–vector multiplication

References
1. Skiena, Steven (2012). "Sorting and Searching". The Algorithm Design Manual (https://archive.org/details/algorithmde
signm00skie_772). Springer. pp. 45 (https://archive.org/details/algorithmdesignm00skie_772/page/n56)–46, 401–3.
doi:10.1007/978-1-84800-070-4_4 (https://doi.org/10.1007%2F978-1-84800-070-4_4). ISBN 978-1-84800-069-8.
2. Williams, Virginia Vassilevska; Xu, Yinzhan; Xu, Zixuan; Zhou, Renfei (2024), New Bounds for Matrix Multiplication:
from Alpha to Omega, arXiv:2307.07970 (https://arxiv.org/abs/2307.07970)
3. Duan, Ran; Wu, Hongxun; Zhou, Renfei (2022), Faster Matrix Multiplication via Asymmetric Hashing,
arXiv:2210.10173 (https://arxiv.org/abs/2210.10173)
4. Alman, Josh; Williams, Virginia Vassilevska (2024), "A Refined Laser Method and Faster Matrix Multiplication" (https://
www.siam.org/conferences/cm/program/accepted-papers/soda21-accepted-papers), Theoretics, arXiv:2010.05846 (ht
tps://arxiv.org/abs/2010.05846), doi:10.46298/theoretics.24.21 (https://doi.org/10.46298%2Ftheoretics.24.21)
5. Hartnett, Kevin (23 March 2021). "Matrix Multiplication Inches Closer to Mythic Goal" (https://www.quantamagazine.or
g/mathematicians-inch-closer-to-matrix-multiplication-goal-20210323/). Quanta Magazine. Retrieved 2021-04-01.
6. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms
(3rd ed.). MIT Press and McGraw-Hill. pp. 75–79. ISBN 0-262-03384-4.
7. Amarasinghe, Saman; Leiserson, Charles (2010). "6.172 Performance Engineering of Software Systems, Lecture 8"
(http://aka-ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-softw
are-systems-fall-2010/video-lectures/lecture-8-cache-efficient-algorithms/). MIT OpenCourseWare. Massachusetts
Institute of Technology. Retrieved 27 January 2015.
8. Lam, Monica S.; Rothberg, Edward E.; Wolf, Michael E. (1991). The Cache Performance and Optimizations of
Blocked Algorithms. ASPLOS91: 4th Int'l Conference on Architecture Support for Programming Languages &
Operating Systems. doi:10.1145/106972.106981 (https://doi.org/10.1145%2F106972.106981). ISBN 978-0-89791-
380-5.
9. Prokop, Harald (1999). Cache-Oblivious Algorithms (http://supertech.csail.mit.edu/papers/Prokop99.pdf) (PDF)
(Master's). MIT. hdl:1721.1/80568 (https://hdl.handle.net/1721.1%2F80568).
10. Miller, Webb (1975), "Computational complexity and numerical stability", SIAM News, 4 (2): 97–107,
CiteSeerX 10.1.1.148.9947 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.148.9947),
doi:10.1137/0204009 (https://doi.org/10.1137%2F0204009)
11. Press, William H.; Flannery, Brian P.; Teukolsky, Saul A.; Vetterling, William T. (2007). Numerical Recipes: The Art of
Scientific Computing (3rd ed.). Cambridge University Press. p. 108 (https://archive.org/details/numericalrecipes00pres
_033/page/n131). ISBN 978-0-521-88068-8.
12. Strassen, Volker (1969). "Gaussian Elimination is not Optimal". Numer. Math. 13 (4): 354–356.
doi:10.1007/BF02165411 (https://doi.org/10.1007%2FBF02165411). S2CID 121656251 (https://api.semanticscholar.o
rg/CorpusID:121656251).
13. Winograd, Shmuel (1971). "On multiplication of 2×2 matrices" (https://doi.org/10.1016%2F0024-3795%2871%299000
9-7). Linear Algebra and Its Applications. 4 (4): 381–388. doi:10.1016/0024-3795(71)90009-7 (https://doi.org/10.101
6%2F0024-3795%2871%2990009-7).
14. Karstadt, Elaye; Schwartz, Oded (July 2017). "Matrix Multiplication, a Little Faster" (https://dl.acm.org/doi/10.1145/308
7556.3087579). Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures. SPAA '17.
pp. 101–110. doi:10.1145/3087556.3087579 (https://doi.org/10.1145%2F3087556.3087579).
15. Schwartz, Oded; Vaknin, Noa (2023). "Pebbling Game and Alternative Basis for High Performance Matrix
Multiplication" (https://doi.org/10.1137/22M1502719). SIAM Journal on Scientific Computing. pp. C277–C303.
doi:10.1137/22M1502719 (https://doi.org/10.1137%2F22M1502719).
16. Probert, Robert L. (1976). "On the additive complexity of matrix multiplication". SIAM J. Comput. 5 (2): 187–203.
doi:10.1137/0205016 (https://doi.org/10.1137%2F0205016).
17. Beniamini, Gal; Cheng, Nathan; Holtz, Olga; Karstadt, Elaye; Schwartz, Oded (2020). "Sparsifying the Operators of
Fast Matrix Multiplication Algorithms". arXiv:2008.03759 (https://arxiv.org/abs/2008.03759) [cs.DS (https://arxiv.org/ar
chive/cs.DS)].
18. Coppersmith, Don; Winograd, Shmuel (1990), "Matrix multiplication via arithmetic progressions" (http://www.cs.umd.e
du/~gasarch/TOPICS/ramsey/matrixmult.pdf) (PDF), Journal of Symbolic Computation, 9 (3): 251,
doi:10.1016/S0747-7171(08)80013-2 (https://doi.org/10.1016%2FS0747-7171%2808%2980013-2)
19. Iliopoulos, Costas S. (1989), "Worst-case complexity bounds on algorithms for computing the canonical structure of
finite abelian groups and the Hermite and Smith normal forms of an integer matrix" (https://web.archive.org/web/2014
0305182030/http://www.williamstein.org/home/wstein/www/home/pernet/Papers/Hermite/Iliopoulos88.pdf) (PDF),
SIAM Journal on Computing, 18 (4): 658–669, CiteSeerX 10.1.1.531.9309 (https://citeseerx.ist.psu.edu/viewdoc/sum
mary?doi=10.1.1.531.9309), doi:10.1137/0218045 (https://doi.org/10.1137%2F0218045), MR 1004789 (https://maths
cinet.ams.org/mathscinet-getitem?mr=1004789), archived from the original (http://www.williamstein.org/home/wstein/
www/home/pernet/Papers/Hermite/Iliopoulos88.pdf) (PDF) on 2014-03-05, retrieved 2015-01-16, "The Coppersmith–
Winograd algorithm is not practical, due to the very large hidden constant in the upper bound on the number of
multiplications required."
20. Robinson, Sara (November 2005), "Toward an Optimal Algorithm for Matrix Multiplication" (https://archive.siam.org/pd
f/news/174.pdf) (PDF), SIAM News, 38 (9), "Even if someone manages to prove one of the conjectures—thereby
demonstrating that ω = 2—the wreath product approach is unlikely to be applicable to the large matrix problems that
arise in practice. [...] the input matrices must be astronomically large for the difference in time to be apparent."
21. "Discovering novel algorithms with AlphaTensor" (https://www.deepmind.com/blog/discovering-novel-algorithms-with-a
lphatensor). www.deepmind.com. 5 October 2022. Retrieved 2022-11-01.
22. Fawzi, Alhussein; Balog, Matej; Huang, Aja; Hubert, Thomas; Romera-Paredes, Bernardino; Barekatain,
Mohammadamin; Novikov, Alexander; R. Ruiz, Francisco J.; Schrittwieser, Julian; Swirszcz, Grzegorz; Silver, David;
Hassabis, Demis; Kohli, Pushmeet (October 2022). "Discovering faster matrix multiplication algorithms with
reinforcement learning" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9534758). Nature. 610 (7930): 47–53.
Bibcode:2022Natur.610...47F (https://ui.adsabs.harvard.edu/abs/2022Natur.610...47F). doi:10.1038/s41586-022-
05172-4 (https://doi.org/10.1038%2Fs41586-022-05172-4). ISSN 1476-4687 (https://search.worldcat.org/issn/1476-4
687). PMC 9534758 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9534758). PMID 36198780 (https://pubmed.ncbi.
nlm.nih.gov/36198780).
23. Kauers, Manuel; Moosbauer, Jakob (2022-12-02). "Flip Graphs for Matrix Multiplication". arXiv:2212.01175 (https://ar
xiv.org/abs/2212.01175) [cs.SC (https://arxiv.org/archive/cs.SC)].
24. Brubaker, Ben (23 November 2022). "AI Reveals New Possibilities in Matrix Multiplication" (https://www.quantamagazi
ne.org/ai-reveals-new-possibilities-in-matrix-multiplication-20221123/). Quanta Magazine. Retrieved 26 November
2022.
25. Randall, Keith H. (1998). Cilk: Efficient Multithreaded Computing (http://supertech.csail.mit.edu/papers/randall-phdthe
sis.pdf) (PDF) (Ph.D.). Massachusetts Institute of Technology. pp. 54–57. hdl:1721.1/47519 (https://hdl.handle.net/17
21.1%2F47519).
26. Cannon, Lynn Elliot (14 July 1969). A cellular computer to implement the Kalman Filter Algorithm (https://dl.acm.org/d
oi/abs/10.5555/905686) (Ph.D.). Montana State University.
27. Hong, J. W.; Kung, H. T. (1981). "I/O complexity: The red-blue pebble game" (https://apps.dtic.mil/dtic/tr/fulltext/u2/a10
4739.pdf) (PDF). Proceedings of the thirteenth annual ACM symposium on Theory of computing - STOC '81. pp. 326–
333. doi:10.1145/800076.802486 (https://doi.org/10.1145%2F800076.802486). S2CID 8410593 (https://api.semantics
cholar.org/CorpusID:8410593). Archived (https://web.archive.org/web/20191215182754/https://apps.dtic.mil/dtic/tr/fullt
ext/u2/a104739.pdf) (PDF) from the original on December 15, 2019.
28. Irony, Dror; Toledo, Sivan; Tiskin, Alexander (September 2004). "Communication lower bounds for distributed-memory
matrix multiplication". J. Parallel Distrib. Comput. 64 (9): 1017–26. CiteSeerX 10.1.1.20.7034 (https://citeseerx.ist.psu.
edu/viewdoc/summary?doi=10.1.1.20.7034). doi:10.1016/j.jpdc.2004.03.021 (https://doi.org/10.1016%2Fj.jpdc.2004.0
3.021).
29. Agarwal, R.C.; Balle, S. M.; Gustavson, F. G.; Joshi, M.; Palkar, P. (September 1995). "A three-dimensional approach
to parallel matrix multiplication". IBM J. Res. Dev. 39 (5): 575–582. CiteSeerX 10.1.1.44.3404 (https://citeseerx.ist.ps
u.edu/viewdoc/summary?doi=10.1.1.44.3404). doi:10.1147/rd.395.0575 (https://doi.org/10.1147%2Frd.395.0575).
30. Solomonik, Edgar; Demmel, James (2011). "Communication-optimal parallel 2.5D matrix multiplication and LU
factorization algorithms" (https://solomonik.cs.illinois.edu/talks/europar-sep-2011.pdf) (PDF). Proceedings of the 17th
International Conference on Parallel Processing. Vol. Part II. pp. 90–109. doi:10.1007/978-3-642-23397-5_10 (https://
doi.org/10.1007%2F978-3-642-23397-5_10). ISBN 978-3-642-23397-5.
31. Bosagh Zadeh, Reza; Carlsson, Gunnar (2013). "Dimension Independent Matrix Square Using MapReduce" (https://s
tanford.edu/~rezab/papers/dimsum.pdf) (PDF). arXiv:1304.1467 (https://arxiv.org/abs/1304.1467).
Bibcode:2013arXiv1304.1467B (https://ui.adsabs.harvard.edu/abs/2013arXiv1304.1467B). Retrieved 12 July 2014.
32. Bae, S.E.; Shinn, T.-W.; Takaoka, T. (2014). "A faster parallel algorithm for matrix multiplication on a mesh array" (http
s://www.sciencedirect.com/science/article/pii/S1877050914003858/pdf). Procedia Computer Science. 29: 2230–40.
doi:10.1016/j.procs.2014.05.208 (https://doi.org/10.1016%2Fj.procs.2014.05.208).
33. Kak, S (1988). "A two-layered mesh array for matrix multiplication". Parallel Computing. 6 (3): 383–5.
CiteSeerX 10.1.1.88.8527 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.88.8527). doi:10.1016/0167-
8191(88)90078-6 (https://doi.org/10.1016%2F0167-8191%2888%2990078-6).
34. Kak, S. (2014). "Efficiency of matrix multiplication on the cross-wired mesh array". arXiv:1411.3273 (https://arxiv.org/a
bs/1411.3273) [cs.DC (https://arxiv.org/archive/cs.DC)].
35. Kak, S (1988). "Multilayered array computing". Information Sciences. 45 (3): 347–365. CiteSeerX 10.1.1.90.4753 (http
s://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.90.4753). doi:10.1016/0020-0255(88)90010-2 (https://doi.org/1
0.1016%2F0020-0255%2888%2990010-2).

Further reading
Buttari, Alfredo; Langou, Julien; Kurzak, Jakub; Dongarra, Jack (2009). "A class of parallel tiled linear algebra
algorithms for multicore architectures". Parallel Computing. 35: 38–53. arXiv:0709.1272 (https://arxiv.org/abs/0709.12
72). doi:10.1016/j.parco.2008.10.002 (https://doi.org/10.1016%2Fj.parco.2008.10.002). S2CID 955 (https://api.seman
ticscholar.org/CorpusID:955).
Goto, Kazushige; van de Geijn, Robert A. (2008). "Anatomy of high-performance matrix multiplication". ACM
Transactions on Mathematical Software. 34 (3): 1–25. CiteSeerX 10.1.1.140.3583 (https://citeseerx.ist.psu.edu/viewd
oc/summary?doi=10.1.1.140.3583). doi:10.1145/1356052.1356053 (https://doi.org/10.1145%2F1356052.1356053).
S2CID 9359223 (https://api.semanticscholar.org/CorpusID:9359223).
Van Zee, Field G.; van de Geijn, Robert A. (2015). "BLIS: A Framework for Rapidly Instantiating BLAS Functionality".
ACM Transactions on Mathematical Software. 41 (3): 1–33. doi:10.1145/2764454 (https://doi.org/10.1145%2F276445
4). S2CID 1242360 (https://api.semanticscholar.org/CorpusID:1242360).
How To Optimize GEMM (https://github.com/flame/how-to-optimize-gemm)
