# Matrix Multiplication

Sunday, March 30, 2014

In this notebook, we'll be using Julia to investigate the efficiency of matrix multiplication algorithms. In practice, you don't want to use anything presented here: you should instead use the heavily optimized algorithms provided by BLAS. Not only are the BLAS implementations likely to be more carefully written, but they are also optimized at a very low level. Cache- and processor-aware optimizations, for instance, can make a great deal of difference when working with large matrices!

With that disclaimer out of the way, let's get started. First off, let's import Gadfly and DataFrames for plotting.

In [1]:
using Gadfly
using DataFrames


## Standard Matrix Multiplication

The first algorithm we'll implement is straightforward matrix multiplication, like you learned in high school.

If $C = AB$ is the product of matrices $A$ and $B$, then $C_{ij}$ is the dot product of the $i$th row of $A$ with the $j$th column of $B$; that is, $C_{ij} = \sum_k A_{ik} B_{kj}$. A trivial implementation follows.

In [1]:
# Standard matrix multiplication algorithm.
function mult{T}(x :: Matrix{T}, y :: Matrix{T})
    # Check that the sizes of these matrices match.
    (r1, c1) = size(x)
    (r2, c2) = size(y)
    if c1 != r2
        error("Multiplying $r1 x $c1 and $r2 x $c2 matrix: dimensions do not match.")
    end

    # Get the value of the (i, j)th cell by taking the dot product
    # of the ith row of the first matrix and the jth column of the second.
    at(i, j) = (x[i, :] * y[:, j])[1]
    T[at(i, j) for i = 1:r1, j = 1:c2]
end;


The implementation above has a very small problem. Looking at the line

at(i, j) = (x[i, :] * y[:, j])[1]


we can see that for every element of the output, our function computes an intermediate $1 \times 1$ matrix and then extracts its single value. Instead, we can simply use an accumulator and avoid the memory allocation, as follows:

In [1]:
# Standard matrix multiplication algorithm.
function mult{T}(x :: Matrix{T}, y :: Matrix{T})
    # Check that the sizes of these matrices match.
    (r1, c1) = size(x)
    (r2, c2) = size(y)
    if c1 != r2
        error("Multiplying $r1 x $c1 and $r2 x $c2 matrix: dimensions do not match.")
    end

    # Get the value of the (i, j)th cell by accumulating the dot product
    # of the ith row of the first matrix and the jth column of the second.
    function at(i, j)
        accum = x[i, 1] * y[1, j]
        for k = 2:c1
            accum += x[i, k] * y[k, j]
        end
        accum
    end
    T[at(i, j) for i = 1:r1, j = 1:c2]
end;


## Strassen's Algorithm

A practical alternative (for larger matrices) to the standard $O(n^3)$ algorithm we implemented above is Strassen's Algorithm. Since this is fairly well described on Wikipedia, I won't repeat the description here.
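One fact worth recalling, though, is where the speedup comes from: each recursive step trades one multiplication of $n \times n$ matrices for seven multiplications of $\frac{n}{2} \times \frac{n}{2}$ matrices plus a constant number of $O(n^2)$ matrix additions and subtractions, giving the recurrence

$$T(n) = 7\, T\!\left(\frac{n}{2}\right) + O(n^2),$$

which the master theorem solves to $T(n) = O(n^{\log_2 7}) \approx O(n^{2.81})$.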

First, we'll implement a naive version of Strassen's algorithm, in which we multiply matrices by putting them into the top left of a $2^N \times 2^N$ matrix. This is inadequate in practice, as we allocate tons of extra memory, and multiplying a $513\times 513$ matrix takes as much time as a $1024\times 1024$ matrix. In order to make Strassen's algorithm practical, we resort to standard matrix multiplication for small matrices.

In [1]:
# Strassen's matrix multiplication algorithm.
#
# Given two matrices A and B, start by padding them to be the same size, where
# the number of rows and columns is a power of two. Then, subdivide the
# matrices into 2x2 block matrices, and use a faster algorithm for multiplying
# 2x2 matrices. The 2x2 multiplication algorithm uses more additions and
# subtractions in order to reduce the number of multiplications from 8 to 7.
# It uses seven intermediary values, each of which requires only one
# multiplication; then, it combines those to get the blocks of the output
# matrix. At each divide step, the size of the matrix is halved, but
# seven multiplications are required. As a result, the algorithm is O(n^lg 7),
# which is faster than the naive O(n^3).
function strassen_naive{T}(x :: Matrix{T}, y :: Matrix{T})
    # Check that the sizes of these matrices match.
    (r1, c1) = size(x)
    (r2, c2) = size(y)
    if c1 != r2
        error("Multiplying $r1 x $c1 and $r2 x $c2 matrix: dimensions do not match.")
    end

    # Pad the matrices with zeros to make their dimensions a power of two.
    pow2(x) = 2 ^ int(ceil(log(2, x)))
    width = maximum(map(pow2, [r1, r2, c1, c2]))
    x = [x zeros(T, r1, width - c1); zeros(T, width - r1, width)]
    y = [y zeros(T, r2, width - c2); zeros(T, width - r2, width)]

    function block_mult(a, b, width)
        # For small matrices, resort to naive multiplication.
        half = int(width / 2)
        if half <= 128
            return mult(a, b)
        end

        # Subdivide input matrices.
        a11 = a[1:half, 1:half];             b11 = b[1:half, 1:half]
        a12 = a[1:half, half+1:width];       b12 = b[1:half, half+1:width]
        a21 = a[half+1:width, 1:half];       b21 = b[half+1:width, 1:half]
        a22 = a[half+1:width, half+1:width]; b22 = b[half+1:width, half+1:width]

        # Compute intermediate values.
        multip(x, y) = block_mult(x, y, half)
        m1 = multip(a11 + a22, b11 + b22)
        m2 = multip(a21 + a22, b11)
        m3 = multip(a11, b12 - b22)
        m4 = multip(a22, b21 - b11)
        m5 = multip(a11 + a12, b22)
        m6 = multip(a21 - a11, b11 + b12)
        m7 = multip(a12 - a22, b21 + b22)

        # Combine intermediate values into the output.
        c11 = m1 + m4 - m5 + m7
        c12 = m3 + m5
        c21 = m2 + m4
        c22 = m1 - m2 + m3 + m6

        [c11 c12; c21 c22]
    end

    # Compute the padded product, then cut out the piece of the matrix that we want.
    block_mult(x, y, width)[1:r1, 1:c2]
end;


We can tweak Strassen's algorithm to make it significantly more efficient. The main issue at this point is the fact that we require our matrices to have dimensions that are a power of two, so that we can apply our divide and conquer on quarters of the matrix.

However, what we actually need is that at each step our matrix can be divided into four evenly sized blocks. Instead of changing the size at the beginning, we can instead maintain this invariant at each recursive step. There are two approaches to this, one known as dynamic padding and one known as dynamic peeling.

In order to use dynamic padding, we have several cases in our recursive step. If our matrix already has an even number of rows and columns, we do not need to do anything, as we can simply split it into four blocks. However, if the number of rows or the number of columns is odd, we simply add another row or column (or both, if needed). We initialize this row or column to zeros, and perform the multiplication as normal.

Once we have multiplied these matrices, we remove the extra row or column, and return our result. The result is guaranteed to be the same as if we had multiplied the original matrices directly, since the padding contributes only zeros.
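To see why this guarantee holds, here is a small illustrative check, written in Python with NumPy rather than Julia so it stands alone (`padded_mult` is a hypothetical helper for this demonstration, not part of the notebook's code): padding odd dimensions with a zero row or column, multiplying, and cropping gives exactly the ordinary product.

```python
import numpy as np

def padded_mult(A, B):
    """Pad odd dimensions with a zero row/column, multiply, then crop.

    The zero padding contributes nothing to the product, so the cropped
    result equals the ordinary matrix product A @ B.
    """
    m, k = A.shape
    _, n = B.shape
    # Pad each odd dimension by one (np.pad defaults to zero-filled constant padding).
    Ap = np.pad(A, ((0, m % 2), (0, k % 2)))
    Bp = np.pad(B, ((0, k % 2), (0, n % 2)))
    # Multiply the even-sized matrices, then undo the padding.
    return (Ap @ Bp)[:m, :n]
```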

Using dynamic padding keeps your logic fairly simple, while avoiding huge memory allocations and large increases in the size of your matrix. The implementation below uses dynamic padding.

### Dynamic Peeling

An alternative to dynamic padding exists, known as dynamic peeling. Instead of adding a row or column to make your matrix even-sized, you instead remove one. After performing the remaining multiplication, you account separately for the removed row or column. This requires a little bit of extra arithmetic, so the code ends up slightly more complex, and I am not sure whether it has any performance benefits. You can take a more in-depth look on page six of this paper.
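Since the implementation below uses padding rather than peeling, here is a rough sketch of the peeling bookkeeping, written in Python with NumPy for illustration (`peeled_mult` and its `multiply` parameter are names I've made up, not taken from the paper): each odd dimension is peeled off, its contribution to the product is added back as a rank-1 update or a vector product, and the remaining even-by-even block goes to the fast recursive multiply.

```python
import numpy as np

def peeled_mult(A, B, multiply):
    """Multiply A (m x k) by B (k x n), peeling off odd dimensions so that
    `multiply` (e.g. a Strassen-style recursion) only sees even-sized inputs."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "dimensions do not match"

    C = np.zeros((m, n))

    # Odd inner dimension: the last column of A times the last row of B
    # contributes a rank-1 update to the whole product.
    if k % 2 == 1:
        C += np.outer(A[:, -1], B[-1, :])
        A, B = A[:, :-1], B[:-1, :]
        k -= 1

    # Odd row count: the last row of C is a vector-matrix product.
    if m % 2 == 1:
        C[-1, :] += A[-1, :] @ B
        A = A[:-1, :]
        m -= 1

    # Odd column count: the last column of C (above the peeled row, which is
    # already complete) is a matrix-vector product.
    if n % 2 == 1:
        C[:m, -1] += A @ B[:, -1]
        B = B[:, :-1]
        n -= 1

    # The remaining even-by-even block is handled by the fast multiply.
    C[:m, :n] += multiply(A, B)
    return C
```

Note that each peeled strip is handled with ordinary $O(n^2)$ arithmetic, which does not affect the overall $O(n^{\lg 7})$ bound.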

An implementation of Strassen's algorithm using dynamic padding follows.

In [1]:
# Strassen's matrix multiplication algorithm.
#
# Use dynamic padding in order to reduce required auxiliary memory.
function strassen{T}(x :: Matrix{T}, y :: Matrix{T})
    # Check that the sizes of these matrices match.
    (r1, c1) = size(x)
    (r2, c2) = size(y)
    if c1 != r2
        error("Multiplying $r1 x $c1 and $r2 x $c2 matrix: dimensions do not match.")
    end

    # Put a matrix into the top left of a matrix of zeros.
    # rows and cols are the dimensions of the output matrix.
    function embed(mat, rows, cols)
        # If the matrix already has the right dimensions, don't allocate new memory.
        (r, c) = size(mat)
        if (r, c) == (rows, cols)
            return mat
        end

        # Pad the matrix with zeros to be the right size.
        out = zeros(T, rows, cols)
        out[1:r, 1:c] = mat
        out
    end

    # Make both matrices square and the same size.
    # This is exclusively for simplicity:
    # the algorithm can be implemented with rectangular blocks as well.
    n = max(r1, c1, c2)
    x = embed(x, n, n)
    y = embed(y, n, n)

    # Our recursive multiplication function.
    function block_mult(a, b, rows, cols)
        # For small matrices, resort to naive multiplication.
        if rows <= 128 || cols <= 128
            return mult(a, b)
        end

        # Pad with an extra row and/or column if a dimension is odd,
        # so the matrices can be split into four evenly sized blocks.
        if rows % 2 == 1 && cols % 2 == 1
            a = embed(a, rows + 1, cols + 1)
            b = embed(b, rows + 1, cols + 1)
        elseif rows % 2 == 1
            a = embed(a, rows + 1, cols)
            b = embed(b, rows + 1, cols)
        elseif cols % 2 == 1
            a = embed(a, rows, cols + 1)
            b = embed(b, rows, cols + 1)
        end

        half_rows = int(size(a, 1) / 2)
        half_cols = int(size(a, 2) / 2)

        # Subdivide input matrices.
        a11 = a[1:half_rows, 1:half_cols]
        b11 = b[1:half_rows, 1:half_cols]

        a12 = a[1:half_rows, half_cols+1:size(a, 2)]
        b12 = b[1:half_rows, half_cols+1:size(b, 2)]

        a21 = a[half_rows+1:size(a, 1), 1:half_cols]
        b21 = b[half_rows+1:size(b, 1), 1:half_cols]

        a22 = a[half_rows+1:size(a, 1), half_cols+1:size(a, 2)]
        b22 = b[half_rows+1:size(b, 1), half_cols+1:size(b, 2)]

        # Compute intermediate values.
        multip(x, y) = block_mult(x, y, half_rows, half_cols)
        m1 = multip(a11 + a22, b11 + b22)
        m2 = multip(a21 + a22, b11)
        m3 = multip(a11, b12 - b22)
        m4 = multip(a22, b21 - b11)
        m5 = multip(a11 + a12, b22)
        m6 = multip(a21 - a11, b11 + b12)
        m7 = multip(a12 - a22, b21 + b22)

        # Combine intermediate values into the output.
        c11 = m1 + m4 - m5 + m7
        c12 = m3 + m5
        c21 = m2 + m4
        c22 = m1 - m2 + m3 + m6

        # Crop output to the desired size (undo dynamic padding).
        out = [c11 c12; c21 c22]
        out[1:rows, 1:cols]
    end

    # Crop the final product to the expected output dimensions.
    block_mult(x, y, n, n)[1:r1, 1:c2]
end;


## Benchmarking

Now that we have these algorithms, let's test them with a simple benchmark! We'll generate random matrices of varying sizes and then see how quickly we can multiply them.

In [1]:
# Generate many random matrices of varying sizes.
sizes = [30:30:1000]
matrices = [(rand(x, x), rand(x, x)) for x = sizes]

# Time how long naive multiplication and Strassen's algorithm take.
naive_times = [@elapsed mult(x, y) for (x, y) = matrices];
strassen_times = [@elapsed strassen(x, y) for (x, y) = matrices];

# Also, let's see how fast Julia's native (BLAS-backed) multiplication is in comparison.
julia_times = [@elapsed x * y for (x, y) = matrices];


We can stick these things in a DataFrame for plotting and display.

In [2]:
values = DataFrame(Size=sizes, Naive=naive_times, Strassen=strassen_times, Julia=julia_times)
values[end-10:end, :]

Out[2]:
|    | Size | Naive       | Strassen    | Julia       |
|----|------|-------------|-------------|-------------|
| 1  | 690  | 1.853019646 | 1.208034446 | 0.035775962 |
| 2  | 720  | 1.64942076  | 1.288250186 | 0.04968152  |
| 3  | 750  | 2.236000348 | 1.559051264 | 0.061651594 |
| 4  | 780  | 3.82156192  | 1.760426237 | 0.063690848 |
| 5  | 810  | 3.138119087 | 1.810920128 | 0.07922887  |
| 6  | 840  | 4.167749424 | 2.047465258 | 0.078988685 |
| 7  | 870  | 3.469676545 | 2.176476708 | 0.09821168  |
| 8  | 900  | 3.363566556 | 2.35640886  | 0.085557528 |
| 9  | 930  | 4.458566499 | 2.800157504 | 0.100970208 |
| 10 | 960  | 4.507132769 | 2.941516055 | 0.101309561 |
| 11 | 990  | 6.650459277 | 3.220578379 | 0.130329747 |
It's fairly easy to see that native BLAS multiplication is orders of magnitude faster than both of our methods, but Strassen's algorithm does significantly outperform naive matrix multiplication for matrices larger than around $600 \times 600$. Finally, let's take a look at this graphically!

In [3]:
naive_plot    = layer(Geom.point, x=:Size, y=:Naive,    Theme(default_color=color("red")))
strassen_plot = layer(Geom.point, x=:Size, y=:Strassen, Theme(default_color=color("blue")))
julia_plot    = layer(Geom.point, x=:Size, y=:Julia,    Theme(default_color=color("green")))
display(plot(values, naive_plot, strassen_plot, julia_plot));