Optimize Your Code: Matrix Multiplication

Article
04/28/2009

Matrix multiplication is common and the algorithm is easy to implementation. Here is one example:

Version 1:

template<typename T>
void SeqMatrixMult1(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
}

This implementation is straight-forward and you can find it in text book and many online samples.

Version 2:

template<typename T>
void SeqMatrixMult2(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[k][j];
            }
            result[i][j] = c;
        }
    }
}

This version will use a temporary to store the intermediate result. So we can save a lot of unnecessary memory write. Notice that the optimizer can not help here because it doesn't know whether "result" is an alias of "m1" or "m2".

Version 3:

template<typename T>
void Transpose(int size, T** m)
{
    for (int i = 0; i < size; i++) {
        for (int j = i + 1; j < size; j++) {
            std::swap(m[i][j], m[j][i]);
        }
    }
}
template<typename T>
void SeqMatrixMult3(int size, T** m1, T** m2, T** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[j][k];
            }
            result[i][j] = c;
        }
    }
    Transpose(size, m2);
}

This optimization is tricky. If you profile the function, you'll find a lot of data cache miss. We transpose the matrix so that both m1[i] and m2[i] can be accessed sequentially. This can greatly improve the memory read performance.

Version 4:

template<typename T>
void SeqMatrixMult4(int size, T** m1, T** m2, T** result);
// assume size % 2 == 0
// assume m1[i] and m2[i] are 16-byte aligned
// require SSE3 (haddpd)
template<>
void SeqMatrixMult4(int size, double** m1, double** m2, double** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128d c = _mm_setzero_pd();

            for (int k = 0; k < size; k += 2) {
                c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(&m1[i][k]), _mm_load_pd(&m2[j][k])));
            }
            c = _mm_hadd_pd(c, c);
            _mm_store_sd(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}
// assume size % 4 == 0
// assume m1[i] and m2[i] are 16-byte aligned
// require SSE3 (haddps)
template<>
void SeqMatrixMult4(int size, float** m1, float** m2, float** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128 c = _mm_setzero_ps();

            for (int k = 0; k < size; k += 4) {
                c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(&m1[i][k]), _mm_load_ps(&m2[j][k])));
            }
            c = _mm_hadd_ps(c, c);
            c = _mm_hadd_ps(c, c);
            _mm_store_ss(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}

For float types, we can use SIMD instruction set to parallel the data processing.

Parallel version using PPL (Parallel Patterns Library) and lambda in VC2010 CTP:

template<typename T>
void ParMatrixMult1(int size, T** m1, T** m2, T** result)
{
    using namespace Concurrency;
    for (int i = 0; i < size; i++) {
        parallel_for(0, size, 1, [&](int j) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        });
    }
}

Result

Here are the test results (what really matters is the relative time between different version):

Matrix size = 500 (Intel Core 2 Duo T7250, 2 cores, L2 cache 2MB)

	int	long long	float	double
Version 1	0.931119s	2.945134s	0.774894s	0.984585s
Version 2	0.571003s	2.310568s	0.724161s	0.929064s
Version 3	0.239538s	0.823095s	0.570772s	0.241691s
Version 4	N/A	N/A	0.063196s	0.187614s
Version 1 + PPL	0.847534s	1.683765s	0.589513s	0.994161s
Version 2 + PPL	0.380174s	1.190713s	0.409321s	0.594859s
Version 3 + PPL	0.135760s	0.495152s	0.370499s	0.185800s
Version 4 + PPL	N/A	N/A	0.041959s	0.157932s

Matrix size = 500 (Intel Xeon E5430, 4 cores, L2 cache 12MB)

	int	long long	float	double
Version 1	0.514330s	1.434509s	0.455168s	0.608127s
Version 2	0.314554s	1.231696s	0.447607s	0.593517s
Version 3	0.180176s	0.591002s	0.432129s	0.149511s
Version 4	N/A	N/A	0.042900s	0.083286s
Version 1 + PPL	0.308766s	0.482934s	0.175585s	0.309159s
Version 2 + PPL	0.105717s	0.325413s	0.124862s	0.164156s
Version 3 + PPL	0.073418s	0.193824s	0.116971s	0.061268s
Version 4 + PPL	N/A	N/A	0.017891s	0.031734s

From the results, you can find that:

Parallelism only helps if you carefully tune your code to maximize its effect (Version 1)
Eliminating unnecessary memory write (Version 2) helps the parallelism
Data cache miss can be a big issue when there are lots of memory access (Version 3)
Using SIMD instead of FPU on aligned data is beneficial (Version 4)
Different data types, data sizes and host architectures may have different kinds of bottlenecks

Comments

Anonymous
April 28, 2009
PingBack from http://microsoft-sharepoint.simplynetdev.com/optimize-your-code-matrix-multiplication/
Anonymous
December 19, 2012
For matrix multiplication AA^T + BB^T, and BA^T - AB^T, is there any optimization method? here, A and B is m*n matrix, T means matrix transpose Thanks.
Anonymous
December 28, 2012
The result of A * Tran(A) is symmetric, so you save about 50% of the computation. Similarly, B * Tran(A) - A * Tran(B) = B * Tran(A) - Tran(B * Tran(A)), so only one matrix multiplication is needed.
Anonymous
August 07, 2014
nice article, help me a lot, thankyou
Anonymous
February 23, 2015
I did in a different way, using OO, pointers and increment operations as follows ahead: // Optimized matrix multiplication R = A * B const Matrix Matrix::operator*(const Matrix &other) const { assert (this->colNum == other.rowNum); // initialize a empty matrix to return Matrix result(this->rowNum, other.colNum, 0.0); // initialize our flags int positionR = 0, positionX = -1, positionA = -1, positionB = -result.colNum; int rowsCount = 0, colsCount = 0; // result elements times iterations needed int limit = result.rowNum * result.colNum * this->colNum; for(int step = 0; step < limit; ++step) { // iteration counter positionX++; // if we reached a complete iteration if(positionX == this->colNum) { positionX = 0; positionR++; colsCount++; // avoid to use modulo operator if(colsCount == result.colNum) { colsCount = 0; rowsCount += result.colNum; } // redefine starts point positionB = colsCount; positionA = rowsCount; } else { // move to forward positions positionA++; positionB += result.colNum; } //access an array using the closest machine run style result.matrixArray[positionR] += this->matrixArray[positionA] * other.matrixArray[positionB]; } /* // TODO otimizar a multiplicação de matrizes for(int positionY = 0; positionY < result.rowNum; ++positionY) for(int positionX = 0; positionX < this->colNum; ++positionX) for(int positionZ = 0; positionZ < other.rowNum; ++positionZ) result.element(positionY, positionX) += this->element(positionY, positionZ) * other.element(positionZ, positionX); */ return result; } Maybe it could be better, without thouse minus initializations. But its enough for me. =D
Anonymous
February 25, 2015
The comment has been removed
Anonymous
February 25, 2015
The comment has been removed
Anonymous
February 25, 2015
change the follow lines: int colSize = this->colNum; by int colSize = other.colNum; i achieved a multiplication by MatrixA[1][3] x MatrixA[3][2], 1.000.000 times in 242 milliseconds, using the E7500 processor.
Anonymous
September 20, 2016
I followed most of the vectorization code, but can you please explain use of these:c = _mm_hadd_pd(c, c); OR c = _mm_hadd_ps(c, c);Thank you.
- Anonymous
  October 06, 2016
  The comment has been removed
Anonymous
April 12, 2017
this article helps me a lot,thank you

Share via

Optimize Your Code: Matrix Multiplication

Comments

Additional resources