Optimize Your Code: Matrix Multiplication
Matrix multiplication is a common operation, and the algorithm is easy to implement. Here is one example:
Version 1:
template<typename T>
void SeqMatrixMult1(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
}
This implementation is straightforward, and you can find it in textbooks and many online samples.
Version 2:
template<typename T>
void SeqMatrixMult2(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[k][j];
            }
            result[i][j] = c;
        }
    }
}
This version uses a temporary variable to accumulate the intermediate result, saving many unnecessary memory writes. Note that the optimizer cannot make this transformation itself, because it doesn't know whether "result" is an alias of "m1" or "m2".
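If you can guarantee there is no aliasing, you can also say so explicitly. Here is a sketch; the __restrict qualifier is a compiler extension (MSVC, GCC, Clang), not standard C++, so treat it as an assumption about your toolchain:

```cpp
// Sketch of Version 2 with an explicit no-aliasing promise.
// __restrict tells the compiler that result never overlaps m1 or m2,
// which lets it keep the accumulator in a register on its own.
template<typename T>
void SeqMatrixMultRestrict(int size, T** __restrict m1, T** __restrict m2,
                           T** __restrict result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[k][j];
            }
            result[i][j] = c;
        }
    }
}
```

Passing aliased pointers to such a function is undefined behavior, so the promise must actually hold at every call site.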
Version 3:
template<typename T>
void Transpose(int size, T** m)
{
    for (int i = 0; i < size; i++) {
        for (int j = i + 1; j < size; j++) {
            std::swap(m[i][j], m[j][i]);
        }
    }
}

template<typename T>
void SeqMatrixMult3(int size, T** m1, T** m2, T** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[j][k];
            }
            result[i][j] = c;
        }
    }
    Transpose(size, m2);
}
This optimization is trickier. If you profile the function, you'll find a lot of data cache misses. We transpose m2 so that both m1[i] and m2[j] are accessed sequentially in the inner loop. This greatly improves memory read performance.
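Note that transposing m2 in place temporarily mutates the caller's matrix. A common alternative, sketched below (it is not one of the article's versions), is to reorder the loops to i-k-j so the inner loop walks both result[i] and m2[k] sequentially without any transpose:

```cpp
// Sketch: cache-friendly i-k-j loop order. The inner loop streams
// through result[i] and m2[k] contiguously, avoiding the strided
// m2[k][j] access of Version 1 without modifying m2.
template<typename T>
void SeqMatrixMultIKJ(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            result[i][j] = 0;
        }
        for (int k = 0; k < size; k++) {
            T a = m1[i][k]; // reused across the whole inner loop
            for (int j = 0; j < size; j++) {
                result[i][j] += a * m2[k][j];
            }
        }
    }
}
```

The trade-off is that result[i][j] is written size times instead of once, so whether this beats the transpose trick depends on the cache and the data type.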
Version 4:
template<typename T>
void SeqMatrixMult4(int size, T** m1, T** m2, T** result);

// assume size % 2 == 0
// assume m1[i] and m2[i] are 16-byte aligned
// require SSE3 (haddpd)
template<>
void SeqMatrixMult4(int size, double** m1, double** m2, double** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128d c = _mm_setzero_pd();
            for (int k = 0; k < size; k += 2) {
                c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(&m1[i][k]), _mm_load_pd(&m2[j][k])));
            }
            c = _mm_hadd_pd(c, c);
            _mm_store_sd(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}

// assume size % 4 == 0
// assume m1[i] and m2[i] are 16-byte aligned
// require SSE3 (haddps)
template<>
void SeqMatrixMult4(int size, float** m1, float** m2, float** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128 c = _mm_setzero_ps();
            for (int k = 0; k < size; k += 4) {
                c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(&m1[i][k]), _mm_load_ps(&m2[j][k])));
            }
            c = _mm_hadd_ps(c, c);
            c = _mm_hadd_ps(c, c);
            _mm_store_ss(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}
For floating-point types, we can use the SIMD instruction set to process the data in parallel.
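The horizontal adds at the end deserve a note: _mm_hadd_ps(a, b) produces [a0+a1, a2+a3, b0+b1, b2+b3], so applying it twice to the same register collapses the four partial sums in c into a single total (the double version needs only one _mm_hadd_pd, since there are only two lanes). Here is a scalar model of the float case, plain C++ with no intrinsics, just to illustrate the semantics:

```cpp
#include <array>

// Scalar model of _mm_hadd_ps: pairwise sums of a's lanes, then b's.
std::array<float, 4> hadd_ps(std::array<float, 4> a, std::array<float, 4> b)
{
    return { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
}

// Collapsing four partial sums the way SeqMatrixMult4<float> does.
float horizontal_sum(std::array<float, 4> c)
{
    c = hadd_ps(c, c); // { c0+c1, c2+c3, c0+c1, c2+c3 }
    c = hadd_ps(c, c); // every lane now holds c0+c1+c2+c3
    return c[0];       // _mm_store_ss then writes this single lane
}
```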
Parallel version using the PPL (Parallel Patterns Library) and lambdas in the VC2010 CTP:
template<typename T>
void ParMatrixMult1(int size, T** m1, T** m2, T** result)
{
    using namespace Concurrency;
    for (int i = 0; i < size; i++) {
        parallel_for(0, size, 1, [&](int j) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        });
    }
}
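PPL was specific to the VC2010 CTP. As a portable sketch (using C++11 std::thread, which postdates this article), the outer i loop can be split into disjoint row blocks, so no synchronization on result is needed:

```cpp
#include <thread>
#include <vector>

// Sketch: parallelize the outer loop across hardware threads.
// Thread t handles rows t, t+nt, t+2*nt, ... so each thread writes
// a disjoint set of rows of result and no locking is required.
template<typename T>
void ParMatrixMult2(int size, T** m1, T** m2, T** result)
{
    unsigned nt = std::thread::hardware_concurrency();
    if (nt == 0) nt = 2; // fall back if the runtime can't tell us
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nt; t++) {
        pool.emplace_back([=] {
            for (int i = (int)t; i < size; i += (int)nt) {
                for (int j = 0; j < size; j++) {
                    T c = 0;
                    for (int k = 0; k < size; k++) {
                        c += m1[i][k] * m2[k][j];
                    }
                    result[i][j] = c;
                }
            }
        });
    }
    for (auto& th : pool) th.join();
}
```

Parallelizing the outer loop rather than the inner one amortizes the per-task scheduling overhead over a whole row of work, which is the tuning point the results below illustrate.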
Results
Here are the test results (what really matters is the relative time between the different versions):
Matrix size = 500 (Intel Core 2 Duo T7250, 2 cores, L2 cache 2MB)
                | int       | long long | float     | double    |
Version 1       | 0.931119s | 2.945134s | 0.774894s | 0.984585s |
Version 2       | 0.571003s | 2.310568s | 0.724161s | 0.929064s |
Version 3       | 0.239538s | 0.823095s | 0.570772s | 0.241691s |
Version 4       | N/A       | N/A       | 0.063196s | 0.187614s |
Version 1 + PPL | 0.847534s | 1.683765s | 0.589513s | 0.994161s |
Version 2 + PPL | 0.380174s | 1.190713s | 0.409321s | 0.594859s |
Version 3 + PPL | 0.135760s | 0.495152s | 0.370499s | 0.185800s |
Version 4 + PPL | N/A       | N/A       | 0.041959s | 0.157932s |
Matrix size = 500 (Intel Xeon E5430, 4 cores, L2 cache 12MB)
                | int       | long long | float     | double    |
Version 1       | 0.514330s | 1.434509s | 0.455168s | 0.608127s |
Version 2       | 0.314554s | 1.231696s | 0.447607s | 0.593517s |
Version 3       | 0.180176s | 0.591002s | 0.432129s | 0.149511s |
Version 4       | N/A       | N/A       | 0.042900s | 0.083286s |
Version 1 + PPL | 0.308766s | 0.482934s | 0.175585s | 0.309159s |
Version 2 + PPL | 0.105717s | 0.325413s | 0.124862s | 0.164156s |
Version 3 + PPL | 0.073418s | 0.193824s | 0.116971s | 0.061268s |
Version 4 + PPL | N/A       | N/A       | 0.017891s | 0.031734s |
From the results, you can see that:
- Parallelism only helps if you carefully tune your code to maximize its effect (Version 1)
- Eliminating unnecessary memory writes (Version 2) also helps parallelism
- Data cache misses can be a big issue when there are many memory accesses (Version 3)
- Using SIMD instead of the FPU on aligned data is beneficial (Version 4)
- Different data types, data sizes and host architectures can have different bottlenecks
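If you want to reproduce relative numbers like these, a minimal timing harness might look like the following sketch. std::chrono and std::vector here are stand-ins of my choosing; the original measurements may have used different allocation and timers, and the SSE version would additionally need 16-byte-aligned rows:

```cpp
#include <chrono>
#include <vector>

// Version 2 kernel, repeated so this sketch is self-contained.
template<typename T>
void SeqMatrixMult2(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[k][j];
            }
            result[i][j] = c;
        }
    }
}

// Time one size x size multiplication of all-ones matrices, in seconds.
double TimeMultiplySeconds(int size)
{
    std::vector<std::vector<double>> a(size, std::vector<double>(size, 1.0));
    std::vector<std::vector<double>> b(size, std::vector<double>(size, 1.0));
    std::vector<std::vector<double>> r(size, std::vector<double>(size));
    // Build the T** row-pointer views the kernels expect.
    std::vector<double*> pa(size), pb(size), pr(size);
    for (int i = 0; i < size; i++) {
        pa[i] = a[i].data();
        pb[i] = b[i].data();
        pr[i] = r[i].data();
    }
    auto t0 = std::chrono::steady_clock::now();
    SeqMatrixMult2(size, pa.data(), pb.data(), pr.data());
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

For stable numbers you would run each kernel several times and take the minimum, and keep the matrix size well above the L2 cache size when studying the cache effects.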
Comments
Anonymous
December 19, 2012
For matrix multiplication AA^T + BB^T, and BA^T - AB^T, is there any optimization method? Here, A and B are m*n matrices, and T means matrix transpose. Thanks.
Anonymous
December 28, 2012
The result of A * Tran(A) is symmetric, so you save about 50% of the computation. Similarly, B * Tran(A) - A * Tran(B) = B * Tran(A) - Tran(B * Tran(A)), so only one matrix multiplication is needed.
Anonymous
August 07, 2014
Nice article, helped me a lot, thank you.
Anonymous
February 23, 2015
I did it in a different way, using OO, pointers and increment operations as follows:

// Optimized matrix multiplication R = A * B
const Matrix Matrix::operator*(const Matrix &other) const
{
    assert(this->colNum == other.rowNum);
    // initialize an empty matrix to return
    Matrix result(this->rowNum, other.colNum, 0.0);
    // initialize our flags
    int positionR = 0, positionX = -1, positionA = -1, positionB = -result.colNum;
    int rowsCount = 0, colsCount = 0;
    // result elements times iterations needed
    int limit = result.rowNum * result.colNum * this->colNum;
    for (int step = 0; step < limit; ++step) {
        // iteration counter
        positionX++;
        // if we reached a complete iteration
        if (positionX == this->colNum) {
            positionX = 0;
            positionR++;
            colsCount++;
            // avoid using the modulo operator
            if (colsCount == result.colNum) {
                colsCount = 0;
                rowsCount += result.colNum;
            }
            // redefine start points
            positionB = colsCount;
            positionA = rowsCount;
        } else {
            // move to the next positions
            positionA++;
            positionB += result.colNum;
        }
        // access the arrays in the order closest to how the machine runs
        result.matrixArray[positionR] += this->matrixArray[positionA] * other.matrixArray[positionB];
    }
    /*
    // TODO: optimize the matrix multiplication
    for (int positionY = 0; positionY < result.rowNum; ++positionY)
        for (int positionX = 0; positionX < this->colNum; ++positionX)
            for (int positionZ = 0; positionZ < other.rowNum; ++positionZ)
                result.element(positionY, positionX) += this->element(positionY, positionZ) * other.element(positionZ, positionX);
    */
    return result;
}

Maybe it could be better, without those minus initializations. But it's enough for me. =D
Anonymous
February 25, 2015
The comment has been removed
Anonymous
February 25, 2015
The comment has been removed
Anonymous
February 25, 2015
Change the following line: int colSize = this->colNum; to int colSize = other.colNum;. I achieved a multiplication MatrixA[1][3] x MatrixA[3][2], 1,000,000 times in 242 milliseconds, using an E7500 processor.
Anonymous
September 20, 2016
I followed most of the vectorization code, but can you please explain the use of these: c = _mm_hadd_pd(c, c); or c = _mm_hadd_ps(c, c); Thank you.
- Anonymous
October 06, 2016
The comment has been removed
- Anonymous
Anonymous
April 12, 2017
This article helped me a lot, thank you.