逐步解說：矩陣乘法

發行項
03/01/2013

這個逐步解說會示範如何使用 C++ AMP 加速矩陣乘法的執行。會出現兩種演算法，一個不並排顯示，另一個並排顯示。

必要條件

開始之前：

讀取 C++ AMP 概觀。
讀取使用磚。
請確定你已經在電腦上安裝了Windows 7， Windows 8， Windows Server 2008 R2，或Windows Server 2012。

若要建立專案

在 Visual Studio的功能表列，選擇 [ 檔案]，新增，專案。
在已安裝 在 [範本] 窗格中，選取 Visual C++。
選取 空專案，輸入 MatrixMultiply 在名稱方塊，然後再選擇確定按鈕。
選擇下一步 ] 按鈕。
在方案總管中，開啟的快顯功能表 原始程式檔，然後選擇新增， 新的項目。
在 [增加新的項目] 對話框中，選擇 [C++ File (.cpp)] ，輸入 MatrixMultiply.cpp於 [檔名] 方塊中，然後選擇 [新增] 的按鈕。

無 tiling 的乘法

在本節中，會考慮兩個矩陣a 和 b的相乘，會以下列方式定義：

3x2 矩陣 2x3 矩陣

A 是 3-2 矩陣，而 b 則 2-3 矩陣。a，b相乘的乘積為以下的 3 x 3 矩陣。乘積的計算方式是逐項將A的資料列乘以B的資料行。

3x3 矩陣

要相乘而不使用 C++ AMP

開啟 MatrixMultiply.cpp 然後使用下列程式碼來取代現有的程式碼。

#include <iostream>

void MultiplyWithOutAMP() {

    int aMatrix[3][2] = {{1, 4}, {2, 5}, {3, 6}};
    int bMatrix[2][3] = {{7, 8, 9}, {10, 11, 12}};
    int product[3][3] = {{0, 0, 0}, {0, 0, 0}, {0, 0, 0}};

    for (int row = 0; row < 3; row++) {
        for (int col = 0; col < 3; col++) {
            // Multiply the row of A by the column of B to get the row, column of product.
            for (int inner = 0; inner < 2; inner++) {
                product[row][col] += aMatrix[row][inner] * bMatrix[inner][col];
            }
            std::cout << product[row][col] << "  ";
        }
        std::cout << "\n";
    }
}

void main() {
    MultiplyWithOutAMP();
    getchar();
}

這個演算法直接實作了矩陣乘法的定義。其並未使用任何平行或是多執行緒演算法來減少運算時間。

在功能表列上，選擇檔案， 全部儲存。
選取 F5 鍵盤快速鍵啟動偵錯和驗證輸出是正確的。
選擇Enter結束應用程式。

若要使用 C++ AMP 值相乘

在 MatrixMultiply.cpp 中，加入下列程式碼在main方法之前。

void MultiplyWithAMP() {
    int aMatrix[] = { 1, 4, 2, 5, 3, 6 };
    int bMatrix[] = { 7, 8, 9, 10, 11, 12 };
    int productMatrix[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0 };

    array_view<int, 2> a(3, 2, aMatrix);
    array_view<int, 2> b(2, 3, bMatrix);
    array_view<int, 2> product(3, 3, productMatrix);

    parallel_for_each(
        product.extent, 
         [=](index<2> idx) restrict(amp) {
            int row = idx[0];
            int col = idx[1];
            for (int inner = 0; inner < 2; inner++) {
                product[idx] += a(row, inner) * b(inner, col);
            }
        }
    );

    product.synchronize();

    for (int row = 0; row < 3; row++) {
        for (int col = 0; col < 3; col++) {
            //std::cout << productMatrix[row*3 + col] << "  ";
            std::cout << product(row, col) << "  ";
        }
        std::cout << "\n";
    }
}

放大器程式碼類似於非 AMP 程式碼。呼叫parallel_for_each啟動每個項目中的一個執行緒product.extent，並取代for迴圈的列和資料行。位於資料列和資料行的儲存格的值適用於idx。您可以存取array_view物件的項目，藉由使用[]運算子和索引變數，或()運算子和資料列和欄的變數。此範例會示範這兩種方法。array_view::synchronize方法會複製product的值再傳回給productMatrix變數。

將下列加入include和using陳述式在MatrixMultiply.cpp 頂端。
```
#include <amp.h>
using namespace concurrency;
```

修改main方法以呼叫MultiplyWithAMP方法。

void main() {
    MultiplyWithOutAMP();
    MultiplyWithAMP();
    getchar();
}

選擇啟動偵錯的 Ctrl + F5 鍵盤快速鍵，請檢查輸出正確。
選擇 [空格鍵來結束應用程式]。

乘法與並排顯示

並排顯示是您將資料分割成大小相等的子集的一種技術，廣泛稱為tiles。三個項目變更當您使用並排顯示。

您可以建立tile_static變數。在tile_static空間的資料的存取權可以比全域的空間中的資料存取快好幾倍。tile_static變數的一個執行個體會為每個堆疊建立，且並排顯示中的所有執行緒都可以存取該變數。並排顯示的主要優點是受限於tile_static存取的效能改善比率。
您可以呼叫 tile_barrier::wait 方法即可停止所有的執行緒在指定的一行程式碼的一個方塊中。您不能保證執行緒再次執行的順序，只能讓一個方塊中的所有執行緒都停止，當呼叫tile_barrier::wait，在它們繼續執行之前。
您可以存取相對於整個array_view物件的執行緒的索引和存取相對於並排顯示的索引。藉由使用本機的索引，您可以讓您的程式碼容易閱讀及偵錯。

若要充分運用並排顯示的矩陣相乘的演算，此演算法必須將矩陣分割成並排顯示，然後將並排顯示資料複製到tile_static變數，以利快速存取。在這個範例中，矩陣會分割成相同大小的子矩陣。乘積可以藉由將子矩陣相乘而得到。兩個矩陣和它們的乘積，在這個範例是：

4x4 矩陣

矩陣可分成四個 2x2 矩陣，定義如下：

分割為 2x2 子矩陣的 4x4 矩陣

A 和 B 的乘積可以立即寫入和計算出來，方式如下：

分割為 2x2 子矩陣的 4x4 矩陣

因為矩陣a到h是 2x2 矩陣，所有他們的乘積和加總也是 2x2 矩陣。它還遵守 A * B 是一個 4x4 矩陣，如預期般運作。若要快速檢查演算法，計算第一列中的項目，乘積中的第一個資料行的值。在範例中，那會是ae + bg的第一個資料列和第一欄中的值。您只需要計算aebg的第一個資料行、第一列的每個項目。ae 的值為1*1 + 2*5 = 11.bg 的值為 3*1 + 4*5 = 23。最終的值會是11 + 23 = 34，這完全正確。

若要實作這種演算法，程式碼：

使用tiled_extent物件，而不是extent物件在parallel_for_each呼叫。
使用tiled_index物件，而不是index物件在parallel_for_each呼叫。
會建立tile_static變數來保存子矩陣。
使用tile_barrier::wait方法即可停止執行緒計算子矩陣的乘積。

若要做相乘使用放大器和並排顯示

在 MatrixMultiply.cpp 中，加入下列程式碼在main方法之前。

void MultiplyWithTiling()
{
    // The tile size is 2.
    static const int TS = 2;

    // The raw data.
    int aMatrix[] =       { 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8 };
    int bMatrix[] =       { 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8 };
    int productMatrix[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };

    // Create the array_view objects.
    array_view<int, 2> a(4, 4, aMatrix);
    array_view<int, 2> b(4, 4, bMatrix);
    array_view<int, 2> product(4, 4, productMatrix);

    // Call parallel_for_each by using  2x2 tiles.
    parallel_for_each(product.extent.tile< TS, TS >(),
        [=] (tiled_index< TS, TS> t_idx) restrict(amp) 
        {
            // Get the location of the thread relative to the tile (row, col) and the entire array_view (rowGlobal, colGlobal).
            int row = t_idx.local[0]; 
            int col = t_idx.local[1];
            int rowGlobal = t_idx.global[0];
            int colGlobal = t_idx.global[1];
            int sum = 0;

            // Given a 4x4 matrix and a 2x2 tile size, this loop executes twice for each thread.
            // For the first tile and the first loop, it copies a into locA and e into locB.
            // For the first tile and the second loop, it copies b into locA and g into locB.
            for (int i = 0; i < 4; i += TS) {
                tile_static int locA[TS][TS];
                tile_static int locB[TS][TS];
                locA[row][col] = a(rowGlobal, col + i);
                locB[row][col] = b(row + i, colGlobal);
                // The threads in the tile all wait here until locA and locB are filled.
                t_idx.barrier.wait();


                // Return the product for the thread. The sum is retained across
                // both iterations of the loop, in effect adding the two products
                // together, for example, a*e.
                for (int k = 0; k < TS; k++) {
                    sum += locA[row][k] * locB[k][col];
                }

                // All threads must wait until the sums are calculated. If any threads
                // moved ahead, the values in locA and locB would change.      
                t_idx.barrier.wait();
                // Now go on to the next iteration of the loop.          
            }

            // After both iterations of the loop, copy the sum to the product variable by using the global location.
            product[t_idx.global] = sum;
    });

        // Copy the contents of product back to the productMatrix variable.
        product.synchronize();

        for (int row = 0; row < 4; row++) {
        for (int col = 0; col < 4; col++) {
            // The results are available from both the product and productMatrix variables.
            //std::cout << productMatrix[row*3 + col] << "  ";
            std::cout << product(row, col) << "  ";
        }
        std::cout << "\n";
    }

}

這個範例和不並排顯示的範例有顯著的差異。程式碼會使用這些概念的步驟執行：

將a的並排顯示 [0，0] 的項目複製到locA。將b的並排顯示 [0，0] 的項目複製到locB。請注意， product會並排顯示，不是a和b。因此，您使用通用的索引來存取a, b，以及product。tile_barrier::wait的呼叫是很重要的。它就會停止並排顯示中所有的執行緒，直到兩者locA和locB會填滿。
將locA和locB 相乘，並將結果放product。
將a的並排顯示 [0，1] 的項目複製到locA。將b的並排顯示 [1，0] 的項目複製到locB。
將locA和locB 相乘，並將其新增至已存在product中的結果。
完成並排顯示 [0，0]的乘法運算。
為其他四個方塊重複。沒有特別為並排顯示的索引，而執行緒可以以任何順序執行。當每個執行緒執行時， tile_static變數適當地為每個並排顯示建立，tile_barrier::wait的呼叫控制程式流向。
當您仔細地檢視演算法，請注意每個子矩陣會被載入至tile_static記憶體兩次。該傳輸的資料的確花時間。然而，一旦資料位於tile_static記憶體，存取資料變得更快。因為計算產品需要重複的存取子矩陣中的值，整體效能比率沒有改善。每種演算法，需要實驗才能找出最佳的演算法和並排顯示大小。

在非放大器和非並排顯示範例中，每個 a 和 b 的元素會被從全域記憶體存取四次來計算乘積。在並排顯示的範例中，每個項目被從全域記憶體存取兩次和從 tile_static的記憶體存取四次。這並不是顯著的效能改善比率。不過，如果 A 和 B 是 1024x1024 矩陣而且並排顯示大小是 16，會有顯著的效能改善比率。在此情況下，每個項目會只被複製到tile_static記憶體16 次，而且從tile_static記憶體被存取 1024次。

若要修改主方法來呼叫 MultiplyWithTiling ，如下所示。

void main() {
    MultiplyWithOutAMP();
    MultiplyWithAMP();
    MultiplyWithTiling();
    getchar();
}

選擇啟動偵錯的 Ctrl + F5 鍵盤快速鍵，請檢查輸出正確。
選擇 [空格鍵以結束應用程式]。

請參閱

工作

逐步解說：偵錯 C++ AMP 應用程式

其他資源

C++ AMP (C++ Accelerated Massive Parallelism)

共用方式為

逐步解說：矩陣乘法

必要條件

若要建立專案

無 tiling 的乘法

要相乘而不使用 C++ AMP

若要使用 C++ AMP 值相乘

乘法與並排顯示

若要做相乘使用放大器和並排顯示

請參閱

工作

其他資源

其他資源