Using Tiles

Article
03/01/2013

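You can use tiling to maximize the acceleration of your app. Tiling divides threads into equal rectangular subsets, or tiles. If you use an appropriate tile size and a tiled algorithm, you can get even more acceleration from your C++ AMP code. The basic components of tiling are: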

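tile_static variables. The primary benefit of tiling is the performance gain from tile_static access. Access to data in tile_static memory can be significantly faster than access to data in the global space (array or array_view objects). One instance of a tile_static variable is created for each tile, and all threads in the tile have access to the variable. In a typical tiled algorithm, data is copied into tile_static memory once from global memory and then accessed many times from tile_static memory.
tile_barrier::wait method. A call to tile_barrier::wait suspends execution of the current thread until all of the threads in the same tile reach the call to tile_barrier::wait. You cannot guarantee the order in which the threads run, only that no thread in the tile executes past the call to tile_barrier::wait until all of the threads have reached it. This means that by using the tile_barrier::wait method, you can perform tasks on a tile-by-tile basis rather than a thread-by-thread basis. A typical tiling algorithm has code that initializes the tile_static memory for the whole tile, followed by a call to tile_barrier::wait. The code that follows tile_barrier::wait contains the computations that require access to all of the tile_static values.
Local and global indexing. You have access to the index of the thread relative to the entire array_view or array object and to the index relative to the tile. Using the local index can make your code easier to read and debug. Typically, you use local indexing to access tile_static variables, and global indexing to access array and array_view variables.
tiled_extent class and tiled_index class. You use a tiled_extent object instead of an extent object in the parallel_for_each call. You use a tiled_index object instead of an index object in the parallel_for_each call.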

To take advantage of tiling, your algorithm must partition the compute domain into tiles and then copy the tile data into tile_static variables for faster access.

Example of Global, Tile, and Local Indices

The following diagram represents an 8x9 matrix of data that is arranged in 2x3 tiles.

Figure: An 8x9 matrix divided into 2x3 tiles

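The following example displays the global, tile, and local indices of this tiled matrix. An array_view object is created by using elements of type Description. Each Description holds the global, tile, and local indices of its element in the matrix. The code in the call to parallel_for_each sets the values of the global, tile, and local indices of each element. The output displays the values in the Description structures.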

#include <iostream>
#include <iomanip>
#include <vector>
#include <Windows.h>
#include <amp.h>
using namespace concurrency;

const int ROWS = 8;
const int COLS = 9;

// tileRow and tileColumn specify the tile that each thread is in.
// globalRow and globalColumn specify the location of the thread in the array_view.
// localRow and localColumn specify the location of the thread relative to the tile.
struct Description {
    int value;
    int tileRow;
    int tileColumn;
    int globalRow;
    int globalColumn;
    int localRow;
    int localColumn;
};

// A helper function for formatting the output.
void SetConsoleColor(int color) {
    int colorValue = (color == 0) ? 4 : 2;
    SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE), colorValue);
}

// A helper function for formatting the output.
void SetConsoleSize(int height, int width) {
    COORD coord; coord.X = width; coord.Y = height;
    SetConsoleScreenBufferSize(GetStdHandle(STD_OUTPUT_HANDLE), coord);
    // Use a stack-allocated SMALL_RECT so that nothing is leaked.
    SMALL_RECT rect;
    rect.Left = 0; rect.Top = 0; rect.Right = width; rect.Bottom = height;
    SetConsoleWindowInfo(GetStdHandle(STD_OUTPUT_HANDLE), TRUE, &rect);
}

// This method creates an 8x9 matrix of Description structures. In the call to parallel_for_each, the structure is updated 
// with tile, global, and local indices.
void TilingDescription() {
    // Create 72 (8x9) Description structures.
    std::vector<Description> descs;
    for (int i = 0; i < ROWS * COLS; i++) {
        Description d = {i, 0, 0, 0, 0, 0, 0};
        descs.push_back(d);
    }

    // Create an array_view from the Description structures.
    extent<2> matrix(ROWS, COLS);
    array_view<Description, 2> descriptions(matrix, descs);

    // Update each Description with the tile, global, and local indices.
    parallel_for_each(descriptions.extent.tile< 2, 3>(),
         [= ] (tiled_index< 2, 3> t_idx) restrict(amp) 
    {
        descriptions[t_idx].globalRow = t_idx.global[0];
        descriptions[t_idx].globalColumn = t_idx.global[1];
        descriptions[t_idx].tileRow = t_idx.tile[0];
        descriptions[t_idx].tileColumn = t_idx.tile[1];
        descriptions[t_idx].localRow = t_idx.local[0];
        descriptions[t_idx].localColumn= t_idx.local[1];
    });

    // Print out the Description structure for each element in the matrix.
    // Tiles are displayed in red and green to distinguish them from each other.
    SetConsoleSize(100, 150);
    for (int row = 0; row < ROWS; row++) {
        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Value: " << std::setw(2) << descriptions(row, column).value << "      ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Tile:   " << "(" << descriptions(row, column).tileRow << "," << descriptions(row, column).tileColumn << ")  ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Global: " << "(" << descriptions(row, column).globalRow << "," << descriptions(row, column).globalColumn << ")  ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Local:  " << "(" << descriptions(row, column).localRow << "," << descriptions(row, column).localColumn << ")  ";
        }
        std::cout << "\n";
        std::cout << "\n";
    }
}

int main() {
    TilingDescription();
    char wait;
    std::cin >> wait;
}

The main work of the example is in the definition of the array_view object and the call to parallel_for_each.

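The vector of Description structures is copied into an 8x9 array_view object.
The parallel_for_each method is called with a tiled_extent object as the compute domain. The tiled_extent object is created by calling the extent::tile() method of the descriptions variable. The type parameters of the call to extent::tile(), <2,3>, specify that 2x3 tiles are created. Thus, the 8x9 matrix is tiled into 12 tiles, arranged in four rows and three columns.
The parallel_for_each method is called by using a tiled_index<2,3> object (t_idx) as the index. The type parameters of the index (t_idx) must match the type parameters of the compute domain (descriptions.extent.tile< 2, 3>()).
When each thread is executed, the index t_idx returns information about which tile the thread is in (tiled_index::tile property) and the location of the thread within the tile (tiled_index::local property).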

Tile Synchronization: tile_static and tile_barrier::wait

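The previous example illustrates the tile layout and indices, but is not in itself very useful. Tiling becomes useful when the tiles are integral to the algorithm and exploit tile_static variables. Because all threads in a tile have access to tile_static variables, calls to tile_barrier::wait are used to synchronize access to the tile_static variables. Although all of the threads in a tile have access to the tile_static variables, there is no guaranteed order of execution of the threads in the tile. The following example shows how to use tile_static variables and the tile_barrier::wait method to calculate the average value of each tile. Here are the keys to understanding the example: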

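The rawData is stored in an 8x8 matrix.
The tile size is 2x2. This creates a 4x4 grid of tiles, and the averages can be stored in a 4x4 matrix by using an array object. Only a limited number of types can be captured by reference in an AMP-restricted function. The array class is one of them.
The matrix size and sample size are defined by using #define statements, because the type parameters to array, array_view, extent, and tiled_index must be constant values. You can also use const int static declarations. As an additional benefit, it is trivial to change the sample size to calculate the average over 4x4 tiles.
A tile_static 2x2 array of float values is declared for each tile. Although the declaration is in the code path for every thread, only one array is created for each tile in the matrix.
There is a line of code that copies the values in each tile to the tile_static array. For each thread, after the value is copied to the array, execution on the thread stops at the call to tile_barrier::wait.
When all of the threads in a tile have reached the barrier, the average can be calculated. Because the code executes for every thread, there is an if statement so that the average is calculated on only one thread. The average is stored in the averages variable. The barrier is essentially the construct that controls calculations by tile, much as you might use a for loop.
The data in the averages variable, because it is an array object, must be copied back to the host. This example uses the vector conversion operator.
In the complete example, you can change SAMPLESIZE to 4 and the code executes correctly without any other changes.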

#include <iostream>
#include <vector>
#include <amp.h>
using namespace concurrency;

#define SAMPLESIZE 2
#define MATRIXSIZE 8
void SamplingExample() {

    // Create data and array_view for the matrix.
    std::vector<float> rawData;
    for (int i = 0; i < MATRIXSIZE * MATRIXSIZE; i++) {
        rawData.push_back((float)i);
    }
    extent<2> dataExtent(MATRIXSIZE, MATRIXSIZE);
    array_view<float, 2> matrix(dataExtent, rawData);

    // Create the array for the averages.
    // There is one element in the output for each tile in the data.
    std::vector<float> outputData;
    int outputSize = MATRIXSIZE / SAMPLESIZE;
    for (int j = 0; j < outputSize * outputSize; j++) {
        outputData.push_back((float)0);
    }
    extent<2> outputExtent(MATRIXSIZE / SAMPLESIZE, MATRIXSIZE / SAMPLESIZE);
    array<float, 2> averages(outputExtent, outputData.begin(), outputData.end());

    // Use tiles that are SAMPLESIZE x SAMPLESIZE.
    // Find the average of the values in each tile.
    // The only reference-type variable you can pass into the parallel_for_each call
    // is a concurrency::array.
    parallel_for_each(matrix.extent.tile<SAMPLESIZE, SAMPLESIZE>(),
         [=, &averages] (tiled_index<SAMPLESIZE, SAMPLESIZE> t_idx) restrict(amp) 
    {
        // Copy the values of the tile into a tile-sized array.
        tile_static float tileValues[SAMPLESIZE][SAMPLESIZE];
        tileValues[t_idx.local[0]][t_idx.local[1]] = matrix[t_idx];

        // Wait for the tile-sized array to load before you calculate the average.
        t_idx.barrier.wait();

        // If you remove the if statement, then the calculation executes for every
        // thread in the tile, and makes the same assignment to averages each time.
        if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
            for (int trow = 0; trow < SAMPLESIZE; trow++) {
                for (int tcol = 0; tcol < SAMPLESIZE; tcol++) {
                    averages(t_idx.tile[0],t_idx.tile[1]) += tileValues[trow][tcol];
                }
            }
            averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);
        }
    });

    // Print out the results.
    // You cannot access the values in averages directly. You must copy them
    // back to a CPU variable.
    outputData = averages;
    for (int row = 0; row < outputSize; row++) {
        for (int col = 0; col < outputSize; col++) {
            std::cout << outputData[row*outputSize + col] << " ";
        }
        std::cout << "\n";
    }
    // Output for SAMPLESIZE = 2 is:
    //  4.5  6.5  8.5 10.5
    // 20.5 22.5 24.5 26.5
    // 36.5 38.5 40.5 42.5
    // 52.5 54.5 56.5 58.5

    // Output for SAMPLESIZE = 4 is:
    // 13.5 17.5
    // 45.5 49.5
}

int main() {
    SamplingExample();
}

Race Conditions

It might be tempting to create a tile_static variable named total and increment that variable for each thread, like this:

// Do not do this.
tile_static float total;
total += matrix[t_idx];
t_idx.barrier.wait();
averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);

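The first problem with this approach is that tile_static variables cannot have initializers. The second problem is that there is a race condition on the assignment to total, because all of the threads in the tile have access to the variable in no particular order. You could program an algorithm that allows only one thread to access the total at each barrier, as shown next. However, this solution is not extensible.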

// Do not do this.
tile_static float total;
if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
    total = matrix[t_idx];
}
t_idx.barrier.wait();

if (t_idx.local[0] == 0 && t_idx.local[1] == 1) {
    total += matrix[t_idx];
}
t_idx.barrier.wait();

// etc.

Memory Fences

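There are two kinds of memory accesses that must be synchronized: global memory access and tile_static memory access. A concurrency::array object allocates only global memory. A concurrency::array_view can reference global memory, tile_static memory, or both, depending on how it was constructed. There are two kinds of memory that must be synchronized:

Global memory
tile_static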

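A memory fence ensures that memory accesses are available to the other threads in the thread tile, and that memory accesses are executed according to program order. To ensure this, compilers and processors do not reorder reads and writes across the fence. In C++ AMP, a memory fence is created by a call to one of these methods: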

tile_barrier::wait method: Creates a fence around both global and tile_static memory.
tile_barrier::wait_with_all_memory_fence method: Creates a fence around both global and tile_static memory.
tile_barrier::wait_with_global_memory_fence method: Creates a fence around only global memory.
tile_barrier::wait_with_tile_static_memory_fence method: Creates a fence around only tile_static memory.

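Calling the specific fence that you require can improve the performance of your app. The barrier type affects how the compiler and the hardware reorder statements. For example, a global memory fence applies only to global memory accesses, so the compiler and the hardware might still reorder reads and writes to tile_static variables on the two sides of the fence.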

In the next example, the barrier synchronizes the writes to tileValues, a tile_static variable. In this example, tile_barrier::wait_with_tile_static_memory_fence is called instead of tile_barrier::wait.

// Using a tile_static memory fence.
parallel_for_each(matrix.extent.tile<SAMPLESIZE, SAMPLESIZE>(),
     [=, &averages] (tiled_index<SAMPLESIZE, SAMPLESIZE> t_idx) restrict(amp) 
{
    // Copy the values of the tile into a tile-sized array.
    tile_static float tileValues[SAMPLESIZE][SAMPLESIZE];
    tileValues[t_idx.local[0]][t_idx.local[1]] = matrix[t_idx];

    // Wait for the tile-sized array to load before calculating the average.
    t_idx.barrier.wait_with_tile_static_memory_fence();

    // If you remove the if statement, then the calculation executes for every
    // thread in the tile, and makes the same assignment to averages each time.
    if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
        for (int trow = 0; trow < SAMPLESIZE; trow++) {
            for (int tcol = 0; tcol < SAMPLESIZE; tcol++) {
                averages(t_idx.tile[0],t_idx.tile[1]) += tileValues[trow][tcol];
            }
        }
        averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);
    }
});
