使用 TorchDistributor 散發 PyTorch 定型

6 分鐘

PyTorch 與其他深度學習架構 (例如 TensorFlow) 一樣，可在單一電腦上跨多個處理器 (CPU 或 GPU) 進行調整。多數情況下，這種使用處理器更多或更快速的電腦進行擴大的方法，可提供足夠的定型效能。

不過，當您需要使用複雜的神經網路或大量定型資料時，Apache Spark 可在多個背景工作節點間擴增處理工作的既有功能或許會有幫助。

Azure Databricks 使用可包含多個背景工作節點的 Spark 叢集。若要充分利用這些叢集，您可以使用 TorchDistributor，這是一個開放原始碼程式庫，可讓您將 PyTorch 定型作業分散到叢集中的節點。 TorchDistributor 可在 Databricks Runtime ML 13.0 和更新版本上使用。

當您已使用 PyTorch 定型模型時，可以透過下列方式使用 TorchDistributor，將單一程序定型轉換成分散式定型：

調整現有的程式碼：修改您的單一節點定型程式碼，使其與分散式定型相容。請確定您的定型邏輯會封裝在單一函式內。
在定型函式內移動 import：在定型函式內放置必要的 import 如 import torch，以避免常見的挑選錯誤。
準備定型函式：在定型函式中包含您的模型、最佳化工具、損失函式和定型迴圈。確定模型和資料已移至適當的裝置 (CPU 或 GPU)。
具現化並執行 TorchDistributor：使用所需的參數建立 TorchDistributor 的執行個體，並呼叫 .run(*args) 以啟動分散式定型。

調整現有的程式碼

首先，您必須修改您的單一節點定型程式碼，使其與分散式定型相容。當您修改程式碼時，您必須確定您的定型邏輯會封裝在單一函式內。 TorchDistributor 會使用此函式，將定型分散到多個節點。

import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 1)
    
    def forward(self, x):
        return self.fc(x)

現在，您可以使用 torch.utils.data.DataLoader 來準備格式與 PyTorch 相容的資料集。

# Sample data
inputs = torch.randn(100, 10)
targets = torch.randn(100, 1)

# Create dataset and dataloader
from torch.utils.data import DataLoader, TensorDataset
dataset = TensorDataset(inputs, targets)
dataloader = DataLoader(dataset, batch_size=10)

在定型函式內移動 import

若要避免常見的挑選錯誤，請在定型函式內放置必要的 import 如 import torch。將所有 import 放在定型函式中，可確保當函式分散到多個節點時，所有必要模組都可供使用。

準備定型函式

在定型函式中包含您的模型、最佳化工具、損失函式和定型迴圈。確定模型和資料已移至適當的裝置 (CPU 或 GPU)。

def train_model(dataloader):
    import torch
    import torch.nn as nn
    from torch.optim import SGD

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = SimpleModel().to(device)
    optimizer = SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    
    for epoch in range(10):
        for batch in dataloader:
            inputs, targets = batch
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)
            loss.backward()
            optimizer.step()

具現化並執行 TorchDistributor

使用所需的參數建立 TorchDistributor 的執行個體，並呼叫 .run(*args) 以啟動分散式定型。執行 TorchDistributor 會將定型工作分散到多個節點。

from pyspark.ml.torch.distributor import TorchDistributor

# Distribute the training
distributor = TorchDistributor(num_workers=4)
distributor.run(train_model, dataloader)

監視和評估您的定型作業

您可以使用內建工具來監視叢集的效能，包括 CPU 或 GPU 使用量，以及記憶體使用率。定型完成時，您可以使用 PyTorch 評估技術在驗證或測試資料集時評估模型，以評估模型的效能。

# Evaluate the model (after distributed training is complete)
model.eval()
with torch.no_grad():
    for inputs, targets in dataloader:
        outputs = model(inputs)
        # Perform evaluation logic

提示