Use C# user-defined functions with Apache Hive and Apache Pig on Apache Hadoop in HDInsight

Article
12/04/2024

Learn how to use C# user-defined functions (UDFs) with Apache Hive and Apache Pig on Azure HDInsight.

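Important

The steps in this document require a Linux-based Azure HDInsight cluster. Linux is the only operating system used on HDInsight version 3.4 or later. For more information, see HDInsight component versioning.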

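Both Hive and Pig can pass data to external applications for processing. This process is known as streaming. When you use a .NET application, the data is passed to the application on STDIN, and the application returns the results on STDOUT. To read from STDIN and write to STDOUT, you can use Console.ReadLine() and Console.WriteLine() from a console application.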
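The streaming contract described above can be sketched as a small console application. This is a minimal sketch, not the article's UDF: the namespace StreamingSketch, the Run helper, and the upper-casing transform in TransformRow are hypothetical names chosen for illustration. It assumes one tab-delimited row per input line and writes one row back per line.

```csharp
using System;
using System.IO;

namespace StreamingSketch
{
    public static class Program
    {
        // Hypothetical per-row transform: upper-case the first tab-separated field.
        public static string TransformRow(string line)
        {
            string[] fields = line.Split('\t');
            fields[0] = fields[0].ToUpperInvariant();
            return string.Join("\t", fields);
        }

        // Reads rows until end of input and writes one output row per input row.
        public static void Run(TextReader input, TextWriter output)
        {
            string line;
            while ((line = input.ReadLine()) != null)
            {
                output.WriteLine(TransformRow(line));
            }
        }

        public static void Main(string[] args)
        {
            // Hive or Pig supplies the rows on STDIN and reads the results from STDOUT.
            Run(Console.In, Console.Out);
        }
    }
}
```

Passing the reader and writer into Run, instead of calling Console directly, is only a design choice that lets you exercise the loop with a StringReader and StringWriter before uploading anything to a cluster.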

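Prerequisites

Familiarity with writing and building C# code that targets .NET Framework 4.5.

Use whatever IDE you want. We recommend Visual Studio or Visual Studio Code. The steps in this document use Visual Studio 2019.
A way to upload .exe files to the cluster and run Pig and Hive jobs. We recommend Data Lake Tools for Visual Studio, Azure PowerShell, and Azure CLI. The steps in this document use Data Lake Tools for Visual Studio to upload the files and run the example Hive query.

For information on other ways to run Apache Hive queries, see What is Apache Hive and HiveQL on Azure HDInsight?
An Apache Hadoop cluster on HDInsight. For more information on creating a cluster, see Create HDInsight clusters.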

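.NET on HDInsight

Linux-based HDInsight clusters use Mono (https://mono-project.com) to run .NET applications. Mono version 4.2.1 is included with HDInsight version 3.6.

For more information on Mono compatibility with .NET Framework versions, see Mono compatibility.

For more information on the versions of the .NET Framework and Mono included with HDInsight versions, see HDInsight component versions.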

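Create the C# projects

The following sections describe how to create a C# project in Visual Studio for an Apache Hive UDF and an Apache Pig UDF.

Apache Hive UDF

To create a C# project for an Apache Hive UDF:

Launch Visual Studio.
Select [Create a new project].
In the [Create a new project] window, choose the [Console App (.NET Framework)] template (the C# version). Then select [Next].
In the [Configure your new project] window, enter a [Project name] of HiveCSharp, and navigate to or create a [Location] to save the new project in. Then select [Create].

In the Visual Studio IDE, replace the contents of Program.cs with the following code: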

using System;
using System.Security.Cryptography;
using System.Text;

namespace HiveCSharp
{
    class Program
    {
        static void Main(string[] args)
        {
            string line;
            // Read stdin in a loop
            while ((line = Console.ReadLine()) != null)
            {
                // Parse the string, trimming line feeds
                // and splitting fields at tabs
                line = line.TrimEnd('\n');
                string[] field = line.Split('\t');
                string phoneLabel = field[1] + ' ' + field[2];
                // Emit new data to stdout, delimited by tabs
                Console.WriteLine("{0}\t{1}\t{2}", field[0], phoneLabel, GetMD5Hash(phoneLabel));
            }
        }
        /// <summary>
        /// Returns an MD5 hash for the given string
        /// </summary>
        /// <param name="input">string value</param>
        /// <returns>an MD5 hash</returns>
        static string GetMD5Hash(string input)
        {
            // Step 1, calculate MD5 hash from input
            MD5 md5 = System.Security.Cryptography.MD5.Create();
            byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
            byte[] hash = md5.ComputeHash(inputBytes);

            // Step 2, convert byte array to hex string
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < hash.Length; i++)
            {
                sb.Append(hash[i].ToString("x2"));
            }
            return sb.ToString();
        }
    }
}

From the menu bar, select [Build] > [Build Solution] to build the application.
Close the solution.
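The program above indexes field[1] and field[2] directly, so a row with fewer than three tab-separated fields would throw an exception. The following is a hedged sketch of a more defensive variant. It assumes you prefer to skip malformed rows, which is a policy choice and not something the original sample does. The names HiveCSharpSketch, RowFormatter, and TryFormatRow are hypothetical.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

namespace HiveCSharpSketch
{
    public static class RowFormatter
    {
        // Builds the output row; returns false when the input has fewer than three fields.
        public static bool TryFormatRow(string line, out string result)
        {
            result = null;
            if (line == null) return false;

            string[] field = line.Split('\t');
            if (field.Length < 3) return false;

            string phoneLabel = field[1] + " " + field[2];
            result = string.Join("\t", field[0], phoneLabel, GetMD5Hash(phoneLabel));
            return true;
        }

        // Returns the MD5 hash of the input as a lowercase hex string.
        public static string GetMD5Hash(string input)
        {
            using (MD5 md5 = MD5.Create())
            {
                byte[] hash = md5.ComputeHash(Encoding.ASCII.GetBytes(input));
                StringBuilder sb = new StringBuilder(hash.Length * 2);
                foreach (byte b in hash)
                {
                    sb.Append(b.ToString("x2"));
                }
                return sb.ToString();
            }
        }
    }

    public static class Program
    {
        public static void Main(string[] args)
        {
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                string row;
                if (RowFormatter.TryFormatRow(line, out row))
                {
                    Console.WriteLine(row);
                }
                else
                {
                    // Assumption: report the bad row on STDERR so it is not emitted as an output row.
                    Console.Error.WriteLine("Skipping malformed row: " + line);
                }
            }
        }
    }
}
```

Because skipped rows are written to standard error instead of standard output, they never appear in the result set. Whether skipping or failing is the better behavior depends on your data.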

Apache Pig UDF

To create a C# project for an Apache Pig UDF:

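Open Visual Studio.
In the [Start] window, select [Create a new project].
In the [Create a new project] window, choose the [Console App (.NET Framework)] template (the C# version). Then select [Next].
In the [Configure your new project] window, enter a [Project name] of PigUDF, and navigate to or create a [Location] to save the new project in. Then select [Create].

In the Visual Studio IDE, replace the contents of Program.cs with the following code: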

using System;

namespace PigUDF
{
    class Program
    {
        static void Main(string[] args)
        {
            string line;
            // Read stdin in a loop
            while ((line = Console.ReadLine()) != null)
            {
                // Fix formatting on lines that begin with an exception
                if(line.StartsWith("java.lang.Exception"))
                {
                    // Trim the error info off the beginning and add a note to the end of the line
                    line = line.Remove(0, 21) + " - java.lang.Exception";
                }
                // Split the fields apart at tab characters
                string[] field = line.Split('\t');
                // Put fields back together for writing
                Console.WriteLine(String.Join("\t",field));
            }
        }
    }
}

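This code parses the lines sent from Pig and reformats the lines that begin with java.lang.Exception.

From the menu bar, select [Build] > [Build Solution] to build the application.
Leave the solution open.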
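To make the Remove(0, 21) call in the Pig UDF concrete, here is a small sketch of the same reformatting applied to one line. It assumes the raw log line starts with "java.lang.Exception" followed by a colon and a space (19 + 2 = 21 characters). That input format is an assumption inferred from the sample output shown later in this article. The names PigUdfSketch and LineFixer are hypothetical, and the length guard is an addition that the original sample doesn't have.

```csharp
using System;

namespace PigUdfSketch
{
    public static class LineFixer
    {
        private const string Marker = "java.lang.Exception";

        // Moves the exception marker from the start of the line to the end.
        public static string Reformat(string line)
        {
            // 21 = the 19 characters of the marker plus an assumed ": " separator.
            if (line.StartsWith(Marker, StringComparison.Ordinal) && line.Length >= 21)
            {
                return line.Remove(0, 21) + " - " + Marker;
            }
            return line;
        }
    }

    public static class Program
    {
        public static void Main(string[] args)
        {
            // Assumed input format; only the reformatted output appears in the article.
            string input = "java.lang.Exception: 2019-07-15 16:43:25 SampleClass5 [WARN] problem finding id 1358451042";
            Console.WriteLine(LineFixer.Reformat(input));
            // Prints: 2019-07-15 16:43:25 SampleClass5 [WARN] problem finding id 1358451042 - java.lang.Exception
        }
    }
}
```

Lines that don't start with the marker pass through unchanged.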

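Upload to storage

Next, upload the Hive and Pig UDF applications to the storage on the HDInsight cluster.

In Visual Studio, navigate to [View] > [Server Explorer].
In [Server Explorer], right-click [Azure], select [Connect to Microsoft Azure Subscription], and complete the sign-in process.
Expand the HDInsight cluster that you want to deploy this application to. An entry with the text (Default Storage Account) is listed.
- If this entry can be expanded, you're using an Azure Storage account as the default storage for the cluster. To view the files on the default storage for the cluster, expand the entry and then double-click [(Default Container)].
- If this entry can't be expanded, you're using Azure Data Lake Storage as the default storage for the cluster. To view the files on the default storage for the cluster, double-click the [(Default Storage Account)] entry.
To upload the .exe files, use one of the following methods:
- If you're using an Azure Storage account, select the [Upload Blob] icon.
  
  In the [Upload New File] dialog box, under [File name], select [Browse]. In the [Upload Blob] dialog box, go to the bin\debug folder for the HiveCSharp project, and then choose the HiveCSharp.exe file. Finally, select [Open] and then [OK] to complete the upload.
- If you're using Azure Data Lake Storage, right-click an empty area in the file listing, and then select [Upload]. Finally, choose the HiveCSharp.exe file and select [Open].
After the HiveCSharp.exe upload has finished, repeat the upload process for the PigUDF.exe file.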

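Run an Apache Hive query

Now you can run a Hive query that uses your Hive UDF application.

In Visual Studio, navigate to [View] > [Server Explorer].
Expand [Azure], and then expand [HDInsight].
Right-click the cluster that you deployed the HiveCSharp application to, and then select [Write a Hive Query].

Use the following text for the Hive query: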

-- Uncomment the following if you are using Azure Storage
-- add file wasbs:///HiveCSharp.exe;
-- Uncomment the following if you are using Azure Data Lake Storage Gen1
-- add file adl:///HiveCSharp.exe;
-- Uncomment the following if you are using Azure Data Lake Storage Gen2
-- add file abfs:///HiveCSharp.exe;

SELECT TRANSFORM (clientid, devicemake, devicemodel)
USING 'HiveCSharp.exe' AS
(clientid string, phoneLabel string, phoneHash string)
FROM hivesampletable
ORDER BY clientid LIMIT 50;

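Important

Uncomment the add file statement that matches the type of default storage used for your cluster.

This query selects the clientid, devicemake, and devicemodel fields from hivesampletable, and then passes the fields to the HiveCSharp.exe application. The query expects the application to return three fields, which are stored as clientid, phoneLabel, and phoneHash. The query also expects to find HiveCSharp.exe in the root of the default storage container.

Switch the default [Interactive] to [Batch], and then select [Submit] to submit the job to the HDInsight cluster. The [Hive Job Summary] window opens.
Select [Refresh] to refresh the summary until [Job Status] changes to [Completed]. To view the job output, select [Job Output].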

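Run an Apache Pig job

You can also run a Pig job that uses your Pig UDF application.

Use SSH to connect to your HDInsight cluster. (For example, run the command ssh sshuser@<clustername>-ssh.azurehdinsight.net.) For more information, see Use SSH with HDInsight.
Use the following command to start the Pig command line: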
```
pig
```
A grunt> prompt is displayed.
Enter the following to run a Pig job that uses the .NET Framework application:
```
DEFINE streamer `PigUDF.exe` CACHE('/PigUDF.exe');
LOGS = LOAD '/example/data/sample.log' as (LINE:chararray);
LOG = FILTER LOGS by LINE is not null;
DETAILS = STREAM LOG through streamer as (col1, col2, col3, col4, col5);
DUMP DETAILS;
```
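The DEFINE statement creates an alias of streamer for the PigUDF.exe application, and CACHE loads it from the default storage for the cluster. Later, streamer is used with the STREAM operator to process the single lines contained in LOG and return the data as a series of columns.

Note

The application name that is used for streaming must be surrounded by the ` (backtick) character when aliased, and by the ' (single quote) character when used with SHIP.

After you enter the last line, the job should start. It returns output similar to the following text: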

(2019-07-15 16:43:25 SampleClass5 [WARN] problem finding id 1358451042 - java.lang.Exception)
(2019-07-15 16:43:25 SampleClass5 [DEBUG] detail for id 1976092771)
(2019-07-15 16:43:25 SampleClass5 [TRACE] verbose detail for id 1317358561)
(2019-07-15 16:43:25 SampleClass5 [TRACE] verbose detail for id 1737534798)
(2019-07-15 16:43:25 SampleClass7 [DEBUG] detail for id 1475865947)

Use exit to exit Pig.

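Next steps

In this document, you learned how to use a .NET Framework application from Hive and Pig on HDInsight. To learn how to use Python with Hive and Pig, see Use Python with Apache Hive and Apache Pig in HDInsight.

For other ways to use Pig and Hive, and to learn about using MapReduce, see the following: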
