Transform data using Hadoop Hive activity in Azure Data Factory or Synapse Analytics

APPLIES TO: Azure Data Factory Azure Synapse Analytics

Tip

Try out Data Factory in Microsoft Fabric, an all-in-one analytics solution for enterprises. Microsoft Fabric covers everything from data movement to data science, real-time analytics, business intelligence, and reporting. Learn how to start a new trial for free!

The HDInsight Hive activity in an Azure Data Factory or Synapse Analytics pipeline executes Hive queries on your own or on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities.

If you're new to Azure Data Factory and Synapse Analytics, read the introduction articles for Azure Data Factory or Synapse Analytics, and complete the Tutorial: Transform data before reading this article.

Add an HDInsight Hive activity to a pipeline with UI

To use an HDInsight Hive activity in a pipeline, complete the following steps:

  1. Search for Hive in the pipeline Activities pane, and drag a Hive activity to the pipeline canvas.

  2. Select the new Hive activity on the canvas if it is not already selected.

  3. Select the HDI Cluster tab, and then select or create a linked service to the HDInsight cluster that will run the Hive activity.

    Shows the UI for a Hive activity.

  4. Select the Script tab, and then select or create a linked service to the storage account that hosts the script, and specify the path to the script within that storage location.

    Shows the UI for the Script tab for a Hive activity.

Syntax

{
    "name": "Hive Activity",
    "description": "description",
    "type": "HDInsightHive",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "scriptLinkedService": {
            "referenceName": "MyAzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "scriptPath": "MyAzureStorage\\HiveScripts\\MyHiveScript.hql",
        "getDebugInfo": "Failure",
        "arguments": [
            "SampleHadoopJobArgument1"
        ],
        "defines": {
            "param1": "param1Value"
        }
    }
}

Syntax details

| Property | Description | Required |
| --- | --- | --- |
| name | Name of the activity. | Yes |
| description | Text describing what the activity is used for. | No |
| type | For the Hive activity, the activity type is HDInsightHive. | Yes |
| linkedServiceName | Reference to the HDInsight cluster registered as a linked service. To learn about this linked service, see the Compute linked services article. | Yes |
| scriptLinkedService | Reference to an Azure Storage linked service used to store the Hive script to be executed. Only Azure Blob Storage and ADLS Gen2 linked services are supported here. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used. | No |
| scriptPath | Path to the script file stored in the Azure Storage referenced by scriptLinkedService. The file name is case-sensitive. | Yes |
| getDebugInfo | Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster (or specified by scriptLinkedService). Allowed values: None, Always, or Failure. Default value: None. | No |
| arguments | An array of arguments for a Hadoop job. The arguments are passed as command-line arguments to each task. | No |
| defines | Parameters specified as key/value pairs for referencing within the Hive script. | No |
| queryTimeout | Query timeout value, in minutes. Applicable when the HDInsight cluster has the Enterprise Security Package enabled. | No |
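Values passed through defines surface inside the script as hiveconf variables. A minimal sketch of what MyHiveScript.hql from the Syntax example might look like — the hivesampletable used here is the sample table that ships with HDInsight clusters, and the output table name is hypothetical:

```hql
-- "param1" matches the key in the activity's "defines" block;
-- the activity passes it to Hive, where it is read as a hiveconf variable.
DROP TABLE IF EXISTS filtered_sample;

CREATE TABLE filtered_sample AS
SELECT clientid, market, devicemodel
FROM hivesampletable
WHERE market = '${hiveconf:param1}';
```

With the Syntax example's defines of "param1": "param1Value", the WHERE clause resolves to market = 'param1Value' at run time, so the same script can be reused across pipelines by changing only the defines.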

Note

The default value for queryTimeout is 120 minutes.
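The Syntax example above omits queryTimeout. When the activity targets a cluster with the Enterprise Security Package, it can be set alongside the other typeProperties — a trimmed sketch, assuming a 60-minute limit:

```json
{
    "type": "HDInsightHive",
    "typeProperties": {
        "scriptPath": "MyAzureStorage\\HiveScripts\\MyHiveScript.hql",
        "queryTimeout": 60
    }
}
```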

See the following articles that explain how to transform data in other ways: