Export to Hive Query
Important
Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.
Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.
- See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
- Learn more about Azure Machine Learning.
ML Studio (classic) documentation is being retired and may not be updated in the future.
Note
Applies to: Machine Learning Studio (classic) only
Similar drag-and-drop modules are available in Azure Machine Learning designer.
This article describes how to use the Export data to Hive option in the Export Data module in Machine Learning Studio (classic). This option is useful when you are working with very large datasets, and want to save your machine learning experiment data to a Hadoop cluster or HDInsight distributed storage. You might also want to export intermediate results or other data to Hadoop so that you can process it using a MapReduce job.
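The parameters table later in this article notes that the HCatalog server URI is a Templeton endpoint; Templeton (WebHCat) is the REST interface that HDInsight exposes for Hive. As a minimal sketch, assuming the Python requests package and placeholder cluster credentials, this is how you could submit a HiveQL statement to that same endpoint to inspect a table the module has written:

```python
# Minimal sketch: query a Hive table through WebHCat (Templeton), the REST
# endpoint the module's HCatalog server URI points at. The cluster name,
# credentials, table name, and status directory are placeholders.
import requests

WEBHCAT = "https://mycluster001.azurehdinsight.net/templeton/v1"
AUTH = ("admin", "<cluster-login-password>")  # Hadoop user account name/password

# WebHCat runs the statement asynchronously and writes the results to the
# status directory in the cluster's default storage container.
resp = requests.post(
    f"{WEBHCAT}/hive",
    auth=AUTH,
    data={
        "user.name": AUTH[0],
        "execute": "SELECT * FROM mytable LIMIT 10;",  # table written by Export Data
        "statusdir": "wasb:///example/exportcheck",
    },
)
resp.raise_for_status()
print("Hive job id:", resp.json()["id"])  # poll /jobs/<id> for completion
```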
How to export data to Hive
Add the Export Data module to your experiment. You can find this module in the Data Input and Output category in Machine Learning Studio (classic).
Connect the module to the dataset you want to export.
For Data source, select Hive Query.
For Hive table name, type the name of the Hive table in which to store the dataset.
In the HCatalog server URI text box, type the fully qualified name of your cluster. (See the connectivity sketch after these steps.)
For example, if you created a cluster with the name mycluster001, use this format: https://mycluster001.azurehdinsight.net
In the Hadoop user account name text box, paste in the Hadoop user account that you used when you provisioned the cluster.
In the Hadoop user account password text box, type the credentials that you used when you provisioned the cluster.
For Location of output data, select the option that indicates where the data should be stored: HDFS or Azure.
If the data is in the Hadoop distributed file system (HDFS), it must be accessible via the same account and password that you just entered.
If the data is in Azure, provide the location and credentials of the storage account.
If you selected the HDFS option, for HDFS server URI, specify the HDInsight cluster name without the https:// prefix.
If you selected the Azure option, provide the storage account name, and the credentials the module can use to connect to storage.
Azure storage account name: Type the name of the Azure storage account. For example, if the full URL of the storage account is https://myshared.blob.core.windows.net, you would type myshared.
Azure storage key: Copy and paste the key that is provided for accessing the storage account.
Azure container name: Specify the default container for the cluster. For tips on how to figure out the default container, see the Technical notes section.
Use cached results: Select this option if you want to avoid rewriting the Hive table each time you run the experiment. If there are no other changes to module parameters, the experiment writes the Hive table only the first time the module is run, or when there are changes to the data.
If you want to write the Hive table each time the experiment is run, deselect the Use cached results option.
Run the experiment.
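Before you run the experiment, it can save time to verify the HCatalog server URI and the Hadoop account credentials outside of Studio (classic). The following is a minimal sketch, assuming the Python requests package; the cluster name and credentials are placeholders. WebHCat's status resource returns a small JSON document when the endpoint and credentials are valid:

```python
# Minimal sketch: confirm the HCatalog (WebHCat) endpoint and credentials
# respond before running the experiment. Values below are placeholders.
import requests

HCATALOG_URI = "https://mycluster001.azurehdinsight.net"
AUTH = ("admin", "<cluster-login-password>")

resp = requests.get(
    f"{HCATALOG_URI}/templeton/v1/status",
    auth=AUTH,
    params={"user.name": AUTH[0]},
)
resp.raise_for_status()
print(resp.json())  # expected: {'status': 'ok', 'version': 'v1'}
```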
Examples
For examples of how to use the Export Data module, see the Azure AI Gallery.
- Advanced Analytics Process and Technology in Action: Using HDInsight Hadoop clusters: This article provides a detailed walkthrough of how to create a cluster, upload data, and call the data from Studio (classic) using Hive.
Technical notes
This section contains implementation details, tips, and answers to frequently asked questions.
Common questions
How to avoid out of memory problems when writing large datasets
Sometimes the default configuration of the Hadoop cluster is too limited to support running the MapReduce job. For example, in these Release Notes for HDInsight, the default settings are defined as a four-node cluster.
If the requirements of the MapReduce job exceed available capacity, the Hive queries might return an Out of Memory error message, which causes the Export Data operation to fail. If this happens, you can change the default memory allocation for Hive queries.
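The module's own Hive job inherits the cluster defaults, so a lasting change to memory allocation is made in the cluster's configuration rather than from Studio (classic). As a diagnostic, however, you can test whether larger per-job memory settings resolve the error by submitting a query with SET overrides through WebHCat. The following is a sketch only; the property names and values are illustrative and depend on your cluster's Hadoop version and node sizes, and the endpoint and credentials are placeholders:

```python
# Minimal sketch: submit a test query with per-job memory overrides to see
# whether larger settings avoid the Out of Memory error. Property names and
# values are illustrative; the endpoint and credentials are placeholders.
import requests

WEBHCAT = "https://mycluster001.azurehdinsight.net/templeton/v1"
AUTH = ("admin", "<cluster-login-password>")

hql = (
    "SET mapreduce.map.memory.mb=4096; "     # per-mapper memory (assumed value)
    "SET mapreduce.reduce.memory.mb=4096; "  # per-reducer memory (assumed value)
    "SELECT COUNT(*) FROM mytable;"          # stand-in for the failing workload
)
resp = requests.post(
    f"{WEBHCAT}/hive",
    auth=AUTH,
    data={"user.name": AUTH[0], "execute": hql,
          "statusdir": "wasb:///example/memorytest"},
)
resp.raise_for_status()
print("Hive job id:", resp.json()["id"])
```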
How to avoid re-loading the same data unnecessarily
If you don't want to recreate the Hive table each time you run the experiment, set the Use cached results option to TRUE. When this option is set to TRUE, the module checks whether the experiment has run previously and, if a previous run is found, skips the write operation.
Usage tips
It can be hard to figure out the default container for the cluster. Here are some tips:
If you created your cluster by using the default settings, a container with the same name was created at the same time that the cluster was created. That container is the default container for the cluster.
If you created the cluster by using the CUSTOM CREATE option, you were given two options for selecting the default container.
Existing container: If you selected an existing container, that container is the default storage container for the cluster.
Create default container: If you selected this option, a container with the same name as the cluster was created, and you should specify that container name as the default container for the cluster.
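If you would rather confirm the container name programmatically, the following is a minimal sketch using the azure-storage-blob Python package (an assumption; any storage client works). The account name and key are the same values you enter in the module, shown here as placeholders; the default container is typically the one that shares the cluster's name:

```python
# Minimal sketch: list the containers in the storage account to locate the
# cluster's default container. Account name and key are placeholders.
from azure.storage.blob import BlobServiceClient

ACCOUNT = "myshared"               # Azure storage account name
KEY = "<storage-account-key>"      # Azure storage key

service = BlobServiceClient(
    account_url=f"https://{ACCOUNT}.blob.core.windows.net",
    credential=KEY,
)
for container in service.list_containers():
    print(container.name)          # look for a container named after the cluster
```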
Module parameters
Name | Range | Type | Default | Description |
---|---|---|---|---|
Data source | List | Data Source Or Sink | Azure Blob Storage | Data source can be HTTP, FTP, anonymous HTTPS or FTPS, a file in Azure Blob storage, an Azure table, an Azure SQL Database, a Hive table, or an OData endpoint. |
Hive table name | any | String | none | Name of table in Hive |
HCatalog server URI | any | String | none | Templeton endpoint |
Hadoop user account name | any | String | none | Hadoop HDFS/HDInsight username |
Hadoop user account password | any | SecureString | none | Hadoop HDFS/HDInsight password |
Location of output data | any | DataLocation | HDFS | Specify HDFS or Azure for outputDir |
HDFS server URI | any | String | none | HDFS rest endpoint |
Azure storage account name | any | String | none | Azure storage account name |
Azure storage key | any | SecureString | none | Azure storage key |
Azure container name | any | String | none | Azure container name |
Use cached results | TRUE/FALSE | Boolean | FALSE | The module executes only if a valid cache does not exist; otherwise, cached data from a prior execution is used. |
Exceptions
Exception | Description |
---|---|
Error 0027 | An exception occurs when two objects have to be the same size, but they are not. |
Error 0003 | An exception occurs if one or more inputs are null or empty. |
Error 0029 | An exception occurs when an invalid URI is passed. |
Error 0030 | An exception occurs when it is not possible to download a file. |
Error 0002 | An exception occurs if one or more parameters could not be parsed or converted from the specified type to the type required by the target method. |
Error 0009 | An exception occurs if the Azure storage account name or the container name is specified incorrectly. |
Error 0048 | An exception occurs when it is not possible to open a file. |
Error 0046 | An exception occurs when it is not possible to create a directory on the specified path. |
Error 0049 | An exception occurs when it is not possible to parse a file. |
For a list of errors specific to Studio (classic) modules, see Machine Learning Error codes.
For a list of API exceptions, see Machine Learning REST API Error Codes.
See also
Import Data
Export Data
Export to Azure SQL Database
Export to Azure Blob Storage
Export to Azure Table