Processing data with Hive

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

Hive uses tables to impose a schema on data, and to provide a query interface for client applications. The key difference between Hive tables and those in traditional database systems, such as SQL Server, is that Hive adopts a “schema on read” approach that enables you to be flexible about the specific columns and data types that you want to project onto your data.

You can create multiple tables with different schemas from the same underlying data, depending on how you want to use that data. You can also create views and indexes over Hive tables, and partition tables. Moving data into a Hive-controlled namespace is usually an instantaneous operation.

You can use the Hive command line on the HDInsight cluster to work with Hive tables, and build an automated solution that includes Hive queries by using the HDInsight .NET SDKs and a range of Hadoop-related tools such as Oozie and WebHCat. You can also use the Hive ODBC driver to connect to Hive from any ODBC-capable client application.

The topics covered here are:

  • Creating tables with Hive
  • Managing Hive table data location and lifetime
  • Loading data into Hive tables
  • Partitioning the data
  • Querying tables with HiveQL

In addition to its more usual use as a querying mechanism, Hive can be used to create a simple data warehouse containing table definitions applied to data that you have already processed into the appropriate format. Azure storage is relatively inexpensive, and so this is a good way to create a commodity storage system when you have huge volumes of data. An example of this can be found in Scenario 2: Data warehouse on demand.

Creating tables with Hive

You create tables by using the HiveQL CREATE TABLE statement, which in its simplest form looks similar to the equivalent statement in Transact-SQL. You specify the schema in the form of a series of column names and types, and the type of delimiter that Hive will use to delineate each column value as it parses the data. You can also specify the format for the files in which the table data will be stored if you do not want to use the default format (where data files are delimited by an ASCII code 1 (Octal \001) character, equivalent to Ctrl + A). For example, the following code creates a table named mytable and specifies that the data files for the table should be tab-delimited.

CREATE TABLE mytable (col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

You can also create a table and populate it as one operation by using a CREATE TABLE statement that includes a SELECT statement to query an existing table, as described later in this topic.

Hive supports a sufficiently wide range of data types to suit almost any requirement. The primitive data types you can use for columns in a Hive table are TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, DATE, TIMESTAMP, CHAR, VARCHAR, DECIMAL (though the last five of these are not available in older versions of Hive). In addition to these primitive types you can define columns as ARRAY, MAP, STRUCT, and UNIONTYPE. For more information see Hive Data Types in the Apache Hive language manual.
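For example, the following statement (the table and column names are purely illustrative) defines columns that use the ARRAY and MAP types, and specifies the additional delimiters that Hive should use when parsing the collection values from delimited text. With these settings, a pages_visited value such as home.html|products.html is read as an array, and a query_params value such as region:EU|lang:en is read as a map.

CREATE TABLE pageviews (
  visitor_id    STRING,
  pages_visited ARRAY<STRING>,
  query_params  MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';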

Managing Hive table data location and lifetime

Hive tables are simply metadata definitions imposed on data in underlying files. By default, Hive stores table data in the user/hive/warehouse/table_name path in storage (the default path is defined in the configuration property hive.metastore.warehouse.dir), so the previous code sample will create the table metadata definition and an empty folder at user/hive/warehouse/mytable. When you delete the table by executing the DROP TABLE statement, Hive will delete the metadata definition from the Hive database and it will also remove the user/hive/warehouse/mytable folder and its contents.

Note

Table and column names are not case-sensitive; for example, the name MyTable refers to the same table as mytable.

However, you can specify an alternative path for a table by including the LOCATION clause in the CREATE TABLE statement. The ability to specify a non-default location for the table data is useful when you want to enable other applications or users to access the files outside of Hive. This allows data to be loaded into a Hive table simply by copying data files of the appropriate format into the folder, or to be downloaded directly from storage by other tools. When the table is queried using Hive, the schema defined in its metadata is automatically applied to the data in the files.

An additional benefit of specifying the location is that this makes it easy to create a table for data that already exists in that location (perhaps the output from a previously executed map/reduce job or Pig script). After creating the table, the existing data in the folder can be retrieved immediately with a HiveQL query.

However, one consideration for using managed tables is that, when the table is deleted, the folder it references will also be deleted—even if it already contained other data files when the table was created. If you want to manage the lifetime of the folder containing the data files separately from the lifetime of the table, you must use the EXTERNAL keyword in the CREATE TABLE statement to indicate that the folder will be managed externally from Hive, as shown in the following code sample.

CREATE EXTERNAL TABLE mytable (col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/mydata/mytable';

In HDInsight the location shown in this example corresponds to wasbs://[container-name]@[storage-account-name].blob.core.windows.net/mydata/mytable in Azure storage.

This ability to manage the lifetime of the table data separately from the metadata definition of the table means that you can create several tables and views over the same data, but each can have a different schema. For example, you may want to include fewer columns in one table definition to reduce the network load when you transfer the data to a specific analysis tool, but have all of the columns available for another tool.

As a general guide you should:

  • Use INTERNAL tables (the default, commonly referred to as managed tables) when you want Hive to manage the lifetime of the table or when the data in the table is temporary; for example, when you are running experimental or one-off queries over the source data.
  • Use INTERNAL tables and also specify the LOCATION for the data files when you want to access the data files from outside of Hive; for example, if you want to upload the data for the table directly into the Azure storage location.
  • Use EXTERNAL tables when you want to manage the lifetime of the data, when data is used by processes other than Hive, or if the data files must be preserved when the table is dropped. However, notice that you cannot use EXTERNAL tables when you implicitly create the table by executing a SELECT query against an existing table.

Loading data into Hive tables

In addition to simply uploading data into the location specified for a Hive table, as described in the previous section of this topic, you can use the HiveQL LOAD statement to load data from an existing file into a Hive table. When the LOCAL keyword is included, the LOAD statement copies the file from the local file system to the folder associated with the table.
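For example, the following statements (the file paths shown are hypothetical) illustrate both forms of the statement. Note that when the LOCAL keyword is omitted the file is moved, rather than copied, from its existing location in cluster storage.

-- Copy a file from the local file system into the folder for mytable.
LOAD DATA LOCAL INPATH '/tmp/mydata.txt' INTO TABLE mytable;

-- Move a file that is already in cluster storage into the folder for mytable.
LOAD DATA INPATH '/uploads/mydata.txt' INTO TABLE mytable;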

An alternative is to use a CREATE TABLE statement that includes a SELECT statement to query an existing table. The results of the SELECT statement are used to create data files for the new table. When you use this technique the new table must be an internal, non-partitioned table.
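For example, a CREATE TABLE ... AS SELECT (CTAS) statement similar to the following creates and populates a summary table in a single operation; the new table name and the aggregation shown here are purely illustrative.

CREATE TABLE mytable_summary
AS
SELECT col1, COUNT(*) AS row_count
FROM mytable
GROUP BY col1;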

You can use an INSERT statement to insert the results of a SELECT query into an existing table, and in this case the table can be partitioned. You can use the OVERWRITE keyword with both the LOAD and INSERT statements to replace any existing data with the new data being inserted. You can use the OVERWRITE DIRECTORY keyword to effectively export the data returned by the SELECT part of the statement to an output file.
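For example, the following statements (the export path is hypothetical, and staging_table is assumed to have the same column layout as mytable) show the OVERWRITE keyword being used first to replace the contents of an existing table and then to write the results of a query to files in a storage folder.

-- Replace the existing contents of mytable with the results of the query.
INSERT OVERWRITE TABLE mytable
SELECT col1, col2
FROM staging_table;

-- Write the results of the query to files in the specified folder.
INSERT OVERWRITE DIRECTORY '/mydata/export'
SELECT col1, col2
FROM mytable;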

As a general guide you should:

  • Use the LOAD statement when you need to create a table from the results of a map/reduce job or a Pig script. These scripts generate log and status files as well as the output file when they execute, and using the LOAD method enables you to easily add the output data to a table without having to deal with the additional files that you do not want to include in the table. Alternatively you can move the output file to a different location before you create a Hive table over it.
  • Use the INSERT statement when you want to load data from an existing table into a different table. A common use of this approach is to upload source data into a staging table that matches the format of the source data (for example, tab-delimited text). Then, after verifying the staged data, compress and load it into a table for analysis, which may be in a different format such as a SEQUENCE FILE.
  • Use a SELECT query in a CREATE TABLE statement to generate the table dynamically when you just want simplicity and flexibility. You do not need to know the column details to create the table, and you do not need to change the statement when the source data or the SELECT statement changes. You cannot, however, create an EXTERNAL or partitioned table this way and so you cannot control the data lifetime of the new table separately from the metadata definition.

To compress data as you insert it from one table to another, you must set some Hive parameters to specify that the results of the query should be compressed, and specify the compression algorithm to be used. The raw data for the table is in TextFile format, which is the default storage format. However, compression may mean that Hadoop cannot split the file into chunks/blocks and run multiple map tasks in parallel, which can result in under-utilization of the cluster resources. The recommended practice is to insert the data into another table that is stored in SequenceFile format. Hadoop can split data in SequenceFile format and distribute it across multiple map tasks.

For example, the following HiveQL statements load compressed data from a staging table into another table.

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
CREATE TABLE raw_sequence (line STRING)
   STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/path/file.gz' INTO TABLE raw;
SET hive.exec.compress.output=true; 
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;

The value for io.seqfile.compression.type determines how the compression is performed. The options are NONE, RECORD, and BLOCK. RECORD compresses each value individually, while BLOCK buffers up 1MB (by default) before beginning compression.

Note

For more information about creating tables with Hive, see Hive Data Definition Language on the Apache Hive site. For a more detailed description of using Hive see Hive Tutorial.

Partitioning the data

Advanced options when creating a table include the ability to partition, skew, and cluster the data across multiple files and folders:

  • You can use the PARTITIONED BY clause to create a subfolder for each distinct value in a specified column (for example, to store a file of daily data for each date in a separate folder).
  • You can use the SKEWED BY clause to create separate files for rows where a specified column value is in a list of specified values. Rows with values not listed are stored together in a separate single file.
  • You can use the CLUSTERED BY clause to distribute data across a specified number of subfolders (described as buckets) based on hashes of the values of specified columns. Examples of the SKEWED BY and CLUSTERED BY clauses are shown after this list.
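For example, the following statements (the table names and the skewed values 'A' and 'B' are purely illustrative) show the SKEWED BY and CLUSTERED BY clauses.

-- Store rows where col1 is 'A' or 'B' separately from the remaining rows.
CREATE TABLE mytable_skewed (col1 STRING, col2 INT)
SKEWED BY (col1) ON ('A', 'B') STORED AS DIRECTORIES
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Distribute rows across four buckets based on a hash of col1.
CREATE TABLE mytable_bucketed (col1 STRING, col2 INT)
CLUSTERED BY (col1) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';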

When you partition a table, the partitioning columns are not included in the main table schema section of the CREATE TABLE statement. Instead, they must be included in a separate PARTITIONED BY clause. The partitioning columns can, however, still be referenced in SELECT queries. For example, the following HiveQL statement creates a table in which the data is partitioned by a string value named partcol1.

CREATE EXTERNAL TABLE mytable (col1 STRING, col2 INT)
PARTITIONED BY (partcol1 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/mydata/mytable';

When data is loaded into the table, subfolders are created for each partition column value. For example, you could load the following data into the table.

col1      col2   partcol1
ValueA1   1      A
ValueA2   2      A
ValueB1   2      B
ValueB2   3      B

After this data has been loaded into the table, the /mydata/mytable folder will contain a subfolder named partcol1=A and a subfolder named partcol1=B, and each subfolder will contain the data files for the values in the corresponding partitions.
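Because each partition value maps to its own subfolder, a query that filters on the partitioning column can skip the folders for the other partitions. For example, the following query reads only the files in the partcol1=A subfolder.

SELECT col1, col2
FROM mytable
WHERE partcol1 = 'A';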

When you need to load data into a partitioned table you must include the partitioning column values. If you are loading a single partition at a time, and you know the partitioning value, you can specify explicit partitioning values as shown in the following HiveQL INSERT statement.

FROM staging_table s
INSERT INTO TABLE mytable PARTITION(partcol1='A')
SELECT s.col1, s.col2
WHERE s.col3 = 'A';

Alternatively, you can use dynamic partition allocation so that Hive creates new partitions as required by the values being inserted. To use this approach you must enable dynamic partitioning and set the dynamic partition mode to nonstrict, as shown in the following code sample.

SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true; 

FROM staging_table s
INSERT INTO TABLE mytable PARTITION(partcol1)
SELECT s.col1, s.col2, s.col3;

Querying tables with HiveQL

After you have created tables and loaded data files into the appropriate locations you can query the data by executing HiveQL SELECT statements against the tables. HiveQL SELECT statements are similar to SQL, and support common operations such as JOIN, UNION, GROUP BY, and ORDER BY. For example, you could use the following code to query the mytable table described earlier.

SELECT col1, SUM(col2) AS total
FROM mytable
GROUP BY col1;
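As a further example, the following query joins mytable to a hypothetical lookup table named categories (assumed to contain col1 and category_name columns) and aggregates the results.

SELECT c.category_name, SUM(m.col2) AS total
FROM mytable m
JOIN categories c ON (m.col1 = c.col1)
GROUP BY c.category_name;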

When designing an overall data processing solution with HDInsight, you may choose to perform complex processing logic in custom map/reduce components or Pig scripts and then create a layer of Hive tables over the results of the earlier processing, which can be queried by business users who are familiar with basic SQL syntax. However, you can use Hive for all processing, in which case some queries may require logic that is not possible to define in standard HiveQL functions.

In addition to common SQL semantics, HiveQL supports the use of:

  • Custom map/reduce scripts embedded in a query through the MAP and REDUCE clauses.
  • Custom user-defined functions (UDFs) that are implemented in Java, or that call Java functions available in the existing installed libraries. UDFs are discussed in more detail in the topic User defined functions.
  • XPath functions for parsing XML data using XPath. See Hive and XML File Processing for more information; a brief example follows this list.
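For example, the following query (the table, column, and XML element names are hypothetical) uses the built-in xpath_string and xpath_int functions to extract values from a column that contains XML text.

SELECT xpath_string(xml_data, 'order/customer/name') AS customer_name,
       xpath_int(xml_data, 'order/total') AS order_total
FROM xml_orders;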

This extensibility enables you to use HiveQL to perform complex transformations on data as it is queried. To help you decide on the right approach, consider the following guidelines:

  • If the source data must be extensively transformed using complex logic before being consumed by business users, consider using custom map/reduce components or Pig scripts to perform most of the processing, and create a layer of Hive tables over the results to make them easily accessible from client applications.
  • If the source data is already in an appropriate structure for querying and only a few specific but complex transforms are required, consider using map/reduce scripts embedded in HiveQL queries to generate the required results.
  • If queries will be created mostly by business users, but some complex logic is still regularly required to generate specific values or aggregations, consider encapsulating that logic in custom UDFs because these will be simpler for business users to include in their HiveQL queries than a custom map/reduce script.

Note

For more information about selecting data from Hive tables, see Language Manual Select on the Apache Hive website. For some useful tips on using the SET command to configure headers and directory recursion in Hive see Useful Hive settings.
