Sdílet prostřednictvím


Introduction to #Hive Collections

After a much needed vacation in the sunny Florida Keys and some time away from the work and blogosphere world, its time to get back on the hamster wheel.

Like most RDBMS systems Hive supports a number of different primitive data types including various size integers, precision floating point, boolean, timestamp and of course strings. Beyond those basics, Hive supports three data types not typically found in other database systems. These three data types: array, struct and map are collection implementations which are highly beneficial in the world of big data where traditional rules of normal forms do not apply.

So in the spirit of my favorite season of the year.....college football season....we are going to use a college football example to introduce the basics for each of these three data types before discussing how they are declared and subsequently queried in Hive.

Arrays

An array is an ordered or sequenced collection of items whose data types are all the same such as strings or integers. Looking at a pseudo-coded example this would look something like this:

 Array('University of Maryland', 'University of Kentucky', 'Texas A&M University', 'University of Alabama')

Since this collection of "things" has a defined order or sequence it is addressable via a zero-based index meaning that University of Maryland would be at Array[0] and University of Alabama would be found at Array[3].

Struct

For those of us who come from a developer background this one will look familiar. A struct or structure is essentially an "object" that  consists of defined fields with each field having its own data type. We could define a name struct that looks like:

 Struct(FirstName: string, NickName: string, LastName:string)

The pseudo-coded example that follows is simply:

 Struct('Paul', 'Bear', 'Bryant')

To access fields in a structure, the dot notation that is found in most programming languages is used. Using the above example the coaches name could be accessed using to produce the name of the legendary coach Paul Bear Bryant:

 Struct.FirstName + ' ' + Struct.NickName + ' ' + Struct.LastName

Map

The Map data type is nothing more than a collection of key/value pairs or tuples defined as:

 Map<string, integer>

The first data type (the string) defines the type used for the key and the second data type type for the value. Using this collection we could populate something like a win, loss and tie record using the following pseudo-code:

 Map('wins', 323, 'losses', 85, 'ties', 17)

Access to items within the collection is handled through what's called array notation. For example suppose you wanted to get the number of wins from the map, you would use Map['wins'] to get 323.

Declaring Collections

Now that you have an idea of the data types that make up the collection support within Hive. Let's look at a full table definition that declares each type previously discussed:

 CREATE TABLE coaches (
    id     INTEGER,
    name   STRUCT<firstname:STRING, nickname:STRING, lastname:STRING>,
    dob    TIMESTAMP,
    teams  ARRAY<STRING>,
    record MAP<STRING, INTEGER>    
);

Note that to this point, we have not defined any encoding or delimiter information which is important since the Hive applies the schema on read. Before looking at an expanded definition that includes a delimiter specification, review the default Hive delimiters in the table below.

Delimiter Code Description
\n \n Record or row delimiter
^A 01 Field delimiter
^B 02 Element delimiter in ARRAYs and STRUCTs
^C 03 Delimits key/value pairs in a MAP

Using the default delimiters, the full table declaration looks like (Side Note: its not necessary to specify these declarations if you are not overriding the default delimiters):

 CREATE TABLE coaches (
    id     INTEGER,
    name     STRUCT<firstname:STRING, nickname:STRING, lastname:STRING>,
    dob     TIMESTAMP,
    teams     ARRAY<STRING>,
    record    MAP<STRING, INTEGER>    
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '01' 
COLLECTION ITEMS TERMINATED BY '02' 
MAP KEYS TERMINATED BY '03' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE;

Although repetitive, its worth repeating at this point that Hive applies schema on read. The above schema is only as good as the formatting of the underlying data. If the format of the data does not map to the specified schema, Hive handles this by returning NULLs. A sample of the expected format for the above sample is provided below (line breaks added for clarity):

 Pauk^BBear^BBryant^A09/11/1913^A
University of Maryland^BUniversity of Kentucky^B
Texas A&M University^BUniversity of Alabama^A
323^C85^C^B17^A

Querying Collections

Accessing collections via a query is a straight-forward matter. The entire collection can be dumped or projected simply by referencing the field or column name as is typical in a normal RDBMS query:

 SELECT
    id,
    name,
    dob,
    teams,
    record
FROM coaches

The results of the above query are returned as normal with the three collection data types output in JSON format. The example output for each data type is detailed in the following table:

Field Type Example
name struct {"firstname":"Paul", "nickname":"Bear", "lastname":"Bryant"}
teams array ["University of Maryland", "University of Kentucky", "Texas A&M University", "University of Alabama"]
record map {"wins":323,"losses":85,"ties":17}

While dumping an entire collection is useful in some scenarios, its more typical during the course of a query to extract either a subset or an individual item out of a collection. Hive supports this use case in a couple of different forms:

  1. ARRAY data types can be queried using a zero-based array notation. The query below returns the item index 3 from the collection which happens to be the 4th item of the team array or the University of Alabama. If you request an index that does not exists, such as teams[10] no error will be raised and a simple NULL is returned.

     SELECT
        teams[3]
    FROM coaches
    
  2. The MAP data type can also be queried using the array notation. The primary difference is that instead of using an ordinal index to access the collection members, the map keys are passed. To return the Bear Bryant's career wins, you could use the following query:

     SELECT
        record["wins"]
    FROM coaches
    
  3. The STRUCT data type as mentioned earlier in the blog can be accessed through dot notation. To access the individual fields of the collection or object the format takes the form of <Column Name>.<Field Name> as seen below.

     SELECT
        name.nickname
    FROM coaches
    

Wrap-Up

In this post, we introduced the built-in collection support available in Hive. Collections can be implemented through the ARRAY, MAP and STRUCT data types. You can control both over how Hive parses data by customizing the column or field delimiters and how data can be retrieved or projected out. In the next post, we will dig a little deeper into Hive as we look at partitioning and bucketing.