Udostępnij za pośrednictwem


Hive and XML File Processing

When I put together the “Generics based Framework for .Net Hadoop MapReduce Job Submission” code one of the goals was to support XML file processing. This was achieved by the creation of a modified Mahout document reader where one can specify the XML node to be presented for processing. But what if ones wants to process XML documents in Hive. Fortunately Hive similarly supports document readers, thus enabling the same document readers to be used as the basis of table definitions.

The process of enabling XML processing in Hive is relatively straightforward:

  • Create the table definition specifying that the input format is XML; thus exposing the necessary XML elements as columns
  • Parse the XML column data using xpath expressions in SELECT statements
  • or – Define a view on the XML table parsing out the relevant XML elements, returning them as native types

The syntax for the xpath processing in Hive can be found at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+XPathUDF

So onto a simple example.

In “SampleScripts” folder of the MapReduce Framework download there is a script that extracts Store information, in XML format, from the sample AdventureWorks database. A sample of the output is as follows:

  1. <Root>
  2.   <Store>
  3.     <BusinessEntityID>292</BusinessEntityID>
  4.     <Name>Next-Door Bike Store</Name>
  5.     <SalesPersonID>279</SalesPersonID>
  6.     <Demographics>
  7.       <StoreSurvey xmlns="https://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey">
  8.         <AnnualSales>800000</AnnualSales>
  9.         <AnnualRevenue>80000</AnnualRevenue>
  10.         <BankName>United Security</BankName>
  11.         <BusinessType>BM</BusinessType>
  12.         <YearOpened>1996</YearOpened>
  13.         <Specialty>Mountain</Specialty>
  14.         <SquareFeet>21000</SquareFeet>
  15.         <Brands>2</Brands>
  16.         <Internet>ISDN</Internet>
  17.         <NumberEmployees>13</NumberEmployees>
  18.       </StoreSurvey>
  19.     </Demographics>
  20.     <Modified>2008-10-13T11:15:07.497</Modified>
  21.   </Store>
  22.   ...
  23.   <Store>
  24.     <BusinessEntityID>374</BusinessEntityID>
  25.     <Name>Immense Manufacturing Company</Name>
  26.     <SalesPersonID>277</SalesPersonID>
  27.     <Demographics>
  28.       <StoreSurvey xmlns="https://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey">
  29.         <AnnualSales>3000000</AnnualSales>
  30.         <AnnualRevenue>300000</AnnualRevenue>
  31.         <BankName>Guardian Bank</BankName>
  32.         <BusinessType>OS</BusinessType>
  33.         <YearOpened>1998</YearOpened>
  34.         <Specialty>Touring</Specialty>
  35.         <SquareFeet>76000</SquareFeet>
  36.         <Brands>4+</Brands>
  37.         <Internet>DSL</Internet>
  38.         <NumberEmployees>73</NumberEmployees>
  39.       </StoreSurvey>
  40.     </Demographics>
  41.     <Modified>2008-10-13T11:15:07.497</Modified>
  42.   </Store>
  43. </Root>

 

The record reader to be used will be “XmlElementStreamingInputFormat”. This document reader using a configuration element to define the XML node to be located, which then outputs for each row a single column consisting of the complete node contents.

Using this record reader a table can be defined consisting of a single XML column:

add JARS file:///C:/Users/Carl/Projects/MSDN.Hadoop.MapReduce/Release/msdn.hadoop.readers.jar;
set xmlinput.element=Store;

CREATE EXTERNAL TABLE StoresXml (storexml string)
STORED AS INPUTFORMAT 'msdn.hadoop.mapreduce.input.XmlElementStreamingInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/Carl/stores/demographics';

The INPUTFORMAT option allows for the definition of the required document reader. The OUTPUTFORMAT specified is the default for Hive. In this example I have defined an EXTERNAL table over the directory containing the extracted XML; independently copied to the Hadoop cluster.

The ADD JARS statement ensures the document reader is available for job execution. The SET statement configures the job such that the document reader knows what XML node to extract.

Once this table is defined you can use it like any other Hive table. If one selects from this table you will get each Store element as a row. However, xpath processing allows you to extract the XML attributes as native types.

In addition to SELECT operations one also has the option of creating a VIEW that parses the XML and presents the data using native types:

CREATE VIEW Stores(BusinessEntityID, BusinessType, BankName, AnnualSales, AnnualRevenue) AS
SELECT
xpath_int (storexml, '/Store/BusinessEntityID'),
xpath_string (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'BusinessType\']'),
xpath_string (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'BankName\']'),
xpath_double (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'AnnualSales\']'),
xpath_double (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'AnnualRevenue\']')
FROM StoresXml;

Using the Stores definition one can now process the XML data files through the normal Hive operations. Continuing with the same samples in the download one can now easily generate a revenue summary across the banks:

SELECT BusinessType, BankName, CAST(SUM(AnnualSales) AS INT) AS TotalSales FROM Stores
GROUP BY BusinessType, BankName;

As expected, under the covers the necessary MapReduce jobs are executed to aggregate the data, returning:

BM Guardian Bank 43200000
BM International Bank 43200000
BM International Security 43200000
BM Primary Bank & Reserve 42800000
BM Primary International 42200000
BM Reserve Security 42200000
BM United Security 42200000
BS Guardian Bank 88400000
BS International Bank 87400000
BS International Security 88400000
BS Primary Bank & Reserve 88400000
BS Primary International 87400000
BS Reserve Security 87400000
BS United Security 87400000
OS Guardian Bank 192000000
OS International Bank 186000000
OS International Security 186000000
OS Primary Bank & Reserve 186000000
OS Primary International 186000000
OS Reserve Security 186000000
OS United Security 186000000

If you download the aforementioned code, the sample for the Hive execution can be found in the “SampleScripts” folder. The “DocumentInputReaders” folder also contains the XML document reader classes along with a usable JAR file.

Comments

  • Anonymous
    February 07, 2013
    Carl,I have a question about this class org.apache.mahout.classifier.bayes.XmlElementStreamingInputFormatWhere did it come from?  I can't see it anywhere in the Mahout source distribution.Is this a class that you wrote?  If so, it would really be better if you didn't use the org.apache.mahout package, if only for the sake of your readers.Feel free to email me if you have questions at tdunning at apache.org or on the Mahout dev mailing list.

  • Anonymous
    February 22, 2013
    The comment has been removed

  • Anonymous
    March 03, 2013
    The error relating to java.lang.UnsupportedClassVersionError: org/apache/mahout/classifier/bayes/XmlElementStreamingInputFormat : Unsupported major.minor version 51.0 could be down to the current code being compiled with the Oracle SDK. Are you running this on Azure. If so you may need to recompile the Java classes. There is a script to do this, just point it to the correct javac exe.

  • Anonymous
    February 24, 2014
    Is there a way to get the "xmlinput.element" property set in the TBLPROPERTIES during the CREATE?  Something similar to how the mahout XmlInputFormat does with "xmlinput.start" and "xmlinput.end"?

  • Anonymous
    October 22, 2014
    What happens if there is a text node that is inside "store" but untagged? How would you make this into a field?

  • Anonymous
    July 07, 2015
    Dear Good Post. "Add JAR", "SET", "CREATE TABLE" and "LOAD FILE" worked without any error.The loaded file exist in external table path mentioned in create table but select * from table name did not return any rows. Any idea what am I doing wrong? Regards Mani

  • Anonymous
    December 20, 2015
    How to download Microsoft Hive Serde for XML Parsing ? Please provide the "Download Link".