Share via


Hadoop on Azure : Introduction

Introduction

5xeb23px

I am in complete awe on how this technology is resonating with today’s developers. If I invite developers for an evening event, Big Data is always a sellout.

This particular post is about getting everyone up to speed about what Hadoop is at a high level.

Big data is a technology that manages voluminous amount of unstructured and semi-structured data.

Relational databases fall short where data is very large or where the data doesn't map perfectly into a relational structure.

Big data is generally in the petabytes and exabytes of data.

  1. However, it is not just about the total size of data (volume)
  2. It is also about the velocity (how rapidly is the data arriving)
  3. What is the structure? Does it have variations?

Sources of Big Data

Science Scientists are regularly challenged by large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research.
Sensors Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks.
Social networks I am thinking of Facebook, LinkedIn, Yahoo, Google
Social influencers Blog comments, YELP likes, Twitter, Facebook likes, Apple's app store, Amazon, ZDNet, etc
Log files Computer and mobile device log files, web site tracking information, application logs, and sensor data. But there are also sensors from vehicles, video games, cable boxes or, soon, household appliances
Public Data Stores Microsoft Azure MarketPlace/DataMarket, The World Bank, SEC/Edgar, Wikipedia, IMDb
Data warehouse appliances Teradata, IBM Netezza, EMC Greenplum, which includes internal, transactional data that is already prepared for analysis
Network and in-stream monitoring technologies Packets in TCP/IP, email, etc
Legacy documents Archives of statements, insurance forms, medical record and customer correspondence

Two problems to solve

Storage Problem How do I store a petabyte of data reliably? Afterall, a petabyte is over 333 three TB drives.
Money Problem 1 petabyte costs a lot. For just 70 TB you will pay over $100,000. (eBay ad Dell/EMC CLARiiON CX3-40 -70TB- FAST 4G 15K SAN Storage is only 70 TB for $112,000)

Two seminal papers

The Google File System It is about a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients https://research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key https://research.google.com/archive/mapreduce.html

Hadoop : What is it?

  1. Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is written in Java.
  2. I met its creator, Doug Cutting, who was working at Yahoo at the time. Hadoop is named after his son's toy elephant. I was hosting a booth at the time, and I remember Doug was curious about finding some cool stuff to bring home from the booth to give to his son. Another great idea, Doug!
  3. One of the goals of Hadoop is to run applications on large clusters of commodity hardware. The cluster is composed of a single master and multiple worker nodes.
  4. Hadoop leverages the the programming model of map/reduce. It is optimized for processing large data sets.
  5. MapReduce is typically used to do distributed computing on clusters of computer. A cluster had many “nodes,” where each node is a computer in a cluster.
  6. The goal of map reduce is to break huge data sets into smaller pieces, distribute those pieces to various slave or worker nodes in the cluster, and process the the data in parallel. Hadoop leverages a distributed file system to store the data on various nodes.

It is about two functions

Hadoop comes down to two functions. As long as you can write the map() and reduce() function, your data type is supported, whether we are talking abuot (1) text files (2) xml files (3) json files (4) even graphics, sound or video files.

The core is map() and reduce()
Understanding these methods is the key to mastering Hadoop
Map Step The map step is all about dividing the problem into smaller sub-problems. A master node has the job of distributing the work to worker nodes. The worker node just does one thing and returns the work back to the master node.
Reduce Step Once the master gets the work from the worker nodes, the reduce() step takes over and combines all the work. By combining the work you can form some answer and ultimately output.
Map Reduce Code
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465 public class WordCount {    public static class Map extends MapReduceBase      implements Mapper<LongWritable, Text, Text, IntWritable>      {      private final static IntWritable one = new IntWritable(1);      private Text word = new Text();      public void map(LongWritable key, Text value,                       OutputCollector<Text, IntWritable> output,                       Reporter reporter) throws IOException         {            String line = value.toString();            StringTokenizer tokenizer = new StringTokenizer(line);            while (tokenizer.hasMoreTokens())             {              word.set(tokenizer.nextToken());              output.collect(word, one);            }        }    }    public static class Reduce            extends MapReduceBase            implements Reducer<Text, IntWritable, Text, IntWritable>     {        public void reduce(Text key, Iterator<IntWritable> values,                          OutputCollector<Text, IntWritable> output,                          Reporter reporter) throws IOException         {            int sum = 0;            while (values.hasNext())             {               sum += values.next().get();            }            output.collect(key, new IntWritable(sum));        }    }    public static void main(String[] args) throws Exception     {      JobConf conf = new JobConf(WordCount.class);      conf.setJobName("wordcount");      conf.setOutputKeyClass(Text.class);      conf.setOutputValueClass(IntWritable.class);      conf.setMapperClass(Map.class);      conf.setCombinerClass(Reduce.class);      conf.setReducerClass(Reduce.class);      conf.setInputFormat(TextInputFormat.class);      conf.setOutputFormat(TextOutputFormat.class);      FileInputFormat.setInputPaths(conf, new Path(args[0]));      FileOutputFormat.setOutputPath(conf, new Path(args[1]));      JobClient.runJob(conf);    }}

The “map” in MapReduce

  1. There is a master node and many slave nodes.
  2. The master node takes the input, divides it into smaller sub-problems, and distributes the input to worker or slave nodes. worker node may do this again in turn, leading to a multi-level tree structure.
  3. The worker/slave nodes processes the data into a smaller problem, and passes the answer back to its master node.
  4. Each mapping operation is independent of the others, all maps can be performed in parallel.

The “reduce” in MapReduce

  1. The master node then collects the answers from the worker or slave nodes. It then aggregates the answers and creates the needed output, which is the answer to the problem it was originally trying to solve.
  2. Reducers can also preform the reduction phase in parallel. That is how the system can process petabytes in a matter of hours.

Their are 3 key methods

The map() function will generate a list of key/value pairs based on the data
The shuffle() phase will bring things together for the reduce() phase
The reduce() phase will take the list of key/value pairs and hand that to you to do something with.
  1. The Hello World sample for Hadoop is a word count example.
  2. Let's assume our quote is this:
    • It is time for all good men to come to the aid of their country.
    map() function (see the "to" part) finds "to" twice (It, 1) (is, 1) (time, 1) (for, 1) (all, 1) (good, 1) (to, 1) (men, 1) (to, 1) (come, 1) (the, 1) (aid, 1) (of, 1) (their, 1) (country, 1)
    shuffle() function (see the "to" part) creates (to, 1, 1) (It, 1) (is, 1) (time, 1) (for, 1) (all, 1) (good, 1) (to, 1, 1) (men, 1) (come, 1) (the, 1) (aid, 1) (of, 1) (their, 1) (country, 1)
    reduce() function (see the "to" part) creates (to, 2) (It, 1) (is, 1) (time, 1) (for, 1) (all, 1) (good, 1) (men, 1) (to, 2) (come, 1) (the, 1) (aid, 1) (of, 1) (their, 1) (country, 1)

gsnjbqtb

High-level Architecture

zlkw4wsp

  1. There are two main layers to both the master node and the slave nodes – the MapReduce layer and the Distributed File System Layer. The master node is responsible for mapping the data to slave or worker nodes.

Hadoop is a platform

Hadoop Common The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS) A distributed file system that provides high-throughput access to application data.
Hadoop YARN A framework for job scheduling and cluster resource management.
Hadoop MapReduce A YARN-based system for parallel processing of large data sets.

There also related modules that are commonly associated with Hadoop.

Apache Pig A platform for analyzing large data sets.

It includes a high-level language for expressing data analysis programs

A key point of Pig programs is that they support substantial parallelization

Pig consists of a compiler that produces sequences of Map-Reduce programs

Pig's language layer currently consists of a textual language called Pig Latin

Hive Hive is a data warehouse system for Hadoop.It provides a SQL-like language.It helps with data summarization and ad-hoc queries. I am not sure yet whether this is required with Hadoop on Azure. If not, just a few command line tasks to do:
Install Hive tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME $ cd hive-x.y.z/ $ export HIVE_HOME={{pwd}}
Add $HIVE_HOME/bin to your PATH: $ export PATH=$HIVE_HOME/bin:$PATH
But from what I saw here, looks like there is an ODBC Hive Setup Module. WehnMing Ye has this video: https://channel9.msdn.com/Events/windowsazure/learn/Hadoop-on-Windows-Azure

I signed up

I recently signed up for the Windows Azure HDInsight Service here https://www.hadooponazure.com/.

Logging in

cezw1k2b

After logging in, you will be presented with this screen:

qf2laifm

Next post : Calculate PI with Hadoop

  1. We will create a job name called “Pi Example.” This very simple sample will calculate PI using a cluster of comptuers.
  2. This is not necessarily the best example of big data, it is more of a compute problem.
  3. The final command line will look like this:
    1. Hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar pi 16 10000000
  4. More details on this sample coming soon.