Azure Hadoop: Pegasus Page Rank Sample

Overview

This tutorial shows how to deploy Pegasus from the Hadoop on Azure portal to compute the page rank for a simple 16-node graph. The rank calculated for a node is a measure of how well connected it is to the other nodes in the graph structure.

A graph is a type of abstract mathematical structure that consists of a collection of nodes and a collection of edges that connect pairs of these nodes. The Web can be modeled as a graph in which pages are nodes and hyperlinks are (directed) edges. The page rank of a page (node) is a measure of how many other pages have hyperlinks (directed edges) that target that page (node). The higher the value of a page's rank, the more highly connected it is to other pages on the Web. A high page rank typically indicates an important page. The page rank of a page is defined recursively, so highly ranked pages that link to it increase its rank more than poorly ranked pages do.
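The recursive definition can be made concrete with a small iterative sketch. This is not Pegasus's implementation; it is a minimal single-machine illustration that assumes the commonly used damping factor of 0.85 and a hypothetical 4-node graph, neither of which comes from this tutorial.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Iteratively estimate page ranks for nodes numbered 0..num_nodes-1."""
    out_links = {n: [] for n in range(num_nodes)}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start with the rank spread evenly over all nodes.
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Every node receives a small baseline share of rank.
        new_rank = [(1.0 - damping) / num_nodes] * num_nodes
        for src, targets in out_links.items():
            if targets:
                # A node passes its rank on in equal parts to the nodes it links to.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # A node with no outgoing links spreads its rank over all nodes.
                share = damping * rank[src] / num_nodes
                for n in range(num_nodes):
                    new_rank[n] += share
        rank = new_rank
    return rank


# Hypothetical graph: pages 0, 1 and 2 all link to page 3, and page 3 links back to page 0.
example_edges = [(0, 3), (1, 3), (2, 3), (3, 0)]
example_ranks = pagerank(example_edges, num_nodes=4)
```

In this example, page 3 ends up with the highest rank because three pages link to it, and page 0 outranks pages 1 and 2 because its single incoming link comes from the highly ranked page 3.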

Pegasus

Pegasus is an open source graph mining library implemented in a distributed manner on top of Hadoop. Pegasus provides large scale algorithms for various graph mining tasks:

  • Degree
  • PageRank
  • Random Walk with Restart
  • Radius
  • Connected Components

This form of analysis is applicable to many networked structures other than the Web, such as computer networks and social networks, that can be modeled as graphs. Pegasus was developed at the School of Computer Science, Carnegie Mellon University. For more information, see the Pegasus Project site.

Goals

In this tutorial you see three things:

  1. How Pegasus input and output files are structured.

  2. How to use Hadoop on Azure to deploy a Pegasus page rank analysis.

  3. How to use the Interactive Console in Hadoop on Azure to examine the results computed by Pegasus for the page rankings of the nodes.

Key technologies

Setup and configuration

To work through this tutorial, you must have an account to access Hadoop on Azure and have created a cluster. To obtain an account and create a Hadoop cluster, follow the instructions outlined in the Getting started with Microsoft Hadoop on Azure section of the Introduction to Hadoop on Azure topic.

Tutorial

This tutorial is composed of the following segments:

  1. How to clean up a previous deployment of the Pegasus Pagerank algorithm from Hadoop on Azure.
  2. How to deploy the Pagerank algorithm in Pegasus from Hadoop on Azure.
  3. How to inspect the output from the Pegasus Pagerank algorithm in the Interactive Console of Hadoop on Azure.

How to clean up a previous deployment of the Pegasus Pagerank algorithm from Hadoop on Azure

This segment is only needed if you have already completed a deployment of the Pegasus page rank algorithm on the current Hadoop cluster.

From your Account page, scroll down to the Interactive Console icon in the Your cluster section and click the icon to open the console.

Enter the following commands at the js> prompt to delete the output directories from the previous job. If these directories exist, a new job fails.

js> #rmr pr_tempmv  
js> #rmr pr_output  
js> #rmr pr_minmax  
js> #rmr pr_distr   

How to deploy the Pagerank algorithm in Pegasus from Hadoop on Azure

From your Account page, scroll down to the Samples icon in the Manage your account section and click it to get to the Samples Gallery. 

Click the Pegasus Pagerank sample icon in the Hadoop Sample Gallery to open the Deployment page for the sample. 

The catepillar_star.edge file is the input file that contains the graph to be analyzed by Pegasus. Download it and open it with Notepad or any other program that opens text files. Each line specifies an edge in the graph. The format for a line is: source node Id, followed by a TAB, followed by the destination node Id. So if this file described a graph of Web pages, the first line would say that there is a hyperlink from the page with Id = 0 to the page with Id = 1.
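As an illustration of this line format, the following minimal Python sketch turns TAB-separated lines into (source, destination) pairs. The function name and the sample lines are hypothetical; only the first sample line, an edge from node 0 to node 1, matches what the tutorial describes.

```python
def parse_edge_lines(lines):
    """Parse 'source<TAB>destination' lines into a list of (int, int) edges."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        src, dst = line.split("\t")
        edges.append((int(src), int(dst)))
    return edges


# Hypothetical sample; the real input would be the lines of catepillar_star.edge.
sample_edge_lines = ["0\t1\n", "1\t2\n"]
sample_edges = parse_edge_lines(sample_edge_lines)
```

To read the downloaded file instead, pass an open file object, for example `parse_edge_lines(open("catepillar_star.edge"))`, assuming the file is in the current directory.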

Click the Deploy to your cluster button to deploy the sample and bring up the Create Job page.

The Final Command, which begins with "Hadoop jar pegasus-2.0.jar", takes 9 parameters. The Pegasus shell uses only a subset of the parameters in the following list to run the runpr.sh command. The first parameter (pegasus.PagerankNaive) specifies the PageRank-plain algorithm used by the runpr.sh command. The Hadoop on Azure portal requires the other parameters.

  • pegasus.PagerankNaive - the Pegasus algorithm
  • /user/SYSTEM/graph - HDFS directory where the input edge file is located.
  • pr_tempmv - temp output directory.
  • pr_output - output directory.
  • 16 - the number of nodes.
  • 1 - the number of reducers.
  • 1024 - the number of iterations.
  • nosym - for directed edges; use makesym for undirected edges.
  • new - new to start a calculation, contNN to continue one.
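To see how the nine parameters fit together, the sketch below assembles them into a single command string. It assumes the parameters are passed in the order listed above; the labels in the first column are hypothetical names added for readability and are not part of the command.

```python
# (label, value) pairs; labels are illustrative only, values come from the list above.
job_parameters = [
    ("algorithm", "pegasus.PagerankNaive"),
    ("input directory", "/user/SYSTEM/graph"),
    ("temp output directory", "pr_tempmv"),
    ("output directory", "pr_output"),
    ("number of nodes", "16"),
    ("number of reducers", "1"),
    ("number of iterations", "1024"),
    ("edge direction", "nosym"),
    ("start mode", "new"),
]

# Assumption: the values follow the command prefix in the listed order.
final_command = " ".join(
    ["Hadoop", "jar", "pegasus-2.0.jar"] + [value for _, value in job_parameters]
)
print(final_command)
```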

Click the Execute job button to start the job. When the job finishes, the Status in the Job Info section at the top of the page has the value "completed Successfully".

How to inspect the output from the Pegasus Pagerank algorithm in the Interactive Console of Hadoop on Azure

Return to your Account page, scroll down to the Interactive Console icon in the Your cluster section and click on the icon to open the console.

The output directories for the Pegasus Pagerank job are pr_vector, pr_minmax, and pr_distr. The pr_vector directory contains the page rank for each node. Each line has the format: node Id, followed by a TAB, followed by a "v", followed by the PageRank value. To locate the directories, use the list command #ls. Then, to see the files in the pr_vector directory, use the #ls /user/bradsev/pr_vector command, replacing bradsev with your own user name. The results are in the part-00000 file, so use the #cat pr_vector/part-00000 command to see the PageRank value for each node. Node 6 has the highest value.
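If you copy the contents of part-00000 out of the console, the line format described above can be parsed with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the sample values are made up for illustration and are not the actual job output.

```python
def parse_rank_lines(lines):
    """Parse 'nodeId<TAB>v<rank>' lines into a {node_id: rank} dictionary."""
    ranks = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        node_id, value = line.split("\t")
        # Drop the leading "v" marker before converting the value to a float.
        ranks[int(node_id)] = float(value.lstrip("v"))
    return ranks


def top_node(ranks):
    """Return the node Id with the highest rank."""
    return max(ranks, key=ranks.get)


# Made-up values for illustration only.
sample_rank_lines = ["0\tv0.25", "6\tv0.5", "7\tv0.125"]
sample_ranks = parse_rank_lines(sample_rank_lines)
best = top_node(sample_ranks)
```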

The pr_minmax output directory contains the minimum and maximum PageRank values for the graph and can be inspected in the same way. The minimum PageRank is the second column of the line that starts with "0". The maximum PageRank is the second column of the line that starts with "1". So the minimum value found here is 0.009374998509883881 and the maximum value is 0.024460836507193662.
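The two-line layout of the minmax output can be read with a short Python sketch like the one below. It assumes the two columns are separated by whitespace, which the tutorial does not state explicitly; the function name is hypothetical, and the values are the ones quoted above.

```python
def parse_minmax_lines(lines):
    """Return (minimum, maximum) from lines keyed by "0" (min) and "1" (max)."""
    values = {}
    for line in lines:
        columns = line.split()  # assumption: whitespace-separated columns
        if len(columns) >= 2:
            values[columns[0]] = float(columns[1])
    return values["0"], values["1"]


minmax_lines = ["0\t0.009374998509883881", "1\t0.024460836507193662"]
min_rank, max_rank = parse_minmax_lines(minmax_lines)
```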

The pr_distr output directory contains a histogram of the PageRank values for the graph. The histogram divides the range (min_PageRank, max_PageRank) into bins and shows the number of nodes whose PageRank falls into each bin. The bins are defined in the first column and the number of nodes in the second column. You can inspect the output file by using the #cat pr_distr/part-00000 command.
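The binning idea can be sketched in Python as follows. This is an illustration of equal-width binning between the minimum and maximum values, not a reproduction of how Pegasus numbers or sizes its bins; the function name, the bin count, and the sample values are all hypothetical.

```python
def rank_histogram(ranks, num_bins):
    """Count how many rank values fall into each of num_bins equal-width bins."""
    lowest, highest = min(ranks), max(ranks)
    width = (highest - lowest) / num_bins
    counts = [0] * num_bins
    for rank in ranks:
        if width == 0:
            index = 0  # all values are identical, so they share one bin
        else:
            # Clamp so that the maximum value lands in the last bin.
            index = min(int((rank - lowest) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Made-up rank values for illustration only.
sample_counts = rank_histogram([0.0, 0.1, 0.2, 0.9, 1.0], num_bins=2)
```

Here the first bin covers the lower half of the range and holds three values, while the second bin holds the remaining two.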

Summary

In this tutorial, you have seen how Pegasus input and output files for a page rank computation are structured, how to use Hadoop on Azure to deploy a Pegasus page rank analysis, and how to use the Interactive Console in Hadoop on Azure to examine the results.