Running Apache Mahout at Hadoop on Windows Azure (www.hadooponazure.com)
Once you have access to Hadoop on Windows Azure, you can run any Mahout sample on the head node. Here I run an original Apache Mahout (https://mahout.apache.org/) sample, which is derived from the clustering sample on Mahout's website (https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data).
Step 1: Connect to your head node over RDP and open the Hadoop command line window.
Here you can simply launch mahout to see what happens.
Step 2: Download the necessary data file from the Internet:
Please download the synthetic control data from https://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data and save it as c:\apps\dist\mahout\examples\bin\work\synthetic_control.data
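Before running the driver, it can be worth checking that the download is intact. The following is a minimal Python sketch and not part of the original walkthrough. It assumes the file holds 600 whitespace-separated rows of 60 numeric values each, which is consistent with the "Map input records=600" counter and the 60-element cluster center that appear in the job output further down. The function name check_synthetic_control is hypothetical.

```python
from pathlib import Path

# Assumed shape of synthetic_control.data: 600 rows x 60 values. This is an
# assumption inferred from the job counters in this post, not a documented fact.
EXPECTED_ROWS = 600
EXPECTED_COLS = 60


def check_synthetic_control(lines, rows=EXPECTED_ROWS, cols=EXPECTED_COLS):
    """Return a list of problems found; an empty list means the data looks sane."""
    problems = []
    count = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        count += 1
        fields = line.split()
        if len(fields) != cols:
            problems.append(f"line {number}: expected {cols} values, found {len(fields)}")
            continue
        try:
            [float(field) for field in fields]
        except ValueError:
            problems.append(f"line {number}: non-numeric value")
    if count != rows:
        problems.append(f"expected {rows} rows, found {count}")
    return problems


if __name__ == "__main__":
    # Path taken from Step 2 of this post.
    path = Path(r"c:\apps\dist\mahout\examples\bin\work\synthetic_control.data")
    if path.exists():
        with path.open() as handle:
            for problem in check_synthetic_control(handle):
                print(problem)
    else:
        print(f"{path} not found")
```

If the script prints nothing, the file matches the assumed shape.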
Step 3: Go to the folder c:\apps\dist\mahout\examples\bin, run the command "build-cluster-syntheticcontrol.cmd", and select the desired clustering algorithm when the driver script prompts for it.
c:\Apps\dist\mahout\examples\bin>build-cluster-syntheticcontrol.cmd
"Please select a number to choose the corresponding clustering algorithm"
"1. canopy clustering"
"2. kmeans clustering"
"3. fuzzykmeans clustering"
"4. dirichlet clustering"
"5. meanshift clustering"
Enter your choice:1
"ok. You chose 1 and we'll use canopy Clustering"
"DFS is healthy... "
"Uploading Synthetic control data to HDFS"
rmr: cannot remove testdata: No such file or directory.
"Successfully Uploaded Synthetic control data to HDFS "
"Running on hadoop, using HADOOP_HOME=c:\Apps\dist"
c:\Apps\dist\bin\hadoop jar c:\Apps\dist\mahout\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver org.apache.mahout.clustering.syntheticcontrol.canopy.Job
12/03/06 00:50:10 WARN driver.MahoutDriver: No org.apache.mahout.clustering.syntheticcontrol.canopy.Job.props found on classpath, will use command-line arguments only
12/03/06 00:50:10 INFO canopy.Job: Running with default arguments
12/03/06 00:50:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/06 00:50:18 INFO input.FileInputFormat: Total input paths to process : 1
12/03/06 00:50:20 INFO mapred.JobClient: Running job: job_201203052259_0001
12/03/06 00:50:21 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 00:51:00 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 00:51:11 INFO mapred.JobClient: Job complete: job_201203052259_0001
12/03/06 00:51:11 INFO mapred.JobClient: Counters: 16
12/03/06 00:51:11 INFO mapred.JobClient: Job Counters
12/03/06 00:51:11 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=33969
12/03/06 00:51:11 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/06 00:51:11 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/06 00:51:11 INFO mapred.JobClient: Launched map tasks=1
12/03/06 00:51:11 INFO mapred.JobClient: Data-local map tasks=1
12/03/06 00:51:11 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/03/06 00:51:11 INFO mapred.JobClient: File Output Format Counters
12/03/06 00:51:11 INFO mapred.JobClient: Bytes Written=335470
12/03/06 00:51:11 INFO mapred.JobClient: FileSystemCounters
12/03/06 00:51:11 INFO mapred.JobClient: FILE_BYTES_READ=130
12/03/06 00:51:11 INFO mapred.JobClient: HDFS_BYTES_READ=288508
12/03/06 00:51:11 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21557
12/03/06 00:51:11 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=335470
12/03/06 00:51:11 INFO mapred.JobClient: File Input Format Counters
12/03/06 00:51:11 INFO mapred.JobClient: Bytes Read=288374
12/03/06 00:51:11 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 00:51:11 INFO mapred.JobClient: Map input records=600
12/03/06 00:51:11 INFO mapred.JobClient: Spilled Records=0
12/03/06 00:51:11 INFO mapred.JobClient: Map output records=600
12/03/06 00:51:11 INFO mapred.JobClient: SPLIT_RAW_BYTES=134
12/03/06 00:51:11 INFO canopy.CanopyDriver: Build Clusters Input: output/data Out: output Measure: org.apache.mahout.common.distance.EuclideanDistanceMeasure@1997c1d8 t1: 80.0 t2: 55.0
12/03/06 00:51:11 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/06 00:51:12 INFO input.FileInputFormat: Total input paths to process : 1
12/03/06 00:51:13 INFO mapred.JobClient: Running job: job_201203052259_0002
12/03/06 00:51:14 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 00:51:58 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 00:52:16 INFO mapred.JobClient: map 100% reduce 100%
12/03/06 00:52:27 INFO mapred.JobClient: Job complete: job_201203052259_0002
12/03/06 00:52:27 INFO mapred.JobClient: Counters: 25
12/03/06 00:52:27 INFO mapred.JobClient: Job Counters
12/03/06 00:52:27 INFO mapred.JobClient: Launched reduce tasks=1
12/03/06 00:52:27 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=30345
12/03/06 00:52:27 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/06 00:52:27 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/06 00:52:27 INFO mapred.JobClient: Launched map tasks=1
12/03/06 00:52:27 INFO mapred.JobClient: Data-local map tasks=1
12/03/06 00:52:27 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15968
12/03/06 00:52:27 INFO mapred.JobClient: File Output Format Counters
12/03/06 00:52:27 INFO mapred.JobClient: Bytes Written=6615
12/03/06 00:52:27 INFO mapred.JobClient: FileSystemCounters
12/03/06 00:52:27 INFO mapred.JobClient: FILE_BYTES_READ=14296
12/03/06 00:52:27 INFO mapred.JobClient: HDFS_BYTES_READ=335597
12/03/06 00:52:27 INFO mapred.JobClient: FILE_BYTES_WRITTEN=73063
12/03/06 00:52:27 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=6615
12/03/06 00:52:27 INFO mapred.JobClient: File Input Format Counters
12/03/06 00:52:27 INFO mapred.JobClient: Bytes Read=335470
12/03/06 00:52:27 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 00:52:27 INFO mapred.JobClient: Reduce input groups=1
12/03/06 00:52:27 INFO mapred.JobClient: Map output materialized bytes=13906
12/03/06 00:52:27 INFO mapred.JobClient: Combine output records=0
12/03/06 00:52:27 INFO mapred.JobClient: Map input records=600
12/03/06 00:52:27 INFO mapred.JobClient: Reduce shuffle bytes=0
12/03/06 00:52:27 INFO mapred.JobClient: Reduce output records=6
12/03/06 00:52:27 INFO mapred.JobClient: Spilled Records=50
12/03/06 00:52:27 INFO mapred.JobClient: Map output bytes=13800
12/03/06 00:52:27 INFO mapred.JobClient: Combine input records=0
12/03/06 00:52:27 INFO mapred.JobClient: Map output records=25
12/03/06 00:52:27 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
12/03/06 00:52:27 INFO mapred.JobClient: Reduce input records=25
12/03/06 00:52:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/06 00:52:27 INFO input.FileInputFormat: Total input paths to process : 1
12/03/06 00:52:28 INFO mapred.JobClient: Running job: job_201203052259_0003
12/03/06 00:52:29 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 00:53:46 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 00:58:20 INFO mapred.JobClient: Job complete: job_201203052259_0003
12/03/06 00:58:20 INFO mapred.JobClient: Counters: 16
12/03/06 00:58:20 INFO mapred.JobClient: Job Counters
12/03/06 00:58:20 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=30407
12/03/06 00:58:20 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/06 00:58:20 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/06 00:58:20 INFO mapred.JobClient: Rack-local map tasks=1
12/03/06 00:58:20 INFO mapred.JobClient: Launched map tasks=1
12/03/06 00:58:20 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/03/06 00:58:20 INFO mapred.JobClient: File Output Format Counters
12/03/06 00:58:20 INFO mapred.JobClient: Bytes Written=340891
12/03/06 00:58:20 INFO mapred.JobClient: FileSystemCounters
12/03/06 00:58:20 INFO mapred.JobClient: FILE_BYTES_READ=130
12/03/06 00:58:21 INFO mapred.JobClient: HDFS_BYTES_READ=342212
12/03/06 00:58:21 INFO mapred.JobClient: FILE_BYTES_WRITTEN=22251
12/03/06 00:58:21 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=340891
12/03/06 00:58:21 INFO mapred.JobClient: File Input Format Counters
12/03/06 00:58:21 INFO mapred.JobClient: Bytes Read=335470
12/03/06 00:58:21 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 00:58:21 INFO mapred.JobClient: Map input records=600
12/03/06 00:58:21 INFO mapred.JobClient: Spilled Records=0
12/03/06 00:58:21 INFO mapred.JobClient: Map output records=600
12/03/06 00:58:21 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
C-0{n=21 c=[29.552, 33.073, 35.876, 36.375, 35.118, 32.761, 29.566, 26.983, 25.272, 24.967, 25.691, 28.252, 30.994, 33.088, 34.015, 34.349, 32.826, 31.053, 29.116, 27.975, 27.879, 28.103, 28.775, 30.585, 31.049, 31.652, 31.956, 31.278, 30.719, 29.901, 29.545, 30.207, 30.672, 31.366, 31.032, 31.567, 30.610, 30.204, 29.266, 29.753, 29.296, 29.930, 31.207, 31.191, 31.474, 32.154, 31.746, 30.771, 30.250, 29.807, 29.543, 29.397, 29.838, 30.489, 30.705, 31.503, 31.360, 30.827, 30.426, 30.399] r=[0.979, 3.352, 5.334, 5.851, 4.868, 3.000, 3.376, 4.812, 5.159, 5.596, 4.940, 4.793, 5.415, 5.014, 5.155, 4.262, 4.891, 5.475, 6.626, 5.691, 5.240, 4.385, 5.767, 7.035, 6.238, 6.349, 5.587, 6.006, 6.282, 7.483, 6.872, 6.952, 7.374, 8.077, 8.676, 8.636, 8.697, 9.066, 9.835, 10.148, 10.091, 10.175, 9.929, 10.241, 9.824, 10.128, 10.595, 9.799, 10.306, 10.036, 10.069, 10.058, 10.008, 10.335, 10.160, 10.249, 10.222, 10.081, 10.274, 10.145]}
Weight: Point:
……...
……..
…….
1.0: [27.414, 25.397, 26.460, 31.978, 26.125, 27.463, 30.489, 34.929, 27.558, 30.686, 27.511, 32.269, 32.834, 27.129, 24.991, 32.610, 25.387, 32.674, 34.607, 33.519, 29.012, 28.705, 32.116, 29.121, 26.424, 33.452, 33.623, 29.457, 35.025, 26.607, 34.442, 34.847, 28.897, 34.439, 32.011, 34.816, 27.773, 11.549, 20.219, 19.678, 14.715, 14.384, 15.556, 9.573, 10.636, 16.639, 17.236, 19.643, 18.317, 15.323, 19.106, 11.455, 16.888, 18.269, 11.583, 1
12/03/06 00:58:24 INFO driver.MahoutDriver: Program took 493470 ms
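The CanopyDriver line in the log above reports a Euclidean distance measure with t1: 80.0 and t2: 55.0. As a rough illustration of what the two thresholds control, here is a simplified single-pass canopy sketch in Python. It is an illustration only and is not Mahout's MapReduce implementation; the function name canopy_centers and the dictionary keys are hypothetical. A point joins every canopy whose center is closer than t1, and it founds a new canopy only if no existing center is closer than t2.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t1, t2, distance=euclidean):
    """Simplified single-pass canopy sketch (hypothetical helper). Requires t1 > t2.

    Each canopy is a dict holding the point that founded it ("center") and
    the points that fell within the loose threshold t1 ("members").
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    canopies = []
    for point in points:
        strongly_bound = False
        for canopy in canopies:
            d = distance(canopy["center"], point)
            if d < t1:
                canopy["members"].append(point)  # loosely covered by this canopy
            if d < t2:
                strongly_bound = True  # close enough that no new canopy is needed
        if not strongly_bound:
            canopies.append({"center": point, "members": [point]})
    return canopies
```

With the thresholds from the log this would be called as canopy_centers(rows, t1=80.0, t2=55.0), where rows is an assumed list of 60-value tuples read from the data file.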
After the Mahout job completed, the output was stored in HDFS as shown below:
js> #ls
Found 3 items
drwxr-xr-x - avkash supergroup 0 2012-03-06 01:05 /user/avkash/.oink
drwxr-xr-x - avkash supergroup 0 2012-03-06 00:52 /user/avkash/output
drwxr-xr-x - avkash supergroup 0 2012-03-06 00:49 /user/avkash/testdata
js> #ls /user/avkash/output
Found 3 items
drwxr-xr-x - avkash supergroup 0 2012-03-06 00:53 /user/avkash/output/clusteredPoints
drwxr-xr-x - avkash supergroup 0 2012-03-06 00:52 /user/avkash/output/clusters-0
drwxr-xr-x - avkash supergroup 0 2012-03-06 00:51 /user/avkash/output/data
Now let's analyze the Mahout cluster output using the clusterdump utility.
The clusterdump utility is run here with 3 parameters:
- --seqFileDir: the path to the folder that holds the clustering sequence files (in this case output/clusters-0)
- --pointsDir: the path to the folder that holds the clustered points (in this case output/clusteredPoints)
- --output: the path where you want the analysis result to be created.
- Note that --output writes the analysis result as a text file on the local machine, not on HDFS.
Run the command as below:
c:\Apps\dist\mahout\examples\bin>mahout clusterdump --seqFileDir output\clusters-0 --pointsDir output\clusteredPoints --output clusteranalyze.txt
"Running on hadoop, using HADOOP_HOME=c:\Apps\dist"
c:\Apps\dist\bin\hadoop jar c:\Apps\dist\mahout\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver clusterdump --seqFileDir output\clusters-0 --pointsDir output\clusteredPoints --output clusteranalyze.txt
12/03/06 21:05:53 WARN driver.MahoutDriver: No clusterdump.props found on classpath, will use command-line arguments only
12/03/06 21:05:53 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=clusteranalyze.txt, --pointsDir=output\clusteredPoints, --seqFileDir=output\clusters-0, --startPhase=0, --tempDir=temp}
12/03/06 21:05:55 INFO driver.MahoutDriver: Program took 2031 ms
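If you want to script this step, the invocation above can be assembled from its three parameters. The following Python sketch is only an illustration: the helper name clusterdump_command is hypothetical, and it assumes that mahout resolves to the launcher script in the current folder, as it does in the transcript above.

```python
import subprocess
import sys


def clusterdump_command(seq_file_dir, points_dir, output, mahout="mahout"):
    """Build the argument list for the clusterdump call used in this post."""
    return [
        mahout, "clusterdump",
        "--seqFileDir", seq_file_dir,
        "--pointsDir", points_dir,
        "--output", output,
    ]


if __name__ == "__main__":
    command = clusterdump_command(
        r"output\clusters-0", r"output\clusteredPoints", "clusteranalyze.txt"
    )
    print(" ".join(command))
    # Only execute when asked to. shell=True is an assumption: on Windows it
    # lets cmd.exe resolve the mahout launcher script.
    if "--run" in sys.argv:
        subprocess.run(" ".join(command), shell=True, check=True)
```

Run it from c:\Apps\dist\mahout\examples\bin so that the relative paths match the ones used above.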
If you now open the folder on your machine, you will find "clusteranalyze.txt" as below:
Opening clusteranalyze.txt shows the data as below:
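The cluster lines in the dump should have the same C-0{n=... c=[...] r=[...]} shape as the console output shown earlier in this post, where n is usually read as the number of points, c as the cluster center, and r as the radius. To work with those numbers programmatically, a minimal Python sketch could look like the following. The names parse_cluster and CLUSTER_RE are hypothetical, and the pattern assumes dense vectors printed on a single unwrapped line.

```python
import re

# Assumed line shape: C-0{n=21 c=[v1, v2, ...] r=[v1, v2, ...]}
CLUSTER_RE = re.compile(
    r"^(?P<id>[A-Z]+-\d+)\{n=(?P<n>\d+) c=\[(?P<c>[^\]]*)\] r=\[(?P<r>[^\]]*)\]\}$"
)


def to_floats(text):
    """Convert a comma-separated string of numbers to a list of floats."""
    return [float(value) for value in text.split(",") if value.strip()]


def parse_cluster(line):
    """Parse one cluster line into a dict, or return None if it does not match."""
    match = CLUSTER_RE.match(line.strip())
    if match is None:
        return None
    return {
        "id": match.group("id"),
        "n": int(match.group("n")),
        "center": to_floats(match.group("c")),
        "radius": to_floats(match.group("r")),
    }


if __name__ == "__main__":
    with open("clusteranalyze.txt") as handle:
        for line in handle:
            cluster = parse_cluster(line)
            if cluster is not None:
                print(cluster["id"], cluster["n"], len(cluster["center"]))
```

Lines that are not cluster headers, such as the "Weight: Point:" rows, are skipped because parse_cluster returns None for them.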
Cluster Dumper Reference:
- https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper
- https://cwiki.apache.org/MAHOUT/cluster-dumper.html
Comments
- Anonymous
November 05, 2012
I run into the following when trying to run Mahout on my Azure environment. I don't have much experience with Windows shell scripting, so please forgive me if it's something obvious:
c:\apps\dist\mahout\bin>mahout
Running here: c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop jar c:\apps\dist\mahout\bin\..\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver
Usage: java [-options] class [args...]
(to execute a class)
or java [-options] -jar jarfile [args...]
(to execute a jar file)
where options include:
-d32 use a 32-bit data model if available
-d64 use a 64-bit data model if available
-server to select the "server" VM
-hotspot is a synonym for the "server" VM [deprecated]
The default VM is server.
-cp <class search path of directories and zip/jar files>
-classpath <class search path of directories and zip/jar files>
A ; separated list of directories, JAR archives, and ZIP archives to search for class files.
-D<name>=<value> set a system property
-verbose[:class|gc|jni] enable verbose output
-version print product version and exit
-version:<value> require the specified version to run
-showversion print product version and continue
-jre-restrict-search | -no-jre-restrict-search include/exclude user private JREs in the version search
-? -help print this help message
-X print help on non-standard options
-ea[:<packagename>...|:<classname>]
-enableassertions[:<packagename>...|:<classname>] enable assertions with specified granularity
-da[:<packagename>...|:<classname>]
-disableassertions[:<packagename>...|:<classname>] disable assertions with specified granularity
-esa | -enablesystemassertions enable system assertions
-dsa | -disablesystemassertions disable system assertions
-agentlib:<libname>[=<options>] load native agent library <libname>, e.g. -agentlib:hprof see also, -agentlib:jdwp=help and -agentlib:hprof=help
-agentpath:<pathname>[=<options>] load native agent library by full pathname
-javaagent:<jarpath>[=<options>] load Java programming language agent, see java.lang.instrument
-splash:<imagepath> show splash screen with specified image
See java.sun.com/.../reference for more details.