Apache Hadoop on Windows Azure: Few tips and tricks to manage your Hadoop cluster in Windows Azure
In a Hadoop cluster, the namenode communicates with all the other nodes. Apache Hadoop on Windows Azure has the following XML files, which contain the primary settings for Hadoop:
C:\Apps\Dist\conf\HDFS-SITE.XML
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <!-- This is the namenode data directory -->
    <name>dfs.name.dir</name>
    <value>c:\hdfs\nn</value>
  </property>
  <property>
    <!-- This is the datanode data directory -->
    <name>dfs.data.dir</name>
    <value>c:\hdfs\dn</value>
  </property>
</configuration>
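The property/name/value layout above is the standard Hadoop site-file format, so the settings can be inspected with any XML parser. Below is a minimal sketch using Python's standard-library ElementTree; the inline XML string mirrors the file above, since on a real cluster you would read C:\Apps\Dist\conf\hdfs-site.xml instead:

```python
# Minimal sketch: parse a Hadoop *-site.xml file into a {name: value} dict.
import xml.etree.ElementTree as ET

# Abbreviated copy of the hdfs-site.xml shown above (illustration only).
HDFS_SITE = """<?xml version="1.0"?>
<configuration>
  <property><name>dfs.permissions</name><value>false</value></property>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.name.dir</name><value>c:\\hdfs\\nn</value></property>
</configuration>"""

def parse_site_xml(xml_text):
    """Collect every <property> as a name -> value mapping."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

props = parse_site_xml(HDFS_SITE)
print(props["dfs.replication"])  # → 3
```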
C:\Apps\Dist\conf\Core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <!-- After the role starts, the VM gets an IP address, which is included here -->
    <name>fs.default.name</name>
    <value>hdfs://10.26.104.45:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
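Because the VM's IP address can change each time the role starts, the `fs.default.name` value has to be rewritten with the current address. The sketch below is purely illustrative of that rewrite step, not the actual Hadoop-on-Azure startup logic; the `update_default_fs` helper and the inline XML are assumptions for the example:

```python
# Illustrative sketch: patch fs.default.name with a freshly assigned IP,
# the kind of update the role startup must perform on core-site.xml.
import xml.etree.ElementTree as ET

# Abbreviated copy of the core-site.xml shown above (illustration only).
CORE_SITE = """<configuration>
  <property><name>fs.default.name</name><value>hdfs://10.26.104.45:9000</value></property>
  <property><name>io.file.buffer.size</name><value>131072</value></property>
</configuration>"""

def update_default_fs(xml_text, new_ip, port=9000):
    """Return the config XML with fs.default.name pointed at new_ip:port."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "fs.default.name":
            prop.find("value").text = "hdfs://%s:%d" % (new_ip, port)
    return ET.tostring(root, encoding="unicode")

updated = update_default_fs(CORE_SITE, "10.26.104.99")
print("hdfs://10.26.104.99:9000" in updated)  # → True
```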
C:\Apps\Dist\conf\Mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>10.26.104.45:9010</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/hdfs/mapred/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapreduce.client.tasklog.timeout</name>
    <value>6000000</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>6000000</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.connect.timeout</name>
    <value>600000</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.read.timeout</name>
    <value>600000</value>
  </property>
</configuration>
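The `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum` settings cap the concurrent tasks on each tasktracker, so total cluster capacity is simply the per-node limit multiplied by the number of worker nodes. A quick sketch of that arithmetic (the 4-node count is a made-up example, not from the article):

```python
# Sketch: cluster-wide task slots implied by the per-tasktracker limits above.
MAP_SLOTS_PER_NODE = 2      # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS_PER_NODE = 1   # mapred.tasktracker.reduce.tasks.maximum

def cluster_slots(worker_nodes):
    """Return (total map slots, total reduce slots) for the cluster."""
    return (worker_nodes * MAP_SLOTS_PER_NODE,
            worker_nodes * REDUCE_SLOTS_PER_NODE)

# Hypothetical 4-worker cluster: 8 map slots, 4 reduce slots.
print(cluster_slots(4))  # → (8, 4)
```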
You can certainly change the settings above; however, after doing so you would need to restart the affected daemons. Note that the command below does not merely restart the namenode, it formats the DFS filesystem (erasing existing HDFS metadata), which is required when the namenode or datanode directories change:
- C:\Apps\Dist> hadoop namenode -format
For more commands, check the Hadoop command-line help:
c:\apps\dist>hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl>   copy file or directories recursively
  archive -archiveName NAME <src>* <dest>   create a hadoop archive
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
You can also modify the following Java logging configuration; however, you would need to re-launch the Java processes for the changes to take effect:
C:\Apps\Dist\conf\Log4j.properties:
hadoop.log.file=hadoop.log
log4j.rootLogger=${hadoop.root.logger}, EventCounter
log4j.threshhold=ALL

#
# TaskLog Appender
#

# Default values
hadoop.tasklog.taskid=null
hadoop.tasklog.noKeepSplits=4
hadoop.tasklog.totalLogFileSize=100
hadoop.tasklog.purgeLogSplits=true
hadoop.tasklog.logsRetainHours=12

log4j.appender.TLA=org.apache.hadoop.mapred.TaskLogAppender
log4j.appender.TLA.taskId=${hadoop.tasklog.taskid}
log4j.appender.TLA.totalLogFileSize=${hadoop.tasklog.totalLogFileSize}

log4j.appender.TLA.layout=org.apache.log4j.PatternLayout
log4j.appender.TLA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

# FSNamesystem Audit logging
log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN

# Custom Logging levels
#log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG
#log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG
#log4j.logger.org.apache.hadoop.fs.FSNamesystem=DEBUG

# Jets3t library
log4j.logger.org.jets3t.service.impl.rest.httpclient.RestS3Service=ERROR

# Event Counter Appender
# Sends counts of logging messages at different severity levels to Hadoop Metrics.
log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
Resources:
https://hadoop.apache.org/common/docs/current/cluster_setup.html
https://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/