Hadoop Performance: How do storage disk types in individual nodes impact job performance?

As you may already know, a Hadoop cluster is network and disk I/O intensive. Recently I ran a test scenario in which I replaced the SATA hard disks with high-performance SSDs while keeping the rest of the cluster hardware the same. I ran the TeraSort test to check whether the high-performance SSDs affected overall performance, and found that using SSDs instead of SATA disks improved test performance by ~20%.

 

After that, I searched the internet for other tests done in a similar fashion to see what the best practice in this direction might be. I found the following recommendations from Intel, based on choosing an appropriate combination of disk throughput, in-memory caching, cluster deployment, and multi-CPU machines:

  • We found SSDs to be very effective for both read and write operations.
  • In-memory caching gave better response times once the “HEAP CACHE” was sized to achieve a higher cache hit percentage.
  • The cluster environment served requests faster, although “CPU I/O WAIT” spikes were noticed.
  • Overall, most of the CPUs remained idle during the test.

 

In a test demonstrated by Intel, going from two to four disks in a node (doubling the I/O) had the following impact:

- The job completed in half the time it took with the previous I/O configuration.

- A 10% increase in server cost increased sort performance by 100%.
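
To make the trade-off concrete, here is a minimal sketch of the arithmetic behind those two numbers. The function name and the simplifying assumption that cost per job scales as server cost divided by throughput are mine, not Intel's:

```python
def relative_cost_per_job(cost_increase: float, throughput_increase: float) -> float:
    """Cost per job after an upgrade, relative to the baseline (baseline = 1.0).

    Assumes cost per job is proportional to server cost / throughput,
    which ignores power, licensing, and other operating costs.
    """
    return (1.0 + cost_increase) / (1.0 + throughput_increase)


# Figures from the Intel test above: +10% server cost, +100% sort performance.
ratio = relative_cost_per_job(cost_increase=0.10, throughput_increase=1.00)
print(f"relative cost per job: {ratio:.2f}")
```

Under that assumption, each sort job costs roughly 55% of what it did before the extra disks were added.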

 

If we consider the MapReduce local directory, where intermediate map output files are stored locally, adding multiple identical disks to this mount could improve performance. Replacing SATA with SSD or PCIe-based flash cards can improve I/O for certain jobs. Performance gains vary by workload; strictly speaking, though, this increases the per-server cost while decreasing the cost per job or transaction.
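
As a rough illustration of spreading the local directory over several disks, the sketch below builds the comma-separated directory list that Hadoop accepts for its local-directory setting. The mount points and subdirectory are hypothetical, and the property name shown (`mapreduce.cluster.local.dir`, formerly `mapred.local.dir`) may differ on your distribution; on YARN clusters the NodeManager's `yarn.nodemanager.local-dirs` typically plays this role instead.

```python
def local_dir_value(mount_points, subdir="mapred/local"):
    """Join one directory per physical disk into a comma-separated list."""
    return ",".join(f"{m.rstrip('/')}/{subdir}" for m in mount_points)


# Hypothetical mount points, one per physical disk in the node.
mounts = ["/data/1", "/data/2", "/data/3", "/data/4"]
value = local_dir_value(mounts)

# The resulting string would go into mapred-site.xml as the property value.
print(f"mapreduce.cluster.local.dir={value}")
```

Listing one directory per physical disk lets Hadoop spread intermediate files across the disks, which is what allows the extra spindles or SSDs to add up to more aggregate I/O.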

 

https://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/

Comments

  • Anonymous
    May 18, 2015
    Great analysis. Have you done any more analyses in this area? We are configuring a cluster to perform Impala queries on Parquet files and are looking for the fastest cluster configuration.