Remote Spark Compute Context using PuTTY on Windows

If you are running Microsoft R Server or Microsoft R Client on a Windows computer with PuTTY installed, you can create a compute context that runs RevoScaleR functions from your local client in a distributed fashion on your Hadoop cluster. Use RxSpark to create the compute context, with additional arguments that specify your user name, the file-sharing directory where you have read and write access, the public-facing host name or IP address of your Hadoop cluster’s name node or of an edge node that will run the master processes, and any additional switches to pass to the ssh command (such as the -i flag if you are using a .ppk file for authentication).

RxSpark assumes that the directory containing plink.exe (part of PuTTY) is in your path. If it is not, you can specify the location of the PuTTY executables using the sshClientDir argument.
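
As a rough illustration of that check, the sketch below looks for plink.exe on the path and falls back to an explicit directory. The install location in puttyDir is a hypothetical placeholder, and treating an empty sshClientDir as "search the path" is an assumption based on the sshClientDir = "" shown in the error message in the Troubleshooting section.

```r
# Look for plink.exe on the PATH; Sys.which() returns "" when nothing is found.
plinkPath <- unname(Sys.which("plink"))

# Hypothetical install location, used only when plink is not on the PATH.
puttyDir <- "C:/Program Files/PuTTY"

# Assumption: an empty sshClientDir tells RxSpark to search the PATH.
mySshClientDir <- if (nzchar(plinkPath)) "" else puttyDir

# The file-copy step looks for pscp.exe as well, so warn if it is missing
# from the fallback directory.
if (!nzchar(plinkPath) && !file.exists(file.path(puttyDir, "pscp.exe"))) {
  message("pscp.exe was not found in ", puttyDir)
}

print(mySshClientDir)
```

The resulting value can then be passed as sshClientDir = mySshClientDir when calling RxSpark.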

In some cases, you may find that environment variables needed by Hadoop are not set in the remote sessions run on the sshHostname computer. This may be because a different profile or startup script is being read on ssh login. You can point RxSpark at the appropriate profile script with the sshProfileScript argument (this should be an absolute path).

Here is an example of using remote Spark Compute Context:
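
A minimal sketch of such a compute context follows, built from the arguments described above. The user name, host name, key file and directories are all placeholder assumptions to replace with your own values, and the call is guarded so that the script still runs on a machine where RevoScaleR is not installed.

```r
# Placeholder values (assumptions); replace every one with your own settings.
mySshUsername <- "sshuser"
mySshHostname <- "myedgenode.example.com"   # public-facing host name or IP
myKeyFile     <- "C:\\Users\\azureuser\\Documents\\privateKey.ppk"

# Extra switches passed to the ssh client; -i names the PuTTY private key.
mySshSwitches <- paste("-i", myKeyFile)

# Directories where the user has read and write access (assumed layout).
myShareDir     <- paste("/var/RevoShare", mySshUsername, sep = "/")
myHdfsShareDir <- paste("/user/RevoShare", mySshUsername, sep = "/")

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)
  mySparkCluster <- RxSpark(
    hdfsShareDir  = myHdfsShareDir,
    shareDir      = myShareDir,
    sshUsername   = mySshUsername,
    sshHostname   = mySshHostname,
    sshSwitches   = mySshSwitches,
    # sshProfileScript = "/home/sshuser/.bash_profile",  # optional; hypothetical path
    consoleOutput = TRUE)
  # Make the remote Spark context the active compute context.
  rxSetComputeContext(mySparkCluster)
} else {
  message("RevoScaleR is not installed; skipping creation of the compute context.")
}
```

Once the context is set, subsequent RevoScaleR calls run on the cluster rather than on the local client.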

[Image: pu1]

Troubleshooting

 

Using RClient 3.3.2/MRS 9.0.1, you might get the following error messages while running the above program:

 Error in rxFindWindowsSsh("plink.exe", "ssh.exe", sshClientDir = "") : Could not find neither plink.exe nor ssh.exe.

 

 plink: unknown option \"-p\"

 

We have released a hotfix that addresses the above error messages. You can download the fix and instructions here.
 

A few other possible error messages and troubleshooting steps:

 Error in checkSshReturnMessage(printOut) : Access denied. Please validate the host connection credential file.

 

 Error in checkSshReturnMessage(printOut) : Host key verification failed. Please validate host connection using the specified ssh tool directly and try again.

 

 Error: Fail to fetch remote MRS version; text output is: Unable to open connection:, Host does not exist

 

If you get the above error messages, ensure the following:

  1. The public key is set properly in the /home/sshuser/.ssh/authorized_keys file. Check for extra spaces, newlines, or other stray characters, and remove them.
  2. The key comment of the public key does not contain a space, since OpenSSH's SSH-2 key format contains no space for a comment.
  3. sshuser has read access to /home/sshuser/.ssh/authorized_keys.
  4. azureuser has read access to C:\Users\azureuser\Documents\privateKey.ppk.
  5. The connection to the edge node succeeds when you use PuTTY directly.
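
The client-side items in this checklist (4 and 5) can be checked from R before RxSpark is called. The sketch below is one possible way to do that, not part of RevoScaleR: the host name is a hypothetical placeholder, the user names and key path are taken from the checklist, and file.access() is only an approximate test of permissions on Windows.

```r
# Placeholder values; the host name is hypothetical.
keyFile <- "C:\\Users\\azureuser\\Documents\\privateKey.ppk"
sshUser <- "sshuser"
sshHost <- "myedgenode.example.com"

# file.access() returns 0 when the requested access is allowed
# (mode 4 tests read permission).
keyReadable <- file.exists(keyFile) && file.access(keyFile, mode = 4) == 0
message("Private key readable: ", keyReadable)

plink <- Sys.which("plink")
if (nzchar(plink) && keyReadable) {
  # -batch disables interactive prompts, so an unknown host key or a
  # rejected key is reported as an error instead of leaving the session waiting.
  out <- system2(plink,
                 c("-batch", "-i", shQuote(keyFile, type = "cmd"),
                   paste0(sshUser, "@", sshHost), "echo connected"),
                 stdout = TRUE, stderr = TRUE)
  print(out)
} else {
  message("Skipping the connection test: plink.exe or the key file was not found.")
}
```

If the plink call reports a host key problem, connecting once interactively with PuTTY and accepting the host key is the usual way to cache it.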

Here is a detailed document on how to Use SSH with HDInsight (Hadoop) from PuTTY on Windows.
Please use the comments section in this article to report any other SSH issues when using RxSpark.

Comments

  • Anonymous
    May 24, 2017
    Running the following gives an error: system.time( modelLocal <- rxLogit(formula, data = airOnTimeDataLocal)) Error in rxFindWindowsSsh("pscp.exe", "scp.exe", sshClientDir) : Could not find neither pscp.exe nor scp.exe. I've configured RStudio ... and can successfully run rxHadoopListFiles("/target/on/hdfs")
    • Anonymous
      May 24, 2017
      Do you have pscp.exe in C:\Program Files (x86)\PuTTY?