Known issues for Apache Spark cluster on HDInsight
This document keeps track of all the known issues for the HDInsight Spark public preview.
Apache Livy leaks interactive session
When Apache Livy restarts (from Apache Ambari or because the headnode 0 virtual machine rebooted) while an interactive session is still alive, the interactive job session is leaked. As a result, new jobs can be stuck in the Accepted state.
Mitigation:
Use the following procedure to work around the issue:
SSH into the headnode. For more information, see Use SSH with HDInsight.
Run the following command to find the application IDs of the interactive jobs started through Livy.
yarn application --list
The default job name is Livy if the job was started with a Livy interactive session and no explicit name was specified. For a Livy session started by a Jupyter Notebook, the job name starts with remotesparkmagics_*. Run the following command to kill those jobs.
yarn application --kill <Application ID>
New jobs start running.
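If you prefer to script this cleanup instead of running the yarn CLI by hand, you can do the same thing through the YARN ResourceManager REST API. The following is a minimal Python sketch, assuming it runs on the headnode where the ResourceManager is reachable through the headnodehost alias; verify the host, port, and job-naming conventions for your cluster before relying on it.

import requests

# Minimal sketch, assuming the YARN ResourceManager REST API is reachable
# through the headnodehost alias on the headnode.
rm = "http://headnodehost:8088"

# Find running applications that look like leaked Livy sessions.
response = requests.get(rm + "/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (response.get("apps") or {}).get("app", []):
    if app["name"] == "Livy" or app["name"].startswith("remotesparkmagics"):
        print("Killing leaked session", app["id"], app["name"])
        # Cluster Application State API: ask the ResourceManager to kill the app.
        requests.put(rm + "/ws/v1/cluster/apps/" + app["id"] + "/state",
                     json={"state": "KILLED"})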
Spark History Server not started
The Spark History Server is not started automatically after a cluster is created.
Mitigation:
Manually start the history server from Ambari.
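If you want to script this rather than click through the Ambari UI, the history server can also be started through Ambari's REST API. Below is a minimal Python sketch; the component name SPARK_JOBHISTORYSERVER, the placeholder cluster name, headnode FQDN, and credentials are assumptions you must adapt to your cluster.

import requests

# Minimal sketch, assuming the Spark history server is registered in Ambari
# as the SPARK_JOBHISTORYSERVER component on the given headnode.
ambari = "https://CLUSTERNAME.azurehdinsight.net/api/v1/clusters/CLUSTERNAME"
auth = ("admin", "PASSWORD")  # HDInsight cluster login (gateway) credentials

requests.put(
    ambari + "/hosts/HEADNODE_FQDN/host_components/SPARK_JOBHISTORYSERVER",
    auth=auth,
    headers={"X-Requested-By": "ambari"},  # header required by the Ambari REST API
    json={"RequestInfo": {"context": "Start Spark History Server"},
          "HostRoles": {"state": "STARTED"}})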
Permission issue in Spark log directory
hdiuser gets the following error when submitting a job using spark-submit:
java.io.FileNotFoundException: /var/log/spark/sparkdriver_hdiuser.log (Permission denied)
No driver log is written.
Mitigation:
- Add hdiuser to the Hadoop group.
- Provide 777 permissions on the /var/log/spark directory after cluster creation (see the sketch after this list).
- Update the Spark log location by using Ambari to be a directory with 777 permissions.
- Run spark-submit as sudo.
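As an illustration of the first two workarounds, here is a minimal Python sketch that checks whether hdiuser is in the hadoop group and opens up the log directory. It assumes it is run as root on the headnode (for example, via sudo python).

import grp
import os

# Minimal sketch, assuming it runs as root on the headnode.
log_dir = "/var/log/spark"

# Workaround 1: check whether hdiuser is listed as a member of the hadoop group.
if "hdiuser" not in grp.getgrnam("hadoop").gr_mem:
    print("hdiuser is not in the hadoop group; add it (for example, usermod -aG hadoop hdiuser)")

# Workaround 2: give 777 permissions on the Spark log directory.
os.chmod(log_dir, 0o777)
print("Permissions on", log_dir, "set to 777")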
Spark-Phoenix connector is not supported
HDInsight Spark clusters do not support the Spark-Phoenix connector.
Mitigation:
Use the Spark-HBase connector instead. For instructions, see How to use Spark-HBase connector.
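As an illustration, here is a minimal PySpark sketch of reading an HBase table through the Spark-HBase connector. It assumes the connector package (shc-core) is on the cluster's classpath, that spark is the SparkSession the HDInsight kernels provide, and that a hypothetical HBase table named Contacts with an Office column family exists; all of these are assumptions for illustration only.

# Minimal sketch, assuming the shc-core connector is available and the
# hypothetical HBase table Contacts exists.
catalog = """{
    "table": {"namespace": "default", "name": "Contacts"},
    "rowkey": "key",
    "columns": {
        "rowkey": {"cf": "rowkey", "col": "key", "type": "string"},
        "officeAddress": {"cf": "Office", "col": "Address", "type": "string"}
    }
}"""

df = (spark.read
      .options(catalog=catalog)
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load())
df.show()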
Issues related to Jupyter Notebooks
Following are some known issues related to Jupyter Notebooks.
Notebooks with non-ASCII characters in filenames
Do not use non-ASCII characters in Jupyter Notebook filenames. If you try to upload a file with a non-ASCII filename through the Jupyter UI, the upload fails silently: Jupyter does not upload the file, and it does not show an error message either.
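If you script notebook uploads, a quick guard like the following Python check (assuming Python 3.7+ for str.isascii) can catch offending filenames before Jupyter silently rejects them.

# Flags filenames that the Jupyter UI silently refuses to upload.
def is_safe_notebook_name(filename):
    return filename.isascii()

print(is_safe_notebook_name("analysis.ipynb"))   # True
print(is_safe_notebook_name("análisis.ipynb"))   # False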
Error while loading notebooks of larger sizes
You might see the error Error loading notebook when you load a notebook that is large in size.
Mitigation:
If you get this error, it does not mean your data is corrupt or lost. Your notebooks are still on disk in /var/lib/jupyter, and you can SSH into the cluster to access them. For more information, see Use SSH with HDInsight.
Once you have connected to the cluster using SSH, copy your notebooks from the cluster to your local machine (using SCP or WinSCP) as a backup to prevent the loss of any important data. You can then SSH tunnel into your headnode at port 8001 to access Jupyter without going through the gateway. From there, clear the output of your notebook and resave it to minimize its size.
To prevent this error in the future, follow these best practices:
- Keep the notebook size small. Any output from your Spark jobs that is sent back to Jupyter is persisted in the notebook. It is a best practice with Jupyter in general to avoid running .collect() on large RDDs or dataframes. Instead, if you want to peek at an RDD's contents, consider running .take() or .sample() so that your output doesn't get too large (see the sketch after this list).
- When you save a notebook, clear all output cells to reduce the size.
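For example, here is a minimal PySpark sketch of peeking at a large DataFrame without pulling it all back into the notebook; spark is the SparkSession that the HDInsight notebook kernels provide, and the sizes are illustrative.

# A large example DataFrame; calling .collect() on it would persist
# millions of rows of output into the notebook file.
df = spark.range(0, 10000000)

# Pull back only the first 10 rows...
print(df.take(10))

# ...or a small random sample (roughly 0.1% of the rows).
print(df.sample(withReplacement=False, fraction=0.001).take(10))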
Notebook initial startup takes longer than expected
The first code statement in a Jupyter Notebook using the Spark magic could take more than a minute to complete.
Explanation:
When the first code cell is run, the session configuration is initiated in the background, and the Spark, SQL, and Hive contexts are set. Because all of this happens before the first statement actually runs, it gives the impression that the statement took a long time to complete.
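Because the session is created lazily when that first cell runs, any session-level settings must be declared before it. For example, the HDInsight kernels support a %%configure magic for this (see Kernels available for Jupyter Notebook in Apache Spark cluster for HDInsight); the resource values below are placeholders to adapt.

%%configure -f
{"executorMemory": "4G", "executorCores": 2, "numExecutors": 4}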
Jupyter Notebook timeout in creating the session
When the Spark cluster is out of resources, the Spark and PySpark kernels in the Jupyter Notebook time out while trying to create the session.
Mitigations:
Free up some resources in your Spark cluster by:
- Stopping other Spark notebooks by going to the Close and Halt menu or clicking Shutdown in the notebook explorer.
- Stopping other Spark applications from YARN (see the sketch following this list).
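To see which applications are holding resources, you can query the YARN ResourceManager REST API. The following minimal Python sketch assumes it runs on the headnode where the headnodehost alias resolves; you can then kill a chosen application with yarn application --kill as shown earlier.

import requests

# Minimal sketch, assuming the ResourceManager REST API is reachable
# through the headnodehost alias on the headnode.
rm = "http://headnodehost:8088"
response = requests.get(rm + "/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (response.get("apps") or {}).get("app", []):
    # allocatedMB reports the memory currently held by each application.
    print(app["id"], app["name"], str(app["allocatedMB"]) + " MB")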
Restart the notebook you were trying to start up. Enough resources should be available for you to create a session now.
See also
Scenarios
- Apache Spark with BI: Perform interactive data analysis using Spark in HDInsight with BI tools
- Apache Spark with Machine Learning: Use Spark in HDInsight for analyzing building temperature using HVAC data
- Apache Spark with Machine Learning: Use Spark in HDInsight to predict food inspection results
- Website log analysis using Apache Spark in HDInsight
Create and run applications
- Create a standalone application using Scala
- Run jobs remotely on an Apache Spark cluster using Apache Livy
Tools and extensions
- Use HDInsight Tools Plugin for IntelliJ IDEA to create and submit Spark Scala applications
- Use HDInsight Tools Plugin for IntelliJ IDEA to debug Apache Spark applications remotely
- Use Apache Zeppelin notebooks with an Apache Spark cluster on HDInsight
- Kernels available for Jupyter Notebook in Apache Spark cluster for HDInsight
- Use external packages with Jupyter Notebooks
- Install Jupyter on your computer and connect to an HDInsight Spark cluster