Apache Spark guidelines
This article points to guidance for common Apache Spark tasks on Azure HDInsight: running, debugging, and tuning jobs, connecting to other Azure services, and choosing storage.
How do I run or submit Spark jobs?
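One common route for remote submission is the Apache Livy REST endpoint that HDInsight Spark clusters expose. The following is a minimal sketch that only builds the batch request and does not send it. The helper name `build_livy_batch_request`, the cluster name, the storage path, and the class name are all hypothetical placeholders for illustration.

```python
import json


def build_livy_batch_request(cluster_name, app_file, class_name=None, args=None):
    """Build the URL, headers, and JSON body for a Livy batch submission.

    Nothing is sent over the network; pass the result to an HTTP client
    of your choice together with the cluster login credentials.
    """
    url = f"https://{cluster_name}.azurehdinsight.net/livy/batches"
    headers = {
        "Content-Type": "application/json",
        # Livy's CSRF protection expects this header on POST requests.
        "X-Requested-By": "admin",
    }
    body = {"file": app_file}
    if class_name:
        body["className"] = class_name
    if args:
        body["args"] = list(args)
    return url, headers, json.dumps(body)


# Hypothetical values, for illustration only.
url, headers, payload = build_livy_batch_request(
    "mycluster",
    "wasbs:///example/jars/app.jar",
    class_name="com.example.App",
    args=["10"],
)
print(url)
print(payload)
```

Keeping request construction separate from the HTTP call makes the payload easy to inspect before anything reaches the cluster.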
How do I monitor and debug Spark jobs?
Option | Documents |
---|---|
Azure Toolkit for IntelliJ | Failure spark job debugging with Azure Toolkit for IntelliJ (preview) |
Azure Toolkit for IntelliJ through SSH | Debug Apache Spark applications locally or remotely on an HDInsight cluster with Azure Toolkit for IntelliJ through SSH |
Azure Toolkit for IntelliJ through VPN | Use Azure Toolkit for IntelliJ to debug Apache Spark applications remotely in HDInsight through VPN |
Job graph on Apache Spark History Server | Use extended Apache Spark History Server to debug and diagnose Apache Spark applications |
How do I make my Spark jobs run more efficiently?
Option | Documents |
---|---|
IO Cache | Improve performance of Apache Spark workloads using Azure HDInsight IO Cache (Preview) |
Configuration options | Optimize Apache Spark jobs |
How do I connect to other Azure Services?
Option | Documents |
---|---|
Apache Hive on HDInsight | Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector |
Apache HBase on HDInsight | Use Apache Spark to read and write Apache HBase data |
Apache Kafka on HDInsight | Tutorial: Use Apache Spark Structured Streaming with Apache Kafka on HDInsight |
Azure Cosmos DB | Azure Synapse Link for Azure Cosmos DB |
What are my storage options?
Option | Documents |
---|---|
Azure Data Lake Storage Gen2 | Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters |
Azure Blob Storage | Use Azure storage with Azure HDInsight clusters |
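Spark jobs address these two stores through different URI schemes: `abfs://` (or `abfss://` over TLS) for Azure Data Lake Storage Gen2 and `wasbs://` for Azure Blob Storage. The sketch below shows how the URIs are assembled. The helper names, account, container, and paths are hypothetical and chosen only for illustration.

```python
def adls_gen2_uri(container, account, path=""):
    """Return an abfs:// URI for a path in an ADLS Gen2 container."""
    return f"abfs://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def blob_storage_uri(container, account, path=""):
    """Return a wasbs:// URI for a path in an Azure Blob Storage container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical account and container names, for illustration only.
gen2 = adls_gen2_uri("data", "mystorageacct", "/raw/events.csv")
blob = blob_storage_uri("data", "mystorageacct", "raw/events.csv")
print(gen2)
print(blob)
```

A job would then pass either string to a reader, for example `spark.read.csv(gen2)`, assuming the cluster has access to that storage account.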