Thanks for reaching out to Microsoft Q&A.
Create a Linked Service for Azure SQL Database Managed Instance:
In your Synapse workspace, set up a linked service that points to your Azure SQL Database Managed Instance. This linked service will allow you to establish a connection between Synapse and your SQL Database.
Create a Spark Pool:
- Now, create a Spark pool within your Synapse workspace. A Spark pool defines the compute resource requirements for Spark instances.
- You can configure properties such as the number of nodes, node size, and scaling behavior.
- Choose from various node sizes, ranging from small to XXLarge, based on your workload needs.
- If you require a high degree of isolation, consider the Isolated Compute option, available with the XXXLarge node size.
Attach the Spark Pool to Your Synapse Notebook:
- Open your Synapse Notebook.
- Navigate to the Compute tab.
- Select Attached computes and choose Synapse Spark pool.
- This step associates your notebook with the Spark pool, enabling you to execute Spark jobs against your data sources, including the Azure SQL Database Managed Instance.
Write Spark Code in Your Notebook:
- Import the SQL Server JDBC Driver:
Synapse Spark pools don't have the SQL Server JDBC driver pre-installed. You'll need to import it into your notebook. You can upload the driver JAR file to your Synapse workspace and reference it in your code. - Create Spark DataFrame:
Use thespark.read.jdbc
function to create a Spark DataFrame from your SQL Database Managed Instance. Here's an example:
Python
from pyspark.sql import SparkSession
# Replace with your linked service name
linked_service_name = "your_linked_service_name"
# Retrieve connection string from linked service using Synapse library
access_token = mssparkutils.credentials.getConnectionStringOrCreds(linked_service_name)
# Build the JDBC URL
jdbc_url = f"jdbc:sqlserver://<server_name>.database.windows.net:<port>;databaseName=<database_name>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30"
# Create Spark DataFrame
df = spark.read \
.format("com.microsoft.sqlserver.jdbc.spark") \
.option("url", jdbc_url) \
.option("databaseName", "<database_name>") \
.option("accessToken", access_token) \
.option("dbtable", "<table>") \
.load()
Replace the placeholders with:
<server_name>
: Your Azure SQL Database Managed Instance server name.
<port>
: The port number of your Managed Instance (usually 1433).
<database_name>
: The name of the database you want to access.
<table>
: The name of the table you want to read data from
Process the DataFrame:
Use Spark functionalities to process the DataFrame (df
) as needed. You can filter, transform, or join data before further analysis.
View or Persist Data:
- Use
display(df)
to view the DataFrame contents in your notebook. - Use other Spark functions like
write
to save the data to different storage options (e.g., Data Lake Storage).
Important Note:
Currently, Synapse Spark pools don't directly support connecting to Azure SQL Database Managed Instances using linked services. The workaround is to use the connection string retrieved from the linked service and build the JDBC URL manually.
For more details and alternative approaches, refer to these resources:
- Microsoft Documentation: https://learn.microsoft.com/en-us/azure/synapse-analytics/synapse-link/connect-synapse-link-sql-database
Hope this helps. Do let us know if you any further queries.
If this answers your query, do click Accept Answer
and Yes
for was this answer helpful. And, if you have any further query do let us know.