Apache Spark SQL Query Returns No Results for a Column in Azure Synapse Notebook.

madmax 0 Reputation points
2025-02-20T21:18:01.15+00:00

I'm running an Apache Spark SQL query in an Azure Synapse Notebook to retrieve data from an Azure Synapse table.

%%pyspark
df = spark.sql(""" 
    SELECT scheduledend 
    FROM `database`.`email` 
    WHERE scheduledend IS NOT NULL
""")
df.show(truncate = False)

The PySpark query returns zero rows.

The datatype of the column in the database is datetime2; I have tried casting the column to different datatypes within the PySpark code.

However, the same query works fine when run directly against the database (in SSMS, connected to Synapse) and returns the expected results (4569 rows).

SELECT scheduledend FROM database.email WHERE scheduledend IS NOT NULL

1 answer

  1. Chandra Boorla 8,795 Reputation points Microsoft Vendor
    2025-02-21T04:00:46.19+00:00

    Hi @madmax

    Thank you for posting your query!

    As I understand it, your Apache Spark SQL query in an Azure Synapse Notebook is returning zero rows, even though the same query works fine in SSMS.

    Here are some considerations that might help you.

    The table is in a Dedicated SQL Pool - The Spark catalog in Synapse only sees Spark (lake) databases; if the email table lives in a Dedicated SQL Pool, Spark cannot query it directly. Instead, read it over a JDBC connection.

    # Read the table over JDBC from the dedicated SQL pool (placeholders in angle brackets)
    df = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:sqlserver://<your-synapse-workspace>.database.windows.net:1433;database=<your_database>") \
        .option("dbtable", "dbo.email") \
        .option("user", "<your_username>") \
        .option("password", "<your_password>") \
        .load()
    
    df.show()
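
    As an alternative to raw JDBC, Synapse Spark 3 pools also include a built-in Dedicated SQL Pool connector that authenticates with your workspace identity, so no SQL username or password is needed. A minimal sketch, assuming the table lives in a dedicated SQL pool database (the server, database, schema, and table names are placeholders):

    # Requires a Synapse Spark 3 pool; names below are placeholders.
    import com.microsoft.spark.sqlanalytics
    from com.microsoft.spark.sqlanalytics.Constants import Constants

    df = (spark.read
          # Optional: set the dedicated SQL endpoint explicitly; otherwise it is
          # inferred from the database part of the three-part table name.
          .option(Constants.SERVER, "<your-synapse-workspace>.sql.azuresynapse.net")
          # Three-part name: <database>.<schema>.<table>
          .synapsesql("<your_database>.dbo.email"))

    df.show(truncate=False)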
    

    Database Context Mismatch - Your Synapse Notebook may not be connected to the correct database. Try explicitly setting the database before running the query.

    spark.sql("USE your_database")
    

    Also, confirm you are pointing at the correct database and use the fully qualified table name.

    # Spark's default catalog uses two-part names (database.table);
    # the dbo schema from the SQL side is not part of the Spark-visible name.
    df = spark.sql("""
        SELECT scheduledend
        FROM `your_database`.`email`
        WHERE scheduledend IS NOT NULL
    """)
    df.show(truncate=False)
    

    Check for Data Partitioning or Filters - If the table is partitioned, or if filters are applied elsewhere in the Spark session (for example in an upstream view or cached DataFrame), the query can return zero rows even though the data exists in the source. Verify that no additional filters or partition pruning are affecting the results.
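
    A quick way to narrow this down is to compare the row count Spark sees with and without the filter, and to check how the column is typed on the Spark side (the table name below is a placeholder):

    # Distinguish "Spark sees no rows at all" from "the IS NOT NULL filter removes everything"
    tbl = spark.table("your_database.email")
    tbl.printSchema()                                      # how scheduledend is typed in Spark
    print(tbl.count())                                     # total rows visible to Spark
    print(tbl.filter("scheduledend IS NOT NULL").count())  # rows that survive the filter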

    I hope this information helps.

    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful". If you have any further queries, do let us know.

