Query data
Querying data is the foundational step for performing nearly all data-driven tasks in Azure Databricks. Regardless of the language or tool used, workloads start by defining a query against a table or other data source and then performing actions to gain insights from the data. This article outlines the core concepts and procedures for running queries across various Azure Databricks product offerings, and includes code examples you can adapt for your use case.
You can query data interactively using:
- Notebooks
- SQL editor
- File editor
- Dashboards
You can also run queries as part of Delta Live Tables pipelines or jobs.
For an overview of streaming queries on Azure Databricks, see Query streaming data.
What data can you query with Azure Databricks?
Azure Databricks supports querying data in multiple formats and enterprise systems. The data you query using Azure Databricks falls into one of two broad categories: data in a Databricks lakehouse and external data.
What data is in a Databricks lakehouse?
The Databricks Data Intelligence Platform stores all of your data in a Databricks lakehouse by default.
This means that when you run a basic CREATE TABLE statement to make a new table, you have created a lakehouse table. Lakehouse data has the following properties:
- Stored in the Delta Lake format.
- Stored in cloud object storage.
- Governed by Unity Catalog.
Most lakehouse data on Azure Databricks is registered in Unity Catalog as managed tables. Managed tables provide the easiest syntax and behave like other tables in most relational database management systems. Managed tables are recommended for most use cases and are suitable for all users who don’t want to worry about the implementation details of data storage.
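For example, a basic CREATE TABLE statement like the following creates a managed table. The catalog, schema, table, and column names are placeholders; this is a minimal sketch you can adapt and run from a notebook:

```python
# A minimal sketch: create a managed table and query it by name.
# "main.default.sales_example" and its columns are placeholder names.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.default.sales_example (
    order_id BIGINT,
    amount DOUBLE
  )
""")

# Because no LOCATION is specified, Databricks manages the underlying
# Delta files and governs the table with Unity Catalog.
display(spark.read.table("main.default.sales_example"))
```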
An unmanaged table, or external table, is a table registered with a LOCATION specified. The term external can be misleading, as external Delta tables are still lakehouse data. Unmanaged tables might be preferred by users who directly access tables from other Delta reader clients. For an overview of differences in table semantics, see What are tables and views?.
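As an illustration, the following sketch registers an external table; the table name and cloud path are hypothetical, and the path must be covered by storage credentials you can access:

```python
# A sketch of registering an external (unmanaged) table. The LOCATION is a
# placeholder path; it must fall under a storage location you are allowed
# to access.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.default.sales_external
  USING DELTA
  LOCATION 'abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/sales'
""")
```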
Some legacy workloads might exclusively interact with Delta Lake data through file paths and not register tables at all. This data is still lakehouse data, but can be more difficult to discover because it’s not registered to Unity Catalog.
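You can still read such data directly by path, as in this sketch with a placeholder path:

```python
# Read unregistered Delta Lake data directly from a storage path
# (placeholder path); no metastore entry is required.
df = spark.read.format("delta").load(
    "abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/delta-table"
)
df.show(5)
```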
Note
Your workspace administrator might not have upgraded your data governance to use Unity Catalog. You can still get many of the benefits of a Databricks lakehouse without Unity Catalog, but not all functionality listed in this article or throughout the Azure Databricks documentation is supported.
What data is considered external?
Any data that isn’t in a Databricks lakehouse can be considered external data. Some examples of external data include the following:
- Foreign tables registered with Lakehouse Federation.
- Tables in the Hive metastore backed by Parquet.
- External tables in Unity Catalog backed by JSON.
- CSV data stored in cloud object storage.
- Streaming data read from Kafka.
Azure Databricks supports configuring connections to many data sources. See Connect to data sources.
While you can use Unity Catalog to govern access to and define tables against data stored in multiple formats and external systems, Delta Lake is a requirement for data to be considered in the lakehouse.
Delta Lake provides all of the transactional guarantees in Azure Databricks, which are crucial for maintaining data integrity and consistency. If you want to learn more about transactional guarantees on Azure Databricks data and why they’re important, see What are ACID guarantees on Azure Databricks?.
Most Azure Databricks users query a combination of lakehouse data and external data. Connecting to external data is always the first step for data ingestion and ETL pipelines that bring data into the lakehouse. For information about ingesting data, see Ingest data into an Azure Databricks lakehouse.
Query tables by name
For all data registered as a table, Databricks recommends querying using the table name.
If you’re using Unity Catalog, tables use a three-tier namespace with the following format: <catalog-name>.<schema-name>.<table-name>.
Without Unity Catalog, table identifiers use the format <schema-name>.<table-name>.
Note
Azure Databricks inherits much of its SQL syntax from Apache Spark, which does not differentiate between SCHEMA and DATABASE.
Querying by table name is supported in all Azure Databricks execution contexts and supported languages.
SQL
SELECT * FROM catalog_name.schema_name.table_name
Python
spark.read.table("catalog_name.schema_name.table_name")
Unity Catalog identifier resolution
Databricks recommends using fully-qualified identifiers when queries or workloads interact with database objects stored across multiple schemas or catalogs.
The following table outlines behaviors for partially qualified and unqualified identifiers:
Identifier pattern | Behavior |
---|---|
catalog_name.schema_name.object_name | Refers to the database object specified by the identifier. |
schema_name.object_name | Refers to the database object associated with the specified schema_name and object_name in the current catalog. |
object_name | Refers to the database object associated with the specified object_name in the current catalog and schema. |
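For example, assuming the current catalog is main and the current schema is default (placeholder names), the following references all resolve to the same table:

```python
# Assuming the current catalog is "main" and the current schema is "default"
# (placeholder names), all three reads resolve to the same table.
spark.read.table("main.default.sales_example")  # fully qualified
spark.read.table("default.sales_example")       # current catalog assumed
spark.read.table("sales_example")               # current catalog and schema assumed
```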
What is the current catalog and schema?
In interactive compute environments, use current_catalog() and current_schema() to confirm your current catalog and schema.
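For example, you can check both values in a single statement from a notebook:

```python
# Confirm the current catalog and schema for this session.
spark.sql("SELECT current_catalog(), current_schema()").show()
```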
All workspaces configured with Unity Catalog have a default catalog set at the workspace level. See Manage the default catalog.
The following table describes configurations for Databricks products that might override the workspace default catalog:
Product | Configuration |
---|---|
All-purpose or jobs compute | Set the Spark configuration spark.databricks.sql.initial.catalog.namespace when configuring compute. |
Delta Live Tables | The catalog and schema specified during pipeline configuration override the workspace defaults for all pipeline logic. |
Note
Default catalog or schema might also be set by JDBC configurations when connecting to external systems or metastores. Contact the administrator responsible for configuring your Databricks compute and integrated systems if you encounter unexpected default behavior.
Use the USE CATALOG or USE SCHEMA syntax to specify the current catalog or schema for your current session. The current catalog or schema is used when a query or statement uses a partially qualified or unqualified identifier.
Statement | Result |
---|---|
USE CATALOG catalog_name | Sets the current catalog using the provided catalog_name. Sets the current schema to default. |
USE SCHEMA schema_name | Sets the current schema using the provided schema_name in the current catalog. |
USE SCHEMA catalog_name.schema_name | Sets the current catalog using the provided catalog_name and the current schema using the provided schema_name. |
Note
Queries and commands that use fully qualified identifiers to interact with objects like tables, views, functions, or models do not change the current catalog or schema and always refer to the object specified.
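For example, the following sketch (with placeholder catalog, schema, and table names) sets the current catalog and schema and then runs an unqualified query:

```python
# Set the current catalog and schema for the session (placeholder names),
# then query a table with an unqualified identifier.
spark.sql("USE CATALOG main")
spark.sql("USE SCHEMA default")
spark.sql("SELECT * FROM sales_example").show()
```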
Query data by path
You can query structured, semi-structured, and unstructured data using file paths. Most files on Azure Databricks are backed by cloud object storage. See Work with files on Azure Databricks.
Databricks recommends configuring all access to cloud object storage using Unity Catalog and defining volumes for object storage locations that are directly queried. Volumes provide human-readable aliases to locations and files in cloud object storage, using catalog and schema names in the file path. See Connect to cloud object storage and services using Unity Catalog.
The following examples demonstrate how to use Unity Catalog volume paths to read JSON data:
SQL
SELECT * FROM json.`/Volumes/catalog_name/schema_name/volume_name/path/to/data`
Python
spark.read.format("json").load("/Volumes/catalog_name/schema_name/volume_name/path/to/data")
For cloud locations that aren’t configured as Unity Catalog volumes, you can query data directly using URIs. You must configure access to cloud object storage to query data with URIs. See Configure access to cloud object storage for Azure Databricks.
The following examples demonstrate how to use URIs to query JSON data in Azure Data Lake Storage Gen2, GCS, and S3:
SQL
SELECT * FROM json.`abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/data`;
SELECT * FROM json.`gs://bucket_name/path/to/data`;
SELECT * FROM json.`s3://bucket_name/path/to/data`;
Python
spark.read.format("json").load("abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/data")
spark.read.format("json").load("gs://bucket_name/path/to/data")
spark.read.format("json").load("s3://bucket_name/path/to/data")
Query data using SQL warehouses
Azure Databricks uses SQL warehouses for compute in the following interfaces:
- SQL editor
- Databricks SQL queries
- Dashboards
- Legacy dashboards
- SQL alerts
You can optionally use SQL warehouses with the following products:
- Databricks notebooks
- Databricks file editor
- Databricks Jobs
When you query data with SQL warehouses, you can use only SQL syntax. Other programming languages and APIs are not supported.
For workspaces that are enabled for Unity Catalog, SQL warehouses always use Unity Catalog to manage access to data sources.
Most queries that are run on SQL warehouses target tables. Queries that target data files should leverage Unity Catalog volumes to manage access to storage locations.
Using URIs directly in queries run on SQL warehouses can lead to unexpected errors.
Query data using all-purpose compute or jobs compute
Most queries that you run from Databricks notebooks, workflows, and the file editor run against compute clusters configured with Databricks Runtime. You can configure these clusters to run interactively or deploy them as jobs compute that power workflows. Databricks recommends that you always use jobs compute for non-interactive workloads.
Interactive versus non-interactive workloads
Many users find it helpful to view query results while transformations are processed during development. When you move an interactive workload from all-purpose compute to jobs compute, you can save time and processing costs by removing the queries that display results.
Apache Spark uses lazy code execution, meaning that results are calculated only as necessary, and multiple transformations or queries against a data source can be optimized as a single query if you don’t force results. This contrasts with the eager execution mode used in pandas, which requires calculations to be processed in order before passing results to the next method.
If your goal is to save cleaned, transformed, aggregated data as a new dataset, you should remove queries that display results from your code before scheduling it to run.
For small operations and small datasets, the time and cost savings might be marginal. With large operations, however, substantial time can be wasted calculating and printing results to a notebook that might never be manually inspected. After you store the output, the same results can usually be queried from the saved data at almost no cost.
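As an illustration, in the hypothetical pipeline below, the commented-out display() call is the kind of result-displaying query to remove before scheduling. The only remaining action is the final write, so Spark can plan the transformations together. Table and column names are placeholders:

```python
# Placeholder table and column names; adapt to your data.
raw = spark.read.table("main.default.raw_orders")

cleaned = (
    raw.filter("amount IS NOT NULL")
       .withColumnRenamed("ts", "order_timestamp")
)

# During interactive development you might inspect intermediate results:
# display(cleaned)  # remove this before scheduling the code as a job

# The write is the only action, so Spark optimizes the filter, rename,
# and aggregation as a single job when this runs on jobs compute.
(
    cleaned.groupBy("customer_id")
           .sum("amount")
           .write.mode("overwrite")
           .saveAsTable("main.default.customer_order_totals")
)
```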