Share via


Tuning Spatial Point Data Queries in SQL Server 2012

Tuning Spatial Point Data Queries in SQL Server 2012

Written by: Ed Katibah, Milan Stojic, Michael Rys, Nicholas Dritsas

Technical reviewers: Chuck Heinzelman

Introduction

Spatial Point Data queries require particular tuning efforts to enhance performance and improve overall application through put.  SQL Server 2012 introduces a few key new features and improvements to assist you in that goal.

Below, we go over a few key best practices and suggestions on how to achieve that.

Optimize the primary key clustered index

To create an index on a spatial column, the table must provide a primary key.  SQL Azure extends this requirement by requiring that each table have a clustered primary key.

Spatial indexing basics

Spatial query execution with an index contains two parts: primary filter (spatial index lookup) and secondary filter (original spatial predicate). So with the queries like “STDistance() < x”, SQL Server will:

  • identify a candidate set of cells
  • run an index seek over the spatial index (primary filter)
  • join candidate values with the base table to get the actual spatial objects
  • filter any false positives by performing the spatial operation (secondary filter)

                                             

The secondary filter can be avoided by using Filter(@)=1 operation instead of STIntersects(@x)=1. The query performance can be much better but false positives are possible in the result. If the spatial query is not too large, for HHHH spatial index, tolerance will be around 100-200m when using the Geography data type.

STDistance(@x) < @range operator can be replaced with Filter(@x.BufferWithTolerance(@range,1,1)) =

If returning some false positives is not an option, IO performance of the base table can still be optimized.

Any spatial index is already optimized to minimize IO. Cells are places in a specific pattern so that spatially closer cells sit physically close to each other. The following picture illustrates the cell pattern used (a Hilbert space-filling curve is used in SQL Server).

 

 Therefore, the reading pattern for an index seek over spatial index might look like:

 

However, as the join to the base table is needed to get actual spatial objects, the IO pattern for the clustered index seek (the base table) might look like:

 

 This IO pattern will result in reading many more pages from the disk and will have high overall impact on spatial query performance.

 So, in order to get better IO performance for the clustered index seek, the Primary Key needs to be correlated with spatial index cell pattern. This can be achieved by adding point coordinates components to the clustered index.

 For our sample data, here is how the original table containing the point column was specified:

CREATE TABLE [dbo].[Points](
       [id]   [int] NOT NULL,
       [type] [int] NOT NULL,
       [geo] [geography] NULL
  PRIMARY KEY CLUSTERED 
    ( 
       [id] ASC,
       [type] ASC
));

Spatial indexes reference the primary key to access the base table rows.  Query performance can often be significantly enhanced by creating a clustered index on the primary key which incorporates the individual ordinates of the underlying point coordinate.  This technique allows the primary key to be “spatially” ordered thereby minimizing the number of disk seeks required to fetch the rows of base table data requested by the spatial index.

On occasion, the individual point coordinate ordinates can be found in their own columns in the base table along with the column of points (spatial object).  When this is the case, the individual point ordinates can be referenced directly in the clustered primary key DDL.  When this is not the case, the table schema can be expanded to include two persisted computed columns which contain the individual ordinates of the underlying point coordinates. 

It is important to note that the new point ordinate columns must be the first two columns referenced in the clustered primary key. For geometry coordinates, they should be in the x, y order.  For geography data they should be in latitude, longitude order.

For the sample data table, it is necessary to expand the schema as follows, creating two new computed persisted columns using spatial methods and redefining the clustered primary key index with these two new columns (note the use of the built-in Lat and Long methods to extract the individual ordinates):

CREATE TABLE [dbo].[Points](
  [id]   [int] NOT NULL,
  [type] [int] NOT NULL,
  [geog] [geography] NOT NULL,
  [geo_lat]  AS ([geog].[Lat])  PERSISTED NOT NULL, -- *
  [geo_lon]  AS ([geog].[Long]) PERSISTED NOT NULL, -- *
  PRIMARY KEY CLUSTERED 
    (
      [geo_lat] ASC, 
      [geo_lon] ASC, 
      [id] ASC,
      [type] ASC
));

* not null due to use in primary key

NOTE: Persistence of the x,y / lat,long points for the spatial object as the first columns in the clustered key adds 16 bytes to the key, which is then duplicated in both the base table as well as the spatial index itself, as well as every non-clustered index added to the table. The regular guidance for clustered index keys is to keep them as small as possible. Even though this could potentially improve spatial index performance, users should be aware of the potential negative consequences.

Optimize the spatial index

The default spatial index has traditionally been recommended with grids all set to MEDIUM based on a general spatial workload.  This was original index definition:

CREATE SPATIAL INDEX [table_geog_sidx] 
  ON [dbo].[table]([geog])
    USING  GEOGRAPHY_GRID 
      WITH (GRIDS =(LEVEL_1 = MEDIUM,
                    LEVEL_2 = MEDIUM,
                    LEVEL_3 = MEDIUM,
                    LEVEL_4 = MEDIUM), 
            CELLS_PER_OBJECT = 16;

Spatial query performance can often be improved significantly by selecting spatial index options which differ from default settings for specific spatial shapes and distributions.  In the case of point data, it has been found in most, if not in all, cases that spatial indexes with all grid levels set to HIGH outperform other configurations.  Since we are dealing with point data, the CELLS_PER_OBJECT setting is irrelevant and can be set to any legal value (1-8192) without effect. Here is the spatial index reconfigured to grid levels with the optimal setting for points, all HIGH:

CREATE SPATIAL INDEX [table_geog_sidx] 
  ON [dbo].[table]([geog])
    USING  GEOGRAPHY_GRID 
      WITH (GRIDS =(LEVEL_1 = HIGH,
                    LEVEL_2 = HIGH,
                    LEVEL_3 = HIGH,
                    LEVEL_4 = HIGH), 
            CELLS_PER_OBJECT = 16;

 
Consider using a spherical Earth model  if using geography data type

 If you are using SQL Azure or SQL Server 2012 and your accuracy requirements for the spatial results do not require that an ellipsoidal model be used, additional query performance can potentially be gleaned by using a spherical model as the basis for the underlying spatial coordinate system due to the simpler mathematics involved in calculations. Additionally, the STDistance(), STLength() and ShortestLineTo() methods are optimized to run faster on a sphere than on an ellipsoid.

 In SQL Server 2012 and SQL Azure, there is a new entry for a unit sphere in the  sys.spatial_reference_systems view. This new entry has a spatial reference identifier (SRID) = 104001 and is listed as a unit sphere.

 The native units of measure output when using the unit sphere are radians. To compute measures in real world values, such as length and area in meters, simply multiply results by the radius of the desired output sphere as follows:

  • For linear measure (length, distance, buffer distance):   length * (sphere radius)
  • For area measure:   area * (sphere radius) * (sphere radius)

 In the case of our sample data, the original coordinate system was specified as World Geodetic Reference System of 1984 (WGS84).  In SQL Server and SQL Azure, this is identified as spatial reference identifier (SRID) 4326. This is the most common coordinate system for use with the geography data type (most commercial spatial data uses these coordinate systems, as do GPS receivers). For many web mapping programs which reference WGS84 (Bing Maps, Google Maps, etc.), a radius of 6,378,137 meters is used for their underlying spherical spatial reference system (“Spherical Mercator”).

 To update the SRID for each point object to the unit sphere reference system in the sample spatial table, the following T-SQL can be used (note the use of the STSrid method):

UPDATE Points
  SET geog.STSrid = 104001

 
Performance testing

 To test these tuning recommendations, a customer-driven scenario was used along with their data.  To provide an idea of the scope of the query, the following stored procedure is provided as an example:

 /****** Object:  StoredProcedure [dbo].[TEST]    Script Date: 12/13/2011 3:14:50 PM ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

ALTER PROCEDURE [dbo].[TEST]
AS
BEGIN
  -- SET NOCOUNT ON added to prevent extra result sets from
  -- interfering with SELECT statements.
  --set statistics time on
  --set statistics IO on

  SET NOCOUNT ON;
  DECLARE @Location geography,
          @continent integer

  SELECT @Location = CI.city_center_coordinates,
         @continent = CO.continent_id
    FROM dbo.geo2_city AS CI
      INNER JOIN dbo.geo2_country AS CO ON CO.country_id = CI.country_id
    WHERE CI.city_id = 9395
      AND CI.rec_status > -1
      AND CO.rec_status > -1 

      SET @location = geography::Point(@location.Lat, @location.Long, 4326)

      SELECT GI.geo_id, GI.geo_info.STDistance(@Location)
        FROM db_core.dbo.geo_informations2 AS GI with(nolock,
            INDEX(SIndx_geography_informations2)) 
          INNER JOIN  dbo.product_hotels (nolock) AS PH 
            ON (PH.hotel_id = GI.geo_id) 
          WHERE GI.geo_type = 7
            AND PH.rec_status = 1
            AND PH.region_id = 2 

            AND GI.geo_info.STDistance(@Location) < 10000
END    

 For discussion clarity, the following T-SQL, pre-optimization, captures the core spatial query used in the stored procedure (the SQL Server spatial methods are highlighted). This corresponds to the T-SQL illustrated in the light blue box, above.

SELECT p.geo_id, p.geo_info.STDistance(@Location)
  FROM Points AS p WITH(nolock,   
      INDEX(SIndx_geography_informations)) –- original, non-optimized index 
    WHERE GI.geo_info.STDistance(@Location) < 10000 -– 10KM (10,000 meters)

When the coordinate system was changed from SRID=4326 to SRID=104001, it was necessary to rewrite the query as follows to accommodate the coordinate system units change and assure that the new, optimized spatial index was used:

SELECT GI.geo_id, GI.geo_info.STDistance(@Location) * 6378137 
  FROM Points_UnitSphere AS GI WITH(nolock,   
      INDEX(geo_info_HHHH_sidx)) –- optimized index 
    WHERE GI.geo_info.STDistance(@Location) * 6378137 < 10000 – 10KM

Adding the multiplication in the WHERE clause causes the spatial index to be ignored in the query plan (in fact, the query compiler complains about the use of the spatial index hint, when written in this fashion).  To alleviate this issue, the query can be rewritten as follows:

SELECT p.id, p.info.STDistance(@Location) * 6378137 
  FROM Points_UnitSphere AS p WITH(nolock,   
      INDEX(geo_info_HHHH_sidx))  
    WHERE GI.geo_info.STDistance(@Location) < 10000/6378137 -- 10KM

 
Conclusions

 Quick summary of the potential techniques to improve spatial point data performance:

  • For point data, create an “all HIGH” manual grid spatial index
  • Use the Filter operation to avoid secondary filter, or
  • Create a clustered primary key on point table using point coordinate components
  • For the Geography data type, using a spherical Earth model instead of ellipsoidal model
  • Use the spatial_window_max_cells query hint to fine tune query performance

Prior to optimization, use of the SQL Server STDistance method in the SELECT and WHERE clauses returned an average performance of 30 milliseconds.

After optimization (new table with point clustered primary key, new point index and use of the sphere-optimized methods), use of the SQL Server STDistance method in the SELECT and WHERE clauses returned an average performance of 18 milliseconds.

Description

Execution Time in ms (average*)

Original Spatial Query

30

Optimized Spatial Query

18

* each query for this test was executed 1,000 times for determination of the average time


Other Languages