Nearest Neighbors

Hi Folks,

Spatial users often want to find the object nearest a given point.  This operation, usually referred to as nearest neighbor search, is remarkably common in many areas of computer science.  In general, we may wish to find not only the nearest, but the k-nearest neighbors.

How can we accomplish this with SQL Server?  Here we’ll look at finding the single nearest neighbor; the extension to k-nearest neighbors is relatively straight forward.

First, let's examine the naive method for accomplishing this: simply order the table by distance and restrict the results.  For all of these examples, we’ll assume a table T with a spatial column g, as well as a parameter @x containing the search point:

 SELECT TOP(1) *
 FROM T 
 ORDER BY g.STDistance(@x) ASC

This solution certainly has simplicity on its side, but consider the work that needs to be done.  The entire table must be scanned, and the distance of each to the search point must be calculated.  Ouch.

We could conceivably improve on this by restricting our search space to to the immediate region around the target point:

 DECLARE @region geography = @x.STBuffer(10000)
  
 SELECT TOP(1) *
 FROM T 
 WHERE g.Filter(@region) = 1
 ORDER BY g.STDistance(@x) ASC

But this solution requires that we know our data very well: if there are no rows in the region, then we will fail to find the nearest neighbor; if there are too many, then we will again be left with a rather inefficient query.

How do we escape this morass?  We can do so by starting with a very small region---so small that we can be certain not to encounter too many results---and then keep enlarging it until we find something.  Doing this with a loop is not hard, but Steven Hemingray showed me how to do this with entirely declarative syntax:

 DECLARE @start FLOAT = 1000; 
  
 WITH NearestPoints AS
 (
    SELECT TOP(1) WITH TIES *,  T.g.STDistance(@x) AS dist
    FROM Numbers JOIN T WITH(INDEX(spatial_index)) 
    ON T.g.STDistance(@x) < @start*POWER(2,Numbers.n)
    ORDER BY n
 )
 SELECT TOP(1) * FROM NearestPoints
 ORDER BY n, dist

This requires some explanation.  First, the @start parameter gives the initial region to search.  I’ve chosen one kilometer, but this can be adjusted downward if your data is very dense.  Second, you’ll notice that we make use of a Numbers table, which just contains the numbers 1 through n.  This  just contains a long list of integers, which is is useful in many situations.

The inner query examines a set of exponentially-expanding regions.  The ORDER BY clause along with the TOP(1) allows the query to stop as soon as it finds the smallest non-empty region.  The WITH TIES statement makes sure that all of the objects in that region will be in the result set.

Once the inner query returns a list of potential results, the outer query examines them to find which is actually nearest.  With this approach, we can select a start area small enough to keep the cost low in dense data, but also be guaranteed to find a distant nearest neighbor.

Incidentally, if you don’t already have a numbers table, you can create one quite quickly with some mildly-black magic like this:

 SELECT TOP 100000 IDENTITY(int,1,1) AS n 
 INTO numbers 
 FROM MASTER..spt_values a, MASTER..spt_values b
  
 CREATE UNIQUE CLUSTERED INDEX idx_1 ON numbers(n)

This isn’t a particularly pretty solution, but to proactively answer a question, we didn’t add a method for this primarily because we ran out of time.  Look for something more built-in the next go around.

Cheers,

-Isaac

Comments

  • Anonymous
    October 22, 2008
    Very innovative! (and I trust, more efficient that STDistance() - if I find the time I might do some timing tests of the three methods) But, ouch, that code is not very accessible unless you've got fairly advanced T-SQL knowledge.... I suspect many users for that reason will still end up using options 1) or 2). (Or perhaps, copying your code for 3) verbatim.) Would definitely be good to have this sort of thing built-in. Keep it on the wish-list! (But, do something to address the lack of spatial features in SSIS and SSRS first.) Keep up the good work!

  • Anonymous
    October 22, 2008
    The comment has been removed

  • Anonymous
    October 23, 2008
    Hmmm... interestingly, I'm not finding conclusive empiricial evidence to suggest that this approach is necessarily faster than the other two - I'm only doing some quick tests so my trials are not necessarily random or fair - has anybody else tried this? It will be hard to prove conclusively, since I guess the speed of each approach depends heavily on the actual geographic spread of data, the properties of the index(s) applied, and the values chosen for the Buffer Size in option 2) and the starting value @start in option 3) (which, like the buffer size, must be set to a fairly arbitary value)...

  • Anonymous
    October 23, 2008
    Interesting.  Let me try some simple tests.  For this, let's call the three methods above "naive", "limited", and "fancy". My a priori expectation would be that the time for the last two methods---limited and fancy---would be comparable, assuming the initial search size were the same.  But keep in mind that the limited solution is not guaranteed to produce a result. I did a quick comparison of the three methods on three data sets: a table of 11,427,230 US businesses, a table of the 212,942 census block groups, and a table of 32,108 zip code polygons.  For both, I looked for the nearest neighbor of the point 44 N 123 W  (POINT (-123 44)).  The results are from my dual-core laptop, and all elapsed times in milliseconds: Businesses


Naive: 357235 Limited: 7 Fancy: 18 I should note that the first time around, the limited solution failed to produce a result, since the nearest point was ~30 km away.  To make the results more fair, I moved my search point so there was an object within a kilometer, and dropped the search radius to 1 km.  This illustrates the fragility of the method. Also, these timings of exactly one run---after warming the caches, of course.   I kept the same search parameters for the census and zip-code data: Census

Naive: 65367 Limited: 16 Fancy: 24 Zip Codes

Native: 18921 Limited:  62 Fancy: 94 So, at least on some of my data, the limited solution seems the fastest (when it works), but both the limited and fancy solutions vastly outperform the naive implementation. Cheers, -Isaac

  • Anonymous
    October 23, 2008
    The comment has been removed

  • Anonymous
    October 26, 2008
    Interesting approach, however how could this be extended for a result of n nearest neighbors?

  • Anonymous
    October 27, 2008
    The comment has been removed

  • Anonymous
    November 03, 2008
    The comment has been removed

  • Anonymous
    November 03, 2008
    Craig, I'll try to post an n-nearest neighbor extension soon, but yes: this can be extended. Bob, Your problem is easier, and the query you post is the right one.  I'm curious how slow your query is, and how many results you're finding.   Keep in mind that if there are many results, we may not be able to do it terribly fast, since we have to actually calculate the distance for many of them.  The index will help, both in eliminating the results that are way out of bounds, as well as reducing the distance calculations for those that are very close to the center point, but anything near the boundary will have to be calculated. Cheers, -Isaac

  • Anonymous
    November 03, 2008
    Thanks for the quick reply Isaac. My query is taking 6 seconds and returns approx. 170K companies when searching for companies within 100km of Microsoft.

  • Anonymous
    November 03, 2008
    Hi Bob, I'm not terribly surprised by these results for the reasons I mention above.  Out of curiosity, though... First, for these queries, declare the following: declare @point geography = 0xE6100000010CEE7C3F355ED24740D578E92631885EC0 declare @circle geography = @point.STBuffer(100000) With these definitions (really, with @circle) first try: SELECT count() FROM dbo.Company WITH(INDEX(six_Company_geography)) WHERE g.STIntersects(@circle) = 1 This should essentially do the same thing your query does.  I'd expect very similar performance. Then try: SELECT count() FROM dbo.Company WITH(INDEX(six_Company_geography)) WHERE g.Filter(@circle) = 1 This query will only use the index and skip the actual spatial calculations.  I'd expect it to give a greater count than your original query, but run significantly faster because it won't have to compute any actual distances. Assuming I'm right, you're kind of stuck with the expensive query.  But at least you know where your time is going.  :) Cheers, -Isaac

  • Anonymous
    November 13, 2008
    One more question....my first test was with all geography values populated in my table. In my actual data, however, there are a non-trivial number of NULL values. This seems to affect the number of logical and physical reads significantly. The spatial index seems to be returning the NULL values before the filter removes them? Does that make sense?

  • Anonymous
    November 13, 2008
    Why is the filter method expensive?

  • Anonymous
    November 15, 2008
    Doh! Forget about my first comment re: performance - this method is quicker than the brute force approach - I had something funny going on with the index that this approach was using (don't even ask). Still think the Filter()/Buffer() method is useful in several situations though (particularly in applications where nearest-neighbor is combined with an additional distance limit - "Show me all the three closest petrol stations to my house, so long as none of them are further than 25km away", for example) I think I might have a slight correction to the code though. Currently, the example is not extendable to include the general case of k-nearest neighbors. If you try the following, for example: DECLARE @start FLOAT = 1000; WITH NearestPoints AS (   SELECT TOP(5) WITH TIES *,  T.g.STDistance(@x) AS dist   FROM Numbers JOIN T WITH(INDEX(spatial_index))   ON T.g.STDistance(@x) < @start*POWER(2,Numbers.n)   ORDER BY n ) SELECT TOP(5) * FROM NearestPoints ORDER BY n, dist Then you (potentially) get duplicate rows in the output. This is because you might find only one feature that lies within the search area T.g.STDistance(@x) < @start*POWER(2,1). In order to select the TOP 5 features, the query carries on to select those features where T.g.STDistance(@x) < @start*POWER(2,2), which, obviously, includes that same feature again. To correct this, you need to sort the set of candidate features by the largest zone in which they were located (ORDER BY n DESC), then by the distance which they were away: DECLARE @start FLOAT = 1000; WITH NearestPoints AS (   SELECT TOP(1) WITH TIES *,  T.g.STDistance(@x) AS dist   FROM Numbers JOIN T WITH(INDEX(spatial_index))   ON T.g.STDistance(@x) < @start*POWER(2,Numbers.n)   ORDER BY n ) SELECT TOP(1) * FROM NearestPoints ORDER BY n DESC, dist

  • Anonymous
    March 13, 2009
    Your statement - "The entire table must be scanned" doesn't follow, no more than a table scan is necessary to do "SELECT TOP 1" from a table with a conventional index. The problem here is that the query optimiser doesn't recognise that it can use the spatial index to eliminate the table scan.

  • Anonymous
    April 09, 2009
    I have a table with name, geom, type that have streets (linestrings) I try to get the nearest 10 streets, but sometimes the same street have various records, how I can group by same name and make a union of the geom (the various linestrings on one) ? any help will be greatly appreciated (sorry me bad english)