K-nearest neighbor spatial search

I apologize to everyone for the hiatus - I realise a post is more than a little overdue, and I'll try to post more regularly in the future. I appreciate the support I've received for the blog and the encouragement to continue it.

Consider the following problem: you're starting a website for a chain of restaurants, and you want to have a "Locations" feature where a person can enter their address and get a list of the 10 outlets closest to their location, in order of distance from their location (as the crow flies, for simplicity). Moreover, each time they hit the "next" button, they should get the next 10 locations in order of distance.

More abstractly, say you have a list of locations on a 2D map, each specified as a point with an x and y coordinate. You can do any sort of preprocessing you like of this list of points. Then I ask you to efficiently answer: given some point, which is not necessarily one of the points in the list, produce a list of the k points closest to that point, in order of their distance from that point.

This problem, called the k-nearest neighbor problem, is very well-studied and has a variety of solutions. Here we describe one very simple but effective incremental algorithm, first proposed by Hjaltason and Samet in their 1995 paper "Ranking in Spatial Databases" (available on CiteSeer). A spatial database is a database specialized to perform this type of query over its (potentially very large) data set.

Consider the following very simple but inefficient algorithm: we take each point, one at a time, and add it to a priority queue with its key equal to the distance from the search point. We then repeatedly extract the minimum to obtain the points in order of their distance from the search point. Once we've extracted the first k, we're done, and with a good data structure this requires only O(k log n) time. Unfortunately, even with the best priority queue data structures based on Fibonacci heaps, adding all n points requires amortized linear (O(n)) time and space, which for a very large number of points as one would expect to find in a spatial database is prohibitive.
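To make the comparison concrete, here is a minimal Python sketch of this naive approach; the function name and the representation of points as (x, y) tuples are my own choices, not anything from the paper.

```python
import heapq
import math

def k_nearest_naive(points, query, k):
    """Naive k-nearest search: put every point into a heap keyed by its
    distance to the query, then pop the k smallest. Building the heap
    touches all n points, so every query costs O(n) time and space."""
    qx, qy = query
    heap = [(math.hypot(x - qx, y - qy), (x, y)) for x, y in points]
    heapq.heapify(heap)  # O(n)
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]  # O(k log n)
```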

One approach to narrowing down the points we have to examine is to split the space into larger regions. The idea is that points close to the search point should be found in regions close to the search point. In more detail, we first divide the space into two rectangular regions, each containing some of the points. Each of these regions is, in turn, subdivided into two rectangular regions, and we continue in this manner until all the resulting regions have some small constant number of points in them. The result is a tree of regions, where each node is a subregion of its parent node, and each point lies in one of the leaves. It really doesn't matter how we choose to divide up the space - the authors used PMR quadtrees, but you could use kd-trees, simple quadtrees, or whatever you like. All that matters is that in the end each leaf region contains only a small number of points.
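As a rough illustration of this preprocessing step, here is a Python sketch that recursively subdivides a set of points until every leaf region holds only a couple of points. It uses a simple kd-tree-style split along alternating axes purely for concreteness; the Region class and LEAF_SIZE constant are names of my own, and as noted above the paper itself uses PMR quadtrees.

```python
LEAF_SIZE = 2  # maximum number of points kept in a leaf region

class Region:
    """A node in the region tree: every node records its bounding
    rectangle; internal nodes have two child regions, leaves hold at
    most LEAF_SIZE points."""
    def __init__(self, points, depth=0):
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        self.rect = (min(xs), min(ys), max(xs), max(ys))  # (xmin, ymin, xmax, ymax)
        if len(points) <= LEAF_SIZE:
            self.points, self.children = list(points), []   # leaf: keep the points
        else:
            axis = depth % 2                                 # alternate x/y splits
            pts = sorted(points, key=lambda p: p[axis])
            mid = len(pts) // 2
            self.points = []
            self.children = [Region(pts[:mid], depth + 1),
                             Region(pts[mid:], depth + 1)]
```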

Creating the regions requires examining all the points, but the key is that this part does not depend on the search point, and so only has to be done once in preprocessing. The regions may require rebalancing as points are added or removed, but this can be done efficiently as well.

However, there's a trick to this: if the search point lies near the boundary of a region, its closest points may actually lie in a different, adjacent region. How can we correctly locate the nearest point without spuriously gathering all the points in all the regions around the point?

Here's the key insight: our priority queue will contain two types of objects, single points and regions. The key of each point is set to its distance from the search point, while the key of each region is set to the minimum distance of that region from the search point. Each time we extract a point from the queue, we output it; each time we extract a region from the queue, we expand it by adding its subregions to the queue, or, if it's a leaf region, the points it contains. Because the region's key is the minimum distance from the search point to that region, any point output before the region is expanded is closer to the search point than any point in the region. This ensures correctness.
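Building on the Region sketch above, a minimal Python version of this mixed queue might look like the following. The names min_dist_to_rect and incremental_nearest are my own, and the tiebreaking counter is there only because Python's heapq would otherwise try to compare the queued objects themselves when two keys are equal.

```python
import heapq
import itertools
import math

def min_dist_to_rect(q, rect):
    """Minimum distance from the search point q to an axis-aligned
    rectangle; zero when q lies inside it."""
    xmin, ymin, xmax, ymax = rect
    dx = max(xmin - q[0], 0.0, q[0] - xmax)
    dy = max(ymin - q[1], 0.0, q[1] - ymax)
    return math.hypot(dx, dy)

def incremental_nearest(root, q):
    """Yield the stored points in nondecreasing distance from q.
    The heap mixes regions (keyed by the minimum distance from q to
    their rectangle) and points (keyed by their exact distance)."""
    tiebreak = itertools.count()
    heap = [(min_dist_to_rect(q, root.rect), next(tiebreak), root)]
    while heap:
        _, _, item = heapq.heappop(heap)
        if isinstance(item, Region):
            # Expand: enqueue subregions, or the points of a leaf region.
            for child in item.children:
                heapq.heappush(heap, (min_dist_to_rect(q, child.rect),
                                      next(tiebreak), child))
            for px, py in item.points:
                heapq.heappush(heap, (math.hypot(px - q[0], py - q[1]),
                                      next(tiebreak), (px, py)))
        else:
            # A point: every region still in the queue is at least this far away.
            yield item
```

Written as a generator, it produces points one at a time in order of distance, which is exactly the incremental behaviour discussed at the end of the post.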

[Figure: k-nearest neighbor spatial search example]

For example, consider the search problem shown to the right. There are 9 points, and we wish to find the 5 closest points to the big red search point in order. The space is recursively subdivided into regions as shown in the tree.

For the purpose of discussion we will refer to each region by the set of points it contains. Initially, the queue contains only the large region containing all points.

Current queue in order: {1,2,3,4,5,6,7,8,9}

This contains the red search point, so its minimum distance from the search point (its key) is zero. We remove it from the queue and add its two children {1,3,4,8} and {2,5,6,7,9}. The region {2,5,6,7,9} again contains the red search point and so has key zero. The minimum distance to the region {1,3,4,8} is shown by the red line extending out of the red point and to the left and is larger than zero.

Current queue in order: {2,5,6,7,9}, {1,3,4,8}

We extract and expand {2,5,6,7,9}, getting leaf region {2} and region {5,6,7,9}. The distance to region {2} is shown by the red line extending upward from the red point, which is clearly longer than the distance to region {1,3,4,8}. The region {5,6,7,9} has distance zero.

Current queue in order: {5,6,7,9}, {1,3,4,8}, {2}

We extract and expand {5,6,7,9} to get the regions {5,9} and {6,7}. The distance to region {6,7} is shown by the red line extending right from the red point, the longest distance so far encountered. The region {5,9} contains the red point and has distance zero.

Current queue in order: {5,9}, {1,3,4,8}, {2}, {6,7}

Next, we extract and expand region {5,9}. This is a leaf region, so we add the points 5, 9 to the queue. Point 5 is only slightly closer to the red point than region {2}, but farther than region {1,3,4,8}. Point 9 is farther than everything so far encountered.

Current queue in order: {1,3,4,8}, 5, {2}, {6,7}, 9

We extract and expand {1,3,4,8} to get {1,3,4}, at the same distance as {1,3,4,8}, and {8}, the distance of which is shown by the diagonal red line.

Current queue in order: {1,3,4}, {8}, 5, {2}, {6,7}, 9

We extract and expand {1,3,4}, which is a leaf region, and add the three points. Point 4 is closest, point 3 lies between {6,7} and 9 in distance, and point 1 is slightly closer than point 9.

Current queue in order: 4, {8}, 5, {2}, {6,7}, 3, 1, 9

We extract and output point 4, the closest point. We expand region {8} and add point 8, which is slightly farther than point 3.

Current queue in order: 5, {2}, {6,7}, 3, 8, 1, 9

We extract and output point 5. We expand region {2} and add point 2, which is slightly closer than point 8.

Current queue in order: {6,7}, 3, 2, 8, 1, 9

We extract and expand region {6,7}, adding points 6 and 7. Point 6 is the farthest seen so far, and point 7 is farther still.

Current queue in order: 3, 2, 8, 1, 9, 6, 7

We've now expanded all regions, and the points remaining in the queue are exactly the points not yet output, in order of their distance from the search point.

The nice thing about this particular algorithm is that it's fully incremental: we could have stopped at any time once we had enough points and picked up later when we wanted more. This makes it quite appealing for database queries, where you don't really know how many points you need in advance, because you might want to filter the results according to additional attributes. In practice, in a typical spatial database, only a small fraction of the regions need to be expanded to find a small number of closest points. The idea extends to higher-dimensional spaces, to spatial objects other than points such as areas and line segments, and even to more general problems where one can easily compute the minimum key of a subtree of objects.
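As a hypothetical usage of the sketches above (the coordinates below are made up), paging through results only requires pulling more values from the same generator:

```python
import itertools

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]  # made-up data
tree = Region(points)                          # preprocess once
neighbors = incremental_nearest(tree, (6, 3))  # lazy generator of results
first_page = list(itertools.islice(neighbors, 3))  # the 3 closest points
next_page = list(itertools.islice(neighbors, 3))   # the next 3, on demand
```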

I hope you found this informative and I appreciate your comments.

Comments

  • Anonymous
    June 08, 2007
    nice to see u back posting...

  • Anonymous
    June 08, 2007
    Derrick, I like your blog - it keeps me sharp. A few questions:

  1. Why do you extract {1,3,4} into 3 literals instead of subdividing it into new regions (say, {1} and {3,4})?
  2. Couldn't you modify the algorithm so that the key of a rectangle is equal to the minimum distance of the points it contains? Then you could avoid the "surprise" that occurs when you expand {2}->2 and {8}->8.
  3. Finally, what is the running time of this algorithm? Gabe
  • Anonymous
    June 08, 2007
    Hi Gabe, thanks for your compliments and questions. To respond:
  1. I left the region {1,3,4} containing 3 points just to make the point that the regions don't have to be broken down all the way to a single point, but it's arbitrary how you divide them up. I certainly could have further divided it into subregions.
  2. The problem with this is that determining the minimum distance of the points it contains requires examining all the points in the region, which for a region containing many points is much slower than simply examining in constant time the specified boundaries of the region. Moreover, this minimum can change over time as points are added and removed. If the search point were fixed, you could maintain this minimum at each tree node with reasonable cost, but in the motivating example, it's more likely to change with every search.
  3. Running time: the paper doesn't do a very precise analysis, but in general the worst-case analysis isn't too promising, since one could arrange the points and regions into a pathological case where all the leaf regions are closer than the nearest point. The worst-case space is O(N) and worst-case time is O(N log N), as in other naive algorithms. In practice, however, the queue size tends to have a much smaller maximum size M, and the time to extract all N points is O(N log M). Hope this gives you some more insight.
  • Anonymous
    June 11, 2007
    Derrick, Regarding #2: I see your point about having to examine the points in the region, but how are regions chosen in the first place? Presumably you would want to choose subregions such that each has roughly half of the points in the parent region. But doing this requires examining some of the points in the region anyways, doesn't it?

  • Anonymous
    June 11, 2007
    To answer your question, Gabe, the subregions are only chosen once, before any queries are received. This is possible because the choice of subregions does not depend on the search point, whereas the minimum distance to each point in a region does. I may not have been clear about this.

  • Anonymous
    June 16, 2007
    Derrick, Reading your post came at an opportune time. I recently had a problem with spatial search. I'm using quad-trees to pick points with the mouse (representing segments of audio) projected to the screen. If I need to rank them by distance instead of just bounds-checking, I will certainly use this approach. My encounter is described here: http://cola-fan.livejournal.com/127333.html Fantastic to see you posting again.

  • Anonymous
    September 11, 2007
    FYI, an overview of nearest neighbor search algorithms can be found at http://en.wikipedia.org/wiki/Nearest_neighbor_(pattern_recognition)