Compartir a través de


Bookmark Lookup

In my last post, I explained how SQL Server can use an index to efficiently locate rows that qualify for a predicate.  When deciding whether to use an index, SQL Server considers several factors.  These factors include checking whether the index covers all of the columns that the query references (for the table in question).

What does it mean for an index to cover a column?

The heap or clustered index for a table (often called the “base table”) contains (or covers) all columns in the table.  Non-clustered indexes, on the other hand, contain (or cover) only a subset of the columns in the table.  By limiting the set of columns stored in a non-clustered index, we can store more rows on each page.  Thus, we save disk space and improve the efficiency of seeks and scans by reducing the number of I/Os and the number of pages that we touch.

Each non-clustered index covers the key columns that were specified when it was created.  Also, if the base table is a clustered index, each non-clustered index on this table covers the clustered index keys (often called the “clustering keys”) regardless of whether they are part of the non-clustered index’s key columns.  In SQL Server 2005, we can also add additional non-key columns to a non-clustered index using the “include” clause of the create index statement.

For example, given this schema:

create table T_heap (a int, b int, c int, d int, e int)

create index T_heap_a on T_heap (a)

create index T_heap_bc on T_heap (b, c)

create index T_heap_d on T_heap (d) include (e)

create table T_clu (a int, b int, c int, d int, e int)

create unique clustered index T_clu_a on T_clu (a)

create index T_clu_b on T_clu (b)

create index T_clu_ac on T_clu (a, c)

create index T_clu_d on T_clu (d) include (e)

The covered columns for each index are:

Index

Covered Columns

T_heap_a

a

T_heap_bc

b, c

T_heap_d

d, e

T_clu_a

a, b, c, d, e

T_clu_b

a, b

T_clu_ac

a, c

T_clu_d

a, d, e

Note that order is not relevant for covered columns.

How does this relate to index seeks and bookmark lookups?

Consider this query (using the above schema):

select e from T_clu where b = 2

At first glance, this query looks like a perfect candidate for an index seek using the non-clustered index on column b (T_clu_b) with the seek predicate “b = 2”.  However, this index does not cover column e so a seek or scan of this index cannot return the value of column e.  The solution is simple.  For each row that we fetch from the non-clustered index, we can lookup the value of column e in the clustered index.  We call this operation a “bookmark lookup.”  A “bookmark” is a pointer to the row in the heap or clustered index.  We store the bookmark for each row in the non-clustered index precisely so that we can always navigate from the non-clustered index to the corresponding row in the base table.

The following figure illustrates a bookmark lookup:

Showplan Samples

In SQL Server 2000, we implement bookmark lookup using a dedicated iterator whether the base table is a clustered index or heap:

  |--Bookmark Lookup(BOOKMARK:([Bmk1000]), OBJECT:([T_clu]))
|--Index Seek(OBJECT:([T_clu].[T_clu_b]), SEEK:([T_clu].[b]=2) ORDERED FORWARD)

In SQL Server 2005, we use a regular clustered index seek if the base table is a clustered index or a RID (record id) lookup if the base table is a heap:

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T_clu].[a]))
|--Index Seek(OBJECT:([T_clu].[T_clu_b]), SEEK:([tempdb].[dbo].[T_clu].[b]=(2)) ORDERED FORWARD)
|--Clustered Index Seek(OBJECT:([T_clu].[T_clu_a]), SEEK:([T_clu].[a]= [T_clu].[a]) LOOKUP ORDERED FORWARD)

  |--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1000]))
|--Index Seek(OBJECT:([T_heap].[T_heap_a]), SEEK:([T_heap].[a]=(2)) ORDERED FORWARD)
|--RID Lookup(OBJECT:([T_heap]), SEEK:([Bmk1000]=[Bmk1000]) LOOKUP ORDERED FORWARD)

You can tell that the clustered index seek is a bookmark lookup by the LOOKUP keyword in text showplan or by the attribute Lookup=“1” in XML showplan.  I will explain the behavior of the nested loops join in a future post.  The loop join and clustered index seek (or RID lookup) perform the same operation as the bookmark lookup in SQL Server 2000.

Tradeoffs

Bookmark lookup is not a cheap operation.  Assuming (as is commonly the case) that there is no correlation between the non-clustered and clustered index keys, each bookmark lookup performs a random I/O into the clustered index.  Random I/Os are very expensive.  When comparing various plan alternatives including scans, seeks, and seeks with bookmark lookups, the optimizer must decide whether it is cheaper to perform more sequential I/Os and touch more rows using an index scan or a seek with a less selective predicate that covers all required columns or to perform fewer random I/Os and touch fewer rows using a seek with a more selective predicate and a bookmark lookup.

In some cases, you can introduce a better plan option by creating a new index or by adding one or more columns to an existing index so as to eliminate a bookmark lookup or change a scan into a seek.  In SQL Server 2000, the only way to add columns to an index is to add additional key columns.  As I mentioned above, in SQL Server 2005, you can add also columns using the include clause of the create index statement.  Included columns are more efficient than key columns; they save disk space and make searching and updating the index more efficient.

Of course, whenever you create new indexes or add new key or included columns to an existing index, you do consume additional disk space and you make it more expensive to search and update the index.  Thus, you must balance the frequency and importance of the queries that benefit from the new index against the queries or updates that are slower.

To be continued again …

In my next post, I’ll go into a bit more detail about when we can perform index seeks and take a look at single vs. multi-column indexes.

Comments

  • Anonymous
    July 02, 2006
    PingBack from http://technote.thedeveloperside.com/?p=26

  • Anonymous
    July 13, 2006
    The optimizer must choose an appropriate “access path” to read data from each table referenced in a query. ...

  • Anonymous
    June 07, 2007
    In my last two posts, I discussed two scenarios - one involving updates and another involving large objects

  • Anonymous
    June 12, 2007
    Over the past month or so, I've looked at pretty much every isolation level except for read uncommitted

  • Anonymous
    June 26, 2007
    The comment has been removed

  • Anonymous
    July 03, 2007
    The comment has been removed

  • Anonymous
    January 18, 2008
    由于需要给同事培训数据库的索引知识,就收集整理了这个系列的博客。发表在这里,也是对索引知识的一个总结回顾吧。通过总结,我发现自己以前很多很模糊的概念都清晰了很多。 不论是 聚集索引,还是非聚集索引,都是用B

  • Anonymous
    May 06, 2008
    In this post from last week, I gave an example of a query with a conversion where the optimizer pushes

  • Anonymous
    July 13, 2008
    不论是聚集索引,还是非聚集索引,都是用B 树来实现的。我们在了解这两种索引之前,需要先了解B 树。如果你对B树不了解的话,建议参看以下几篇文章: BTree,B-Tree,B Tree,B*T...

  • Anonymous
    July 31, 2008
    SQLServer索引基础知识(2)----聚集索引,非聚集索引[来自]http://blog.joycode.com/ghj/archive/2008/01/02/113291.aspx...

  • Anonymous
    October 07, 2008
    In my last post , I explained the importance of asynchronous I/O and described how SQL Server uses sequential

  • Anonymous
    December 02, 2008
    PingBack from http://windy8848.yo2.cn/articles/sql-server-%e7%b4%a2%e5%bc%95%e5%9f%ba%e7%a1%80%e7%9f%a5%e8%af%862.html

  • Anonymous
    February 25, 2009
    In this post from last year, I discussed how random I/Os are slower than sequential I/Os (particularly

  • Anonymous
    March 16, 2009
    I have seen few users out there thinks that Deadlocks in SQL Server is a bug which has not be corrected

  • Anonymous
    March 13, 2012
    Craig, I just love your  blogs. Dude, you are good:)