SYSK 300: Why TABLESAMPLE Is Not Returning Specified Number of Rows
One of the new features in SQL 2005 is the clause that allows you to get a random set of rows. The syntax supports specifying either a percentage or a number of rows to return:
TABLESAMPLE [SYSTEM] (sample_number [ PERCENT | ROWS ] ) [ REPEATABLE (repeat_seed) ]
For example, using AdventureWorks sample database, you can issue queries like this:
SELECT * FROM Production.Product TABLESAMPLE SYSTEM (10 PERCENT)
SELECT * FROM Production.Product TABLESAMPLE SYSTEM (100 ROWS)
Interestingly, while the Production.Product table has 504 rows, the results of the two queries above when executed multiple times varies from returning as few as 55 rows to 175 rows, in my tests.
The reason for this is explained in SQL 2005 documentation:
“It [TABLESAMPLE command] generates a random value for each physical 8 KB page in the table. Based on the random value for a page and the percentage specified in the query, a page is included in the sample or excluded. Each page that is included returns all rows in the sample result set. For example, when specifying TABLESAMPLE SYSTEM 10 PERCENT, SQL Server returns all the rows on approximately 10 percent of the specified table's data pages. If the rows are evenly distributed on the pages of the table, and if there is a sufficient number of pages in the table, the number of rows returned should approximate the sample size requested. However, as the random value generated for each page is independent of the values generated for any other page, it is possible that a larger, or smaller, percentage of pages than requested are returned. The TOP(n) operator can be used to limit the number of rows to a given maximum.”
NOTE: by specifying a random number generator seed, you can assure that you get same resultset every time. To do that, use REPEATABLE(yourseed) clause:
SELECT * FROM Production.Product TABLESAMPLE SYSTEM (100 ROWS) REPEATABLE (83746)
Beware: this is not like the repeatable read isolation level. If another user makes changes to the data in the table, you will not get back the exact same rows. It is only true for a given "version" of the table.
Finally, another way of getting random data is by ordering the table using NEWID() as follows:
SELECT TOP(100) * FROM Production.Product ORDER BY NewID()
Special thanks to Saleem Hakani for this post!
Comments
Anonymous
March 05, 2007
Whoa! Could you send me a copy of SQL Server 2006 ;) but only if it's Enterprise Irena Version SP3 :PAnonymous
March 05, 2007
Typo... Corrected. Thanks for letting me know.