Cheaply estimating the number of records in a table

Counting the number of records in a table can be extremely expensive, as it requires reading all the database pages off disk. It is possible to get a reasonably accurate estimate the number of records in a table by using the JetGetRecordPosition API. JetGetRecordPosition should be called when the cursor is on a record and it fills in a JET_RECPOS structure with two numbers, the estimated position of the record table (centriesLT) and the estimated number of records (centriesTotal). Together these two numbers form a fraction indicating how far into the table the record is.

We calculate centriesTotal and centriesLT by looking at the fanout of the pages in the b-tree that lead to the current record so the numbers produced by JetGetRecordPosition aren't terribly accurate unless all records in the b-tree are the same size and are packed to the same density. The estimate can be improved greatly by taking multiple samples from different places in the b-tree. We can easily go to different places in the b-tree with the JetGotoPosition API, which tries to position the cursor to a fractional location in the table. This has the same problems as JetGetRecordPosition (it isn't very accurate) but it is perfect for this purpose.

So, we can simply use JetGotoPosition to position the cursor at different locations in the table and then call JetGetRecordPosition to get an estimate of the number of records in the b-tree. Averaging out the numbers should give a reasonable estimate. The code looks like this:

JET_ERR ErrGetApproximateRecordCount(const JET_SESID sesid, const JET_TABLEID tableid, const int cSamples, int * const pcRecords)
{
 *pcRecords = 0;
 
 JET_ERR err;
 
 int cRecordsTotal = 0;
 for(int i = 0; i < cSamples; ++i)
 {
  JET_RECPOS recpos = {0};
  recpos.cbStruct = sizeof(recpos);
  recpos.centriesTotal = cSamples + 1;
  recpos.centriesLT = i+1;
  
  if (JET_errSuccess != (err = JetGotoPosition(sesid, tableid, &recpos)))
   return err;
   
  if (JET_errSuccess != (err = JetGetRecordPosition(sesid, tableid, &recpos, sizeof(recpos))))
   return err;
   
  cRecordsTotal += recpos.centriesTotal;   
 }
 
 *pcRecords = cRecordsTotal / cSamples;
 
 return JET_errSuccess;
}

To see how accurate this is I wrote a test program that inserted between 1 and 100,000 records into a table, each record having between 1 and 1024 bytes of data. Then I called ErrGetApproximateRecordCount with cSamples ranging from 1 to 64. I ran a total of 1,000 trials of the test program and graphed the results. It looks like somewhere around 15 samples gives a reasonable accurate number (< 5% error). Assuming a non-cached table the 15 samples will cost 1 or 2 page reads each so it should take about 1/3 of a second, which is very cheap compared to the cost of reading in the entire table.

 

 

getapproximaterecordcount.png

Comments