Index creation performance issues
In this post: Index creation performance question I asked: Is it faster to fill a table with values, then index it or create the index on an empty table, then fill it? Why?
After a record in a cursor is modified, any associated index must be updated. If there is already an index, then after each record added, the index needs to be checked.
If the index is created on the filled table, then a single, specialized optimized index creation routine can be invoked within VFP to create the index.
For example, I profiled the code for creating 10000 records. If the special routine is invoked to create the index on these records,
If the index is created beforehand, then the routine to add a node (let’s call it AddNode) to the index is called 10001 times (the extra 1 time is for the index itself: multiple indices can be stored in an index: see What is an index anyway?)
If the index is created after, then the special routine is called, which reduces the number of calls to AddNode to 312!
On my machines, the index after scenario is about 3-5 times faster than the index before! What results do you get?
Next question: what effect will there be if you use random data rather than a constant string? You may be surprised by the results!
(Another question: Why does this code not use SYS(2015) - Unique Procedure Name to get random values?)
SET EXCLUSIVE ON
SET SAFETY OFF
num=100000
?"Create Index before = ",TestIt(.t.,num)
?"Create Index after = ",TestIt(.f.,num)
USE
n=ADIR(aa,"test.cdx")
?"Index size = ",aa[1,2]
PROCEDURE TestIt(fIndexBefore as Boolean, num as Integer)
ns=SECONDS()
CREATE table test (name c(100))
IF fIndexBefore
INDEX on name TAG name
ENDIF
FOR i = 1 TO num
IF .t.
INSERT INTO test VALUES (MakeRandString(7))
ELSE
INSERT INTO test VALUES ("testing")
ENDIF
ENDFOR
IF !fIndexBefore
INDEX on name TAG name
ENDIF
RETURN SECONDS()-ns
PROCEDURE MakeRandString(nLen as Integer)
LOCAL i,cStr
cStr=""
FOR i = 1 TO nLen
cStr=cStr+CHR(RAND()*26+65)
ENDFOR
RETURN cStr
Comments
- Anonymous
February 01, 2006
Interesting. Thanks for the explanation.
As for the SYS(2015) question: each call to SYS(2015) produces a result that is distinct in the right-most character(s), the first character is always the same, and the generated values are all consecutive. In a quick test where SYS(2015) was called 10000 times, the first 7 characters of all 10000 values I got back were identical. I think that alone makes SYS(2015) a poor choice for a real-world test. However, there were a couple other differences. The SYS(2015) values can be indexed more efficiently than a more random set of values, making the timing results less applicable to a real-world scenario. I also noticed that the CDX size of the table stuffed with SYS(2015) values was considerably smaller than the CDX for a more random set of values of the same length. - Anonymous
March 09, 2006
On machine - P4 2.4 with 512Mb Ram
i got 2.957 and 1.996 - 1.48 times faster - Anonymous
December 01, 2006
Does this apply to views as well? I can't seem to see a performance difference.