My last blog post was originally the start of this one, but I got so carried away talking about the different types of temp tables I split it off. Today's topic is when to create indexes on SQL temp tables-- before or after you add the data. Many people out there seem to hold the same opinion on the subject. I don't like taking other people's word and I also like doing things the hard way. Therefore I created a series of tests to see which way really was fastest.Most web sites out there seem to think if you are using a temp table or global temp table (or a regular table for that matter) with indexes on one or more columns and you are going to to load a large amount of data into the table, it is faster to add the data first and THEN create your indexes. If you create the indexes first, SQL Server has to rearrange the data over and over again as the data gets added. This causes lots of page splits and fragmentation. Asking SQL Server to create the indexes after the data has been loaded into the table allows it to sort the data into the index all at once. Sounds plausible enough, but is it true? Below are the tests I devised to find out. I ran them on SQL Server 2005 Express on my home PC (Windows XP 1.8 Ghz, 1 Gig RAM, single IDE Drive) Yeah, I know-- not really server specs. You'll get over it-- I did. I needed lots of data for my test-- and for good cardinality it needed to be random. I played with several methods of generating large amounts of random strings or numbers, but finally gave up in favor of the newid() SQL function. It returns a 36 character string which I stored in a varchar(200) column in a temp table. (TODO: Try other data types) I figured 3 million records should be a healthy data set. To be fair, I figured I should give it a go with both a clustered and nonclustered index. For the fun of it, I devised two different tests: One that dumped all 3 Mil records in my temp table all at once, and another that pushed them in in spurts of 10 at a time. (300,000 spurts to be exact). I set NOCOUNT on like so:
[code]SET NOCOUNT ON[/code]
I dropped and re-created my temp table before each test like so:
[code]DROP TABLE #temp
(test varchar(200))
My index creations looked like this:
[code]CREATE CLUSTERED INDEX foo ON #temp (test)
CREATE NONCLUSTERED INDEX foo ON #temp (test)[/code]
I created 3 Million records in my temp table like so:
[code]DECLARE @counter AS int
SET @counter = 1

WHILE @counter <= 300000

	INSERT INTO#temp (test) 
	SELECT newid()
	SET @counter = @counter + 1

I used the code above directly to populate my temp table a few records a time. For the tests where I inserted all 3 Million in a single statement, I created a second temp table which I populated before hand, and then transferred all 3 million at once like so:
[code]INSERT #temp
FROM #temp2[/code]
I know my naming conventions leave something to be desired, but with counseling you should be alright. I ran each test about 3 times and took the average to help weed out anomalies. The numbers in the "before" columns are the time it took to insert the records into a temp table with the indexes already created. The numbers in the "after" columns are the time to insert the records into a non-indexed table plus the time to create the index on the populated table. The "difference" column shows how many seconds the the "after" beat the "before". For those of you bored enough to still be with me, here is what I found:
  Spurts All-In-One
  Before After Difference Before After Difference
CLUSTERED 1:30 1:15 0:14 0:51 0:52 -0:01
NONCLUSTERED 2:45 1:16 1:28 1:27 0:53 0:34
  • On average it is faster to wait and create the indexes after populating the data, but results did vary.
  • Clustered indexes are faster than non clustered indexes if created prior to populating the data.
  • Clustered and Nonclustered indexes take about the same amount of time when created on a pre-populated table.
  • The biggest difference was seen when inserting data in spurts with a Nonclustered index.
  • There was virtually no difference when doing a mass insert with a Clustered index.
  • Inserting data in spurts was across-the-board slower than a single insert. While I believe that to be true, keep in mind the spurts columns include the time spent on the WHILE loop and newid() functions, while the All-In-One columns do not since I pre-loaded the data into a second temp table.
  • There are a number of variables I didn't cover that could affect the outcome like:
    • Speed of the server
    • Size of the data
    • Number of indexes
    • Cardinality of the data
    • Sorting the data prior to inserting
    • Index fill factor (to prevent page splits)
    • Other data types