My coworker found an interesting little SQL Server behavior today that was quite startling at first glance. It's like punching 2 + 2 into your calculator and having the result come back as 5. After pinpointing the cause of the behavior, it ALMOST makes sense... except for the fact that it doesn't make sense at all. It involves a randomized sub-select and table variables.

This is the part of the post where I pull out the simplified example that is rather contrived, but shows the point quite clearly. Consider the following code:
[code]DECLARE @test TABLE
    (testID int)

INSERT INTO @test (testID)
SELECT 1
UNION SELECT 2 UNION SELECT 3 UNION SELECT 4
UNION SELECT 5 UNION SELECT 6 UNION SELECT 7
UNION SELECT 8 UNION SELECT 9 UNION SELECT 10

SELECT testID
FROM @test
WHERE testID IN ( SELECT TOP 5 testID
                  FROM @test
                  ORDER BY NEWID() )
[/code]
Ok, nothing too crazy going on here. Declare a table variable, throw 10 records in it, and select out the five that are returned by our randomized TOP 5 sub-select. Right? Wrong! Running this repeatedly returns a DIFFERENT number of records each time, anywhere between 1 and 10. If we run the sub-select by itself, it always returns 5 records, yet the main select seems to return whatever it feels like.

Well, what's going on here? The answer is found in the execution plan; here are the details for the circled portions. Notice how the table scan for the sub-select was processed 10 times: that's once for every record in the main select. It returned a total of 100 rows. The sort operation also ran 10 times, and since the rows were randomly ordered each time, it chose a DIFFERENT top 5 every time. Of course there is some overlap between those picks, so the actual number of unique rows that match varies.

In case it isn't obvious yet, I'll point out that this behavior is also very bad for performance. Imagine if my table variable had 1,000 records in it: that table scan operation would have returned 1,000,000 rows! So, what's the fix? One way is to convert your table variable to a temp table like so:
[code]CREATE TABLE #test
    (testID int)

INSERT INTO #test (testID)
SELECT 1
UNION SELECT 2 UNION SELECT 3 UNION SELECT 4
UNION SELECT 5 UNION SELECT 6 UNION SELECT 7
UNION SELECT 8 UNION SELECT 9 UNION SELECT 10

SELECT testID
FROM #test
WHERE testID IN ( SELECT TOP 5 testID
                  FROM #test
                  ORDER BY NEWID() )

-- Drop the temp table so the script can be run again in the same session
DROP TABLE #test
[/code]
That code will always return no more than 5 random results at a time. Let's take a peek at its execution plan. As you can see, it is very similar, but the lower-right table scan now runs only ONCE, as do the Compute Scalar (which calculates the NEWID) and the sort. The repeated work is done by a table spool instead, which runs 10 times against the SAME 5 records. If the column in question is unique, my second solution is to keep the table variable but simply make the column used in the join a PRIMARY KEY.
[code]DECLARE @test TABLE
    (testID int PRIMARY KEY)

INSERT INTO @test (testID)
SELECT 1
UNION SELECT 2 UNION SELECT 3 UNION SELECT 4
UNION SELECT 5 UNION SELECT 6 UNION SELECT 7
UNION SELECT 8 UNION SELECT 9 UNION SELECT 10

SELECT testID
FROM @test
WHERE testID IN ( SELECT TOP 5 testID
                  FROM @test
                  ORDER BY NEWID() )
[/code]
Notice that our table scan has gone away (it's about time) and has been replaced with a sleek Clustered Index Scan that runs only ONE time. In fact, the Compute Scalar and Sort run only once as well, returning just the 5 records we expected all along. This is one of those things I can't quite explain. I can look at the plans and see what SQL Server is doing, but I'm not sure why, or whether it should be classified as a bug. I have run these tests on SQL Server 2000 and 2008 with the same results. I have also Googled quite a bit to find an explanation of the behavior, but I can't find one. Perhaps if you know why, you can chime in. For now, be very careful when using table variables in a sub-select.
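If neither fix suits you (say the column isn't unique, so a PRIMARY KEY is out, and you'd rather not switch to a temp table), a sketch of one more way to sidestep the problem is to materialize the random TOP 5 into its own table variable first. The @sample variable below is my own naming, not something from the tests above, but the idea is simply that the NEWID() sort happens in a standalone INSERT, so it can only ever run once:

[code]DECLARE @test TABLE   (testID int)
DECLARE @sample TABLE (testID int)

INSERT INTO @test (testID)
SELECT 1
UNION SELECT 2 UNION SELECT 3 UNION SELECT 4
UNION SELECT 5 UNION SELECT 6 UNION SELECT 7
UNION SELECT 8 UNION SELECT 9 UNION SELECT 10

-- Pick the random 5 exactly once, up front
INSERT INTO @sample (testID)
SELECT TOP 5 testID
FROM @test
ORDER BY NEWID()

-- The outer query now joins a fixed set; there is no
-- correlated sub-select left for the engine to re-evaluate
SELECT t.testID
FROM @test t
INNER JOIN @sample s ON s.testID = t.testID
[/code]

Because the sample is chosen before the outer SELECT runs, there is no randomized sub-select for the plan to re-execute per row, and the join can only ever match those 5 materialized records.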