MySQL performance: INNER JOIN vs. sub-select

Posted by Brad Wood

Aug 01, 2008 04:51:00 UTC

I ran into an interesting behavior with MySQL this week. I was helping someone speed up a slow page and a sizable increase in performance was achieved by simply re-arranging the SQL statement. The page was calling a SQL statement inside of a loop-- probably around 150 times on a page load. My initial idea (and still the best long-term one I think) was to gather all the information from the database in a single cfquery and not hit the database over and over. That approach would have required quite a lot more refactoring of code, so first we tried to squeeze some better performance out of the SQL already being called.This site used MySQL which I am not as familiar with compared to MSSQL, but I started by looking at an EXPLAIN statement. In MySQL the placing the keyword EXPLAIN before your select will cause the engine to output a result set which describes how the select was performed.

[code]EXPLAIN SELECT *
FROM my_table
WHERE column = 'foo'[/code]

I will start by saying the amount of information MySQL (community edition) gives you when compared to a much more robust (and expensive) DBMS is much smaller. Perhaps the Enterprise edition of MySQL is better, but there's a REASON you pay for the Enterprise stuff. I LOVE how MS Enterprise Manager graphically maps out the entire query. The select in question was basically this:

[code]SELECT count(id) AS totalcount
FROM table1
WHERE fkid IN (SELECT id FROM table2 WHERE fkid2 = 100)[/code]

You can see it used a sub-select on table2 in the where clause to limit the records from table1. Table1 had a few hundred thousand records and table2 around twenty thousand. It began performing about 30 times faster when re-wrote it like so:

[code]SELECT count(table1.id) AS totalcount
FROM table1
INNER JOIN table2 e ON table1.fkid = table2.id 
	AND table2.fkid2 = 100;[/code]

All I did was move the sub-select into an inner join. Let's take a quick look at the execution plans: Version 1 (slower):

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	PRIMARY	table1	index	{null}	fkid	4	{null}	249844	Using where; Using index
2	DEPENDENT SUBQUERY	table2	unique_subquery	PRIMARY,fkid2	PRIMARY	4	func	1	Using where

Version 2 (faster):

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	table2	ref	PRIMARY,fk1d2	fk1d2	4	const	50	Using index
1	SIMPLE	table1	ref	fkid1	fkid1	4	table2.id	4	Using index

I'm not going to get too deep into what all that means. If you want a full explanation, check out the MySQL reference here: http://dev.mysql.com/doc/refman/5.0/en/using-explain.html There's quite a bit of information there. The main difference I noticed from the plan was that the first method figured it would go through the entire table1 (about 250,000 records). The second select only expected to hit about 50 records in table1. What's a bit more interesting is that I have tried this setup both ways on MSSQL and there never seemed to be much of a difference between the two. It seems that Microsoft's query optimizer works a little differently. I don't have access to MSSQL right now or I would compare the plan there as well. Overall, simply by re-arranging that query a bit we were able to get a 20 second page load down to well under a second. I love improving performance. A fast web site is a happy web site.

Tags: Performance, SQL

13 Comments

Barney
Aug 01, 2008 09:32:20 UTC

Correlated subselects on MySQL are almost always slower than joins, though the difference in a lot of cases (small row sets, for example) it's completely irrelevant. The readability of subselects is often enormously better, however. As such, I usually default to subqueries whenever possible, and go back and refactor to joins only if needed.

And your completely right about MSSQL, it's query optimizer is enormously better that MySQL's, at least when it comes to subselects.

jeff
Aug 01, 2008 16:00:03 UTC

Yeah i figured this out a couple of weeks ago from a bug post on mySQL's website.

http://www.digitalrewind.com/blog/2008/06/11/MYSQL-Slow-Subquery-using-IN

It's amazing how much it speeds up MYSQL.

Brad Wood
Aug 01, 2008 16:43:28 UTC

@Jeff: Thanks for the link.

@Barney: Thanks for the insight. I've always found an inner join to be easier to read, but it guess it's all a matter of personal preference.

Brian Kotek
Aug 01, 2008 17:58:49 UTC

Another option is to use EXISTS where possible, as it will also usually be much faster than a correlated subquery.

Another option is using can be using an inline view, meaning a select statement that is aliased and used in joins within the FROM clause (Google "inline view" for much more). All major DBs support this and they can be very useful for performance as well as getting data in one query that would normally require multiple queries.

Brian Kotek
Aug 01, 2008 17:59:46 UTC

Whoops sorry for the bad grammar there. :-/

Barney
Aug 01, 2008 18:06:01 UTC

Inline views actually don't seem to be as performant as they should on MySQL. We were doing some largish (couple million rows) reporting queries and using inline views and they were slow. Switching to explicitly temporary tables made a HUGE difference. Unfortunately MySQL doesn't support the WITH clause like MSSQL and Oracle do, which is basically equivalent.

Brad Wood
Aug 01, 2008 18:41:54 UTC

@Barney: When you say the "with clause" are you talking abut Common Table Expressions? I don't know if I care for them from a readability stand point always, but they can make for some really useful code. Recursion is one really nice use of CTE's.

Heck, I'd be happy if MySQL supported table variables though.

@Brian: Check out this MySQL bug report: http://bugs.mysql.com/bug.php?id=4040 The comments down near the bottom explain that MySQL 4.1 "rewrites IN via EXISTS like subquery " and in MySQL 5 they "planned make hash/temporary table optimization for such queries"

Raymond Camden
Aug 02, 2008 21:12:53 UTC

I hope Brad doesn't mind, but I was the person he helped. The site was coldfusionbloggers.org and the feeds tab. The improvement was incredible.

ike
Aug 27, 2008 19:58:06 UTC

That's a question I'd wondered about for a long time, but never got around to actually testing... I'd always suspected that a subquery was likely to be slower than an equivalent join clause, though I expected that to be true irrespective of the database platform. But I guess it really does depend on the engine used to create the execution plan.

Andrew
Jul 28, 2009 16:37:54 UTC

Ok, so I've been wondering, what if I have my sub-select on the select part of the query? That would be something like this:

SELECT name, (SELECT something FROM aTable WHERE person.id = aTable.personID) as Something FROM person WHERE person.id = @theID

wich would perform better? or is it a bad practice to do what I'm doing? if so, then the 'proper' way would be to use a Join instead?

Brad Wood
Aug 01, 2009 06:37:06 UTC

@Andrew: I've always assumed it would not perform well to have a sub select in the main select clause, but I've never actually tested it.
I have often been suprised by MS SQL's ability to come up with a good execution plan even with weird SQL.
That being said, I have rarely had the need to do that. Usually a join will do the trick nicely.

MR
Aug 10, 2009 11:45:46 UTC

Thanks very much!

p
Jan 17, 2010 01:08:00 UTC

Thanks a lot for posting this. I had a simple web page that took up to 1,5 seconds to query a MySQL database of just a few hundred rows, using two subselects. Implementing the change you suggest (which I didn't previously realize was possible) made an enormous difference: it now always loads in less than 0,1 seconds.

MySQL performance: INNER JOIN vs. sub-select