Editorials

Query Comparison

From a conversation we were having in the comments from last week I was comparing the query plan produced from using different ways to retrieve the same results from the database. Just for fun, I used a new SQL Server 2014 instance for my test, comparing two system tables, sys.sysobjects and sys.indexes. I wanted to find a list of all the records in sys.sysobjects having a record in sys.indexes. In addition, I wanted the SEMI JOIN results from sys.sysobjects. in other words, I wanted to return all of the records in sys.sysobjects containing one or more records in sys.indexes. However, I only wanted a single instance of the records from sys.sysobjects, regardless of how many records were found in sys.indexes.

I used four different queries to gather the results.

SELECT so.id

FROM sys.sysobjects so

WHERE EXISTS (

SELECT 1

FROM sys.indexes si

WHERE so.id = si.object_Id)

 

SELECT so.id

FROM sys.sysobjects so

WHERE so.id IN (

SELECT si.object_Id

FROM sys.indexes si)

SELECT DISTINCT so.id

FROM sys.sysobjects so

INNER JOIN sys.indexes si

ON so.id = si.object_Id

 

SELECT so.id

FROM sys.sysobjects so

INNER JOIN sys.indexes si

ON so.id = si.object_Id

GROUP BY so.id

The query plan for these four queries was interesting to me. The first two queries that filtered the results based on a sub-select method with either EXISTS or IN resulted in the exact same query plan. Both used MERGE joins for gathering the results, which, since the data was found in the same sort order using the clustered indexes on each table was the fastest selection possible. I found this an interesting improvement in the query engine. The IN operation used to perform less efficiently than other methods. In this case the query optimizer handled either syntax with great success. Note to self…don’t trust historic performance ideas from one SQL Server release to the next. I wonder what new performance improvements are in SQL Server 2017.

The last two queries also produced the exact same query plan. I was not surprised in this case, because DISTINCT is ultimately a simplified form of a GROUP BY clause. If you do a GROUP BY without using any aggregate formulas in the SELECT clause, it behaves much the same as using DISTINCT. I used the DISTINCT and GROUP BY implementation to return only one instance for a row from sys.sysobjects if it was found in the second table.

The difference between the two different types of queries was difficult to compare because of such small data sets. However, the query plan did state that the first two queries were almost twice as fast for execution resources. That’s probably do to using MERGE joins instead of LOOP joins.

One last thing to note, if this query was using a source table that did not have unique records, as is the case in sys.sysobjects, you would not get the correct results for a SEMI JOIN using the DISTINCT or GROUP BY method. You may be confused with that statement, so let me try and explain. If the data in the table you wish to retrieve records from has multiple rows with the same values, and it finds any row in the joined table, it should return all of the rows from the first table. If you use DISTINCT or GROUP BY, you will only return one record for the multiple rows, thus not really returning a SEMI JOIN result set.

For more information on how this is expressed in your environment, take the queries above and run them against your own master database. Then take a look at the query plans produced for comparison.

Cheers,

Ben

Facebooktwittergoogle_plusredditpinterestlinkedinmail
  • Maurice Pelchat

    The IN clause can always be easily rewritten as an exists clause. The optimizer generally produce the same for both. BTW I see often many authors using the old compatibility view when dealing with metadata. It seems to me that the design of the new ones is better (example: sys.objects vs sys.sysobjects).