Do You Want to Learn SQL?

Take our free eight day course, where I'll teach you in simple to understand English all you need to get started learning SQL.

-
DAYS
-
HOURS
-
MINUTES
-
SECONDS

Free Beginners SQL Course!

Joins vs Subqueries SQL Puzzle

·

·

, ,

In this puzzle, we’re going to learn about joins vs subqueries.  In many queries you can substitute joins and subqueries.  Yet, since each has their strengths, it isn’t wise to do so.  Once you have worked through this puzzle you see some of the benefits and disadvantages of joins vs subqueries.

Solving puzzles is a great way to learn SQL.  Nothing beats practicing what you’ve learned.  Once you have figured out the puzzle, post you answer in the comments so we all can learn from one another.

SQL Puzzle Question – Joins vs Subqueries

The system admin would like a report of active users.  Can you write a report that returns the name, phone number and most recent date for any user that has logged in the 30 days since March14th, 2017?

(you can tell a user has logged in if the action field in UserHistory is set to “Logged On”).

User

  • userID
  • name
  • phoneNumber

UserHistory

  • userID
  • actionDate
  • action

Every time a user logs in a new row is inserted into the UserHistory table with userID, current date and action (where action = “Logged On”).

Question #1 – Write a SQL query without using Subqueries.

Question #2 – Write the SQL using subqueries.

As you answer the questions think about the differences of  joins vs subqueries.  Which technique is better suited to solve the problem?  Can you see advantages to either case?

To get you started, download this script which contains the table definitions and some test data.

Answer to Question #1 – INNER JOIN

Like most queries I write, I took this one step by step.  Let me show you what I mean by walking you through my thought process.

Regardless of how this query turned out, I knew I would have to figure out how to get the logins 30 days since 2017-03-14.  To do this I wrote a simple query that used the DATEDIFF function to calculate the number of days from the actionDate to the 14th.  If the days were less than or equal to 30, I knew the entry was within the window.

SELECT *,
       DATEDIFF(DAY, actionDate, '2017-03-14')
FROM   @UserHistory
WHERE  action = 'Logged On'
       AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30;

Read More: SQL Joins – The Ultimate Guide >>

DATEDIFF Coding Hack

You’ll see that I included the result of the DATEDIFF in query result.  I did this because I always get the parameter mixed up!  By displaying the result, I could verify that I had the formula correct; therefore, being used in my WHERE clause as expected.

Here are is what I saw:

DATEDIFF result

Now that I knew I could get actionDate within the window, I proceeded to join the UserHistory with User so I could include the PhoneNumber in my results. I joined the tables by userID.

SELECT U.Name,
       U.PhoneNumber,
       UH.actionDate AS RecentLogonDate
FROM   @User AS U
       INNER JOIN @UserHistory AS UH
       ON U.userID = UH.userID
WHERE  action = 'Logged On'
       AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30;

Expected Duplicates

Given there are several UserHistory entries per user, I would expect to see duplicates.  I’ve highlighted some of those below:

Duplicate Entries Due to Join

The last task which remains is to only show the most recent login date for each user.  For Bob, this would be 2017-03-02.

Notice, that for each user, the most recent date is the maxim date for that user.  This is a big hint, as it points us towards using aggregate functions such as MAX.

Joins versus Subqueries - GROUBYBY hint

So, at this point we can use GROUP BY to calculation the maximum date for each user.

SELECT   U.Name,
         U.PhoneNumber,
         MAX(UH.actionDate) AS RecentLogonDate
FROM     @User AS U
         INNER JOIN @UserHistory AS UH
         ON U.userID = UH.userID
WHERE    action = 'Logged On'
         AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30
GROUP BY U.Name, U.phoneNumber;

Whose result is

 Final INNER JOIN Result

Key Take Aways

There are a couple of takeaways from this:

  • First, here is no shame is doing a query to get intermediate results. When using formulas, such as DATEDIFF, make sure you’re getting the calculation you expect.
  • Second, it can take more than one technique to solve a problem. In our case we had first join the items, and once we had the result use GROUP BY to get the most recent date.
  • Last, keep it simple! At first I thought I would have to do another join so I could compare dates, but I realized that was going to get complicated.  Also, I figured I could partition the data and use a window function, but that seemed like overkill.  Once I realized I simply had to calculate the MAX, I realized GROUP BY would work.

I’m sure there are other ways to solve this puzzle.  So, what answer did you get?   Please share in the comments.

Answer to Question #2 – Subquery

Let’s look at the second piece of the puzzle when it comes to joins vs subqueries; the subquery.

This query is elegant in that IN operator naturally remove duplicates, but gets ugly as another sub query is needed to pull in the most recent date.

What you’ll see is that subqueries are really great when you need to return rows from one table based on the existence of one or more conditions in another, but not so swift at combining data from one table with another.

Read More: SQL Subqueries – The Ultimate Guide >>

To show you what I mean, check out this query which shows the user and phone number for an actionDate within the windows:

SELECT U.Name,
       U.PhoneNumber
FROM   @User AS U
WHERE  U.UserID IN (SELECT UserID
                    FROM   @UserHistory
                    WHERE  action = 'Logged On'
                    AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30);

Natural Uniqueness via Subquery

In this case the we naturally get a unique list of name and phone numbers as we are listing users.

The uniqueness is guaranteed by the fact the table naturally contains unique value (yes, I know I didn’t define a Primary Key on the table, but hopefully you see it could be one).

The subquery in the WHERE clause compiles a list of userID’s whose actionDate falls within the 30-day window.

Joins versus Subqueries - Subquery but no ActionDate

Of course, our query is only partially written, as it doesn’t display the latest actionDate.

If I could dream  a bit (the dreamy bits are in blue) I would love to be able to write a query like:

SELECT U.Name,
       U.PhoneNumber,
       (SELECT Max(UH.actionDate)
        FROM   @UserHistory AS UH
        WHERE  action = 'Logged On'
        AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30
AND    U.userID = UH.userID) AS RecentLogonDate
FROM   @User AS U
WHERE  DATEDIFF(DAY, RecentLogonDate, '2017-03-14') <= 30
       AND RecentLogonDate IS NOT NULL

But this isn’t a properly written query, and SQL gets mad!

Msg 207, Level 16, State 1, Line 62
Invalid column name 'RecentLogonDate'.
Msg 207, Level 16, State 1, Line 63
Invalid column name 'RecentLogonDate'.

Forced To Use Two Subqueries!

What makes our subquery ugly is being forced to use two subqueries to retrieve the most recent actionDate.  To do this you must essentially repeat the query, but now as a correlated sub query.  The correlation is show in red.  It just the match of the User table UserID to that in UserHistory.

SELECT U.Name,
       U.PhoneNumber,
       (SELECT Max(UH.actionDate)
        FROM   @UserHistory AS UH
        WHERE  action = 'Logged On'
               AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30
               AND U.userID = UH.userID) AS RecentLogonDate
FROM   @User AS U
WHERE  U.UserID IN (SELECT UserID
                    FROM   @UserHistory
                    WHERE  action = 'Logged On'
                    AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30);

Conclusion

Now that you’ve seen the differences of joins vs subqueries I’m hoping you have a better understanding on when to use one versus the other.  To summarize,  joins excel at combining data from two tables, subqueries are best when testing for the existence of a value from one table found in another.

What are your experiences with joins versus subqueries?  Which do you prefer to use when?  Please share in the comments!

Related Articles

More from the blog


MySQL PostgreSQL SQLite SQL Server