Working with SQL queries can be incredibly tricky, especially if you’re new to the language. Even experienced SQL users can be stumped by complex queries. However, with a few tips and tricks, you can quickly become an SQL expert.
In this step-by-step guide, you’ll learn how to solve tricky SQL queries and become an SQL wizard.
You’ll learn the best practices for crafting optimized queries, how to debug queries that don’t work, and how to use common features like joins and subqueries.
With these tools in hand, you’ll be able to tackle any SQL query with confidence. Whether you’re a beginner or a seasoned SQL coder, this guide will help you become a master of SQL queries.
Table of Contents
1. Classes More Than 5 Students @ LeetCode
There is a table
courses with columns: student and class. Please list out all classes which have more than or equal to 5 students. For example, the table:
| student | class |
| A | Math |
| B | English |
| C | Math |
| D | Biology |
| E | Math |
| F | Computer |
| G | Math |
| H | Math |
| I | Math |
| class |
| Math |
The students should not be counted duplicate in each course.
Walking Through My Thinking
Even for absolute beginners, this is an easy question: the underlying idea is to filter classes with 5 or more students and do something like this:
GROUP BY class
You will bump into an error message if you submit the above code at LeetCode.
The tricky part is that there may be duplicates, and the last line of code, the HAVING COUNT() statement, fails to consider such a scenario. Instead, we should use the DISTINCT method inside the chunk to specify the output.
In fact, the question has an additional note that specifies what to do with duplicates.
A piece of advice is to ask for clarification questions if you are doing a SQL interview for DS positions. Smooth communication in any interview goes a long way.
# Write your MySQL query statement below SELECT class FROM courses GROUP BY class HAVING COUNT(DISTINCT student)>=5
We are given a table consisting of two columns, Name, and Profession. We need to query all the names immediately followed by the first letter in the profession column enclosed in parenthesis.
SELECT CONCAT(Name, ’(‘, SUBSTR(Profession, 1, 1), ’)’) FROM table;
Since we need to combine the name and profession we can use
CONCAT. We also need to have only one letter inside the parenthesis. Hence we will use
SUBSTRand pass the column name, start index, end index. Since we need only the first letter we will pass 1,1(start index is inclusive and the end index is not inclusive)
3. Game Play Analysis II @ LeetCode
| Column Name | Type |
| player_id | int |
| device_id | int |
| event_date | date |
| games_played | int |
(player_id, event_date) is the primary key of this table.
This table shows the activity of players of some game.
Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on some day using some device.
Write a SQL query that reports the device that is first logged in for each player.
The query result format is in the following example:
| player_id | device_id | event_date | games_played |
| 1 | 2 | 2016-03-01 | 5 |
| 1 | 2 | 2016-05-02 | 6 |
| 2 | 3 | 2017-06-25 | 1 |
| 3 | 1 | 2016-03-02 | 0 |
| 3 | 4 | 2018-07-03 | 5 |
| player_id | device_id |
| 1 | 2 |
| 2 | 3 |
| 3 | 1 |
Walking Through My Thinking
OMG, this question appears to be so easy but deceivingly difficult.
Here is why.
The result table contains two columns: player_id and device_id. We can use a WITH() clause to filter out the first log-in date, which can be obtained using MIN() command for each player and then write an outer query to find these two columns.
In code, we can do something as follow:
WITH first_log AS (
SELECT player_id, device_id, MIN(event_date)
GROUP BY player_id
)SELECT player_id, device_id
Again, correct result if run it, but an error message if submit. Try it here: LeetCode.
Before reading on, I’d suggest a short break and take a look at the code:
why doesn’t it work?
The answer is each player has his/her own first log-in date, and we have to filter cases using these two columns together: player_id and event_date. In other words, each player has his/her unique minimal date.
# Write your MySQL query statement below SELECT player_id, device_id
WHERE (player_id, event_date) IN (
select player_id, MIN(event_date)
group by player_id
Note: don’t forget to include a pair of parenthesis.
Tina was asked to compute the average salary of all employees from the EMPLOYEES table she created but realized that the zero key in her keyboard is not working after the result showed a very less average. She wants our help in finding out the difference between miscalculated average and actual average.
We must write a query finding the error( Actual AVG — Calculated AVG).
SELECT AVG(Salary) - AVG(REPLACE(Salary, 0, ’’)) FROM table;
A point to note here is that we have only one table that consists of actual salary values. To create the error scenario we use
REPLACE to replace 0’s. We will pass the column name, value to be replaced, and the value with which we will replace the
REPLACE method. Then we find the difference in averages using the aggregate function
We are given a table, which is a Binary Search Tree consisting of two columns Node and Parent. We must write a query that returns the node type ordered by the value of nodes in ascending order. There are 3 types.
- Root — if the node is a root
- Leaf — if the node is a leaf
- Inner — if the node is neither root nor leaf.
Upon initial analysis, we can conclude that if a given node N has its corresponding P-value as NULL it is the root. And for a given Node N if it exists in the P column it is not an inner node. Based on this idea let us write a query.
SELECT CASE WHEN P IS NULL THEN CONCAT(N, ' Root') WHEN N IN (SELECT DISTINCT P from BST) THEN CONCAT(N, ' Inner') ELSE CONCAT(N, ' Leaf') END FROM BST ORDER BY N asc;
We can use
CASE which acts as a switch function. As I mentioned if P is null for a given node N then N is the root. Hence we used
CONCAT for combining the node value and label. Similarly, if a given node N is in column P it is an inner node. To get all nodes from column P we wrote a subquery which returns all the distinct nodes in column P. Since we were asked to order the output by node values in ascending order we used the
ORDER BY Clause.
We are given a transaction table that consists of transaction_id, user_id, transaction_date, product_id, and quantity. We need to query the number of users who purchased products on multiple days(Note that a given user can purchase multiple products on a single day).
To solve this query, we cannot directly count the occurrence of user_id’s and if it is more than one return that user_id because a given user can have more than one transaction on a single day. Hence if a given user_id has more than one distinct date associated with it means he purchased products on multiple days. Following the same approach, I wrote a query. (Inner query)
SELECT COUNT(user_id) FROM ( SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(DISTINCT DATE(date)) > 1 ) t1
Since the question asked for the number of user_ids and not the user_id’s itself we use
COUNT in the outer query.
We are given a subscription table which consists of subscription start and end date for each user. We need to write a query that returns true/false for each user based on the overlapping of dates with other users. For instance, If user1’s subscription period overlaps with any other user the query must return True for user1.
Upon initial analysis, we understand that we must compare every subscription against every other one. Let us consider start and end dates of userA as
endA, similarly for userB,
startB then we can say the two date ranges overlap. Let us take two examples. Let us compare U1 AND U3 first.
startA = 2020–01–01
endA = 2020–01–31
startB = 2020–01–16
endB = 2020–01–26
Here we can see
startA(2020–01–01) is less than
endB(2020–01–26) and similarly, endA(2020–01–31) is greater than
startB(2020–01–16) and hence can conclude that the dates overlap. Similarly, if you compare U1 and U4 the above condition fails and will return false.
We must also ensure that a user is not compared to his own subscription. We also want to run a left join on itself to match a user with each other user that satisfies our condition. We will create two replicas s1 and s2 of the same table now.
SELECT * FROM subscriptions AS s1 LEFT JOIN subscriptions AS s2 ON s1.user_id != s2.user_id AND s1.start_date <= s2.end_date AND s1.end_date >= s2.start_date
Given the conditional join, a user_id from s2 should exist for each user_id in s1 on the condition where there exists an overlap between the dates.
We can see there exists another user for each user in case the dates overlap. For user1 there are 2 rows indicating that he matches with 2 users. For user 4 the corresponding id is null indicating that he does not match with any other user.
Wrapping it all together now, we can group by the s1.user_id field and just check if any value exists true for a user where s2.user_id IS NOT NULL.
SELECT s1.user_id , (CASE WHEN s2.user_id IS NOT NULL THEN 1 ELSE 0 END) AS overlap FROM subscriptions AS s1 LEFT JOIN subscriptions AS s2 ON s1.user_id != s2.user_id AND s1.start_date <= s2.end_date AND s1.end_date >= s2.start_date GROUP BY s1.user_id
We used the
CASE clause to label 1 and 0 depending on the s2.user_id value for a given user. The final output looks like this –
8. Sellers With No Sales @ LeetCode
| Column Name | Type |
| customer_id | int |
| customer_name | varchar |
customer_id is the primary key for this table.
Each row of this table contains the information of each customer in the WebStore.
| Column Name | Type |
| order_id | int |
| sale_date | date |
| order_cost | int |
| customer_id | int |
| seller_id | int |
order_id is the primary key for this table.
Each row of this table contains all orders made in the webstore.
sale_date is the date when the transaction was made between the customer (customer_id) and the seller (seller_id).
| Column Name | Type |
| seller_id | int |
| seller_name | varchar |
seller_id is the primary key for this table.
Each row of this table contains the information of each seller.
Write an SQL query to report the names of all sellers who did not make any sales in 2020. Return the result table ordered by
seller_name in ascending order. The query result format is in the following example.
| customer_id | customer_name |
| 101 | Alice |
| 102 | Bob |
| 103 | Charlie |
| order_id | sale_date | order_cost | customer_id | seller_id |
| 1 | 2020-03-01 | 1500 | 101 | 1 |
| 2 | 2020-05-25 | 2400 | 102 | 2 |
| 3 | 2019-05-25 | 800 | 101 | 3 |
| 4 | 2020-09-13 | 1000 | 103 | 2 |
| 5 | 2019-02-11 | 700 | 101 | 2 |
| seller_id | seller_name |
| 1 | Daniel |
| 2 | Elizabeth |
| 3 | Frank |
| seller_name |
| Frank |
Daniel made 1 sale in March 2020.
Elizabeth made 2 sales in 2020 and 1 sale in 2019.
Frank made 1 sale in 2019 but no sales in 2020.
Walking Through My Thinking
First off, this is a rather simple question. I mean it this time. What turns me down and justifies its inclusion here is its lengthy description.
What to do with these three tables?
Where should I start?
There is too much information, and too much information is no better than no information.
My coping strategy is to break the question down into pieces and ask two big questions:
What does the question ask for?
What information is available so far?
The first step would be to take a look at the above Result table, if provided, and, if not, ask the interviewer what would be like if things work out as expected. Again, good communication is King!
Then, let’s check what available information. As for this query, the Result table only contains one column (seller_name), and it only appears in the Orders table, which also contains sale_date and order_id information. Either of these two additional variables would be sufficient to identify the year of each order.
Now, we have a good understanding of the question and have collected all the necessary information. There is no need to check the first table (Customer).
The final step is to write up the query sequence.
The question asks for sellers with no sales in 2020. If we use a WHERE clause to filter the rows for sales not in 2020, then we would eliminate cases that have sales both in 2020 and beyond, like in Elizabeth‘s case.
The following code does not work:
WHERE LEFT(sale_date,4) != 2020
It outputs two cases: Frank (no sale in 2020) and Elizabeth (sales both in 2020 and 2019).
For a “no-sales-in-2020” question, we can find out the seller names that have sales in 2020, as a first step, and then use an outer query to filter them out in the following step.
WHERE seller_id NOT IN (
SELECT DISTINCT seller_id
WHERE LEFT(sale_date, 4) = ‘2020’
ORDER BY seller_name
9. Second Highest Salary @ LeetCode
Write a SQL query to get the second highest salary from the
| Id | Salary |
| 1 | 100 |
| 2 | 200 |
| 3 | 300 |
For example, given the above Employee table, the query should return
200 as the second highest salary. If there is no second highest salary, then the query should return
| SecondHighestSalary |
| 200 |
Walking Through My Thinking
Actually, this is a straightforward question. The reason for its inclusion is because I want to introduce two methods: IFNULL() and OFFSET.
The question asks for the second highest salary, which reminds us of using an OFFSET method to skip the first row. Also, it returns null if there is no such value. In so doing, we can use IFNULL() function.
As a side note, the function has the following form: IFNULL(statement, NULL). If there is no null value, run the statement as usual; if there is a null value, return NULL or any other value as specified.
SELECT IFNULL( (SELECT DISTINCT Salary FROM Employee ORDER BY Salary DESC LIMIT 1 OFFSET 1), null) AS SecondHighestSalary
We are given two tables.
Accounts table — Each row of this table contains the account number and name of each user in the bank.
Transactions table — Each row of this table contains the transactions of all accounts. Amount is positive if the user received money and negative if they transferred money. All the accounts start with an initial balance of zero.
We need to write an SQL query to return the name and balance of all users having a balance greater than 10,000.
Balance = Total received money-Total transferred money
Firstly we need to join the two tables. Since we need to find the balance of each account we can use the
GROUP BY clause on the
Account IDcolumn. We can find the sum of amount column grouped by
Account ID and filter rows having sum>10000.
a.name AS name,
sum(b.amount) AS balance
FROM Users AS a JOIN Transactions AS b
on a.account = b.account
GROUP BY a.account
HAVING balance > 10000;
Given a table of students and their GRE test scores, write a query to return the two students with the closest test scores and their score difference. If there exists more than one pair, sort their names in ascending order and then return the first resulting pair.
This requires some creative thinking in SQL. Since there is only one table with two columns we need to self-reference different creations of the same table. We can solve these kinds of problems by visualizing two tables having the same values.
Let us have two same copies of the above scores table named s1 and s2. Since we need to compare every student with every other student. We can perform inner join by setting:
scores AS s1 INNER JOIN scores AS s2
ON s1.ID != s2.ID
This way we are comparing every student with every other student and are also avoiding comparing a student to his/herself. However, if we run the final query using the above join condition, observe the output.
Since we are comparing every student in s1 with every other student in s2, duplication is happening. And if there exist a million rows our query will be inefficient. To optimize the query let us add a condition s1.id > s2.id to ensure that comparison happens only once.
Therefore all we need to do now is subtract each score from each other score and order this difference in ascending order using the ORDER BY clause. Since we need the closest score, we will print only the first row using the LIMIT clause.
s1.name AS s1_name
, s2.name AS s2_name
, ABS(s1.score - s2.score) AS score_diff
FROM scores AS s1
INNER JOIN scores AS s2
ON s1.id != s2.id
AND s1.id > s2.id
ORDER BY score_diff ASC, s1_name ASC
Given the above two tables employees and departments, select the top department which consists of the highest percentage of employees who earn over 1500 in salary and have at least 2 employees.
We can break down the given question into separate clauses.
1) Top department.
2) % of employees making over 1500 in salary.
3) Department must have at least 2 employees.
Before solving the above clauses let us have a combined representation of the two tables where each employee is associated with their department names. Since both tables have a common column
department_id we can perform a join using
employees AS e INNER JOIN departments AS d
ON e.department_id = d.id
Now that we have this full representation, we can start filtering, aggregating to get our output. We know that we need to calculate the total number of employees that are making over 1500 by each department. We can use the GROUP BY clause on dept_name since we need one row for each department.
To find the percentage of people earning over 1500 we can use —
(Number of people making over 1500) / (Total number of people in that department)
Finally, we need to find the count the number of Employee IDs to filter departments having more than 2 employees.
d.Dept_Name AS d_name,
SUM(CASE WHEN e.salary > 1500 THEN 1 ELSE 0 END)/COUNT(DISTINCT e.id) AS pct_above_1500,
COUNT(DISTINCT e.id) AS number_of_employees
FROM employees e JOIN departments d ON e.dept_id = d.dept_id
GROUP BY d_name
HAVING number_of_employees >= 2
ORDER BY pct_above_1500 DESC
We use the SUM and CASE clause to find the number of employees with a salary greater than 1500.
In the above table, we are given scores of IDs. W need to write a SQL Query to output the rank of these scores. Incase, the scores are tied, the same rank must be assigned to both the scores.
Firstly, let us segregate the distinct scores and create another copy of the input table.
For each score in the S: Scores Table, we want to find out the number of Scores in the S2:Scores table that are greater than or equal. We can group by ID and compare the scores.
This is nothing but the rank of that particular score!
SELECT S.Score, COUNT(S2.Score) AS Rank FROM Scores S,
(SELECT DISTINCT Score FROM Scores) AS S2
GROUP BY S.Id
ORDER BY S.Score DESC;
The above table consists of Visitor ID, Visitor Date, and number of Visitors on that particular date. No two same dates exist. Write an SQL query to display three or more consecutive ID’s, having visitors more than or equal to 100 sorted by their visiting date. IDs will be consecutive and unique.
You may argue that ID 2 and 3 also have more than 100 visitors and are not displayed in the output. The main condition is to display 3 or more IDs with visitors more than 100 here. Hence, 5,6,7,8 are displayed.
I found this question to be tricky as the title of this article suggests 🙂 Similar to Query 2 we need to create a self join on the stadium table. We can compare IDs and compute our results. A given row s1 can be the first, middle, or last one in the 3 consecutive rows.
- s1 in the beginning: (s1.id + 1 = s2.id AND s1.id + 2 = s3.id)
- s1 in the middle: (s1.id — 1 = s2.id AND s1.id + 1 = s3.id)
- s1 in the end: (s1.id — 2 = s2.id AND s1.id — 1 = s3.id)
SELECT s1.* FROM stadium AS s1, stadium AS s2, stadium as s3 WHERE ((s1.id + 1 = s2.id AND s1.id + 2 = s3.id) OR (s1.id - 1 = s2.id AND s1.id + 1 = s3.id) OR (s1.id - 2 = s2.id AND s1.id - 1 = s3.id) ) AND s1.people>=100 AND s2.people>=100 AND s3.people>=100GROUP BY s1.id ORDER BY s1.visit_date
Practice is the key to mastering SQL. This article explains 15 tricky questions and how they can be solved. There are a variety of ways to write a SQL query. This is one of SQL’s specialties. I hope you learned something new today!