SQL Fundamentals

Key SQL skills

Here are key questions and answers when assessing skill of SQL

Key questions and answers

How would you find duplicate records in a table?

This strongly depends on the data in the SQL database however the code is generally the same.

To find the number of duplicate records based on multiple columns:

SELECT first_name, last_name, email, COUNT(*) AS duplicate_count

FROM customers

GROUP BY first_name, last_name, email

HAVING COUNT(*) > 1;

Produces a table which shows us which (if any) records that have a duplicate count above 1.

Write a query to join two tables and filter results based on a specific condition.

The example I will use is between a customer and orders table, joining the tables where the order amount is greater than £100.

SELECT 
    [customers].customer_id,
    [customers].first_name,
    [customers].last_name,
    [customers].email,
    [orders].order_id,
    [orders].order_date,
    [orders].total_amount
FROM customers
JOIN orders ON [customers].customer_id = [orders].customer_id
WHERE [orders].total_amount > 100
ORDER BY [orders].order_date DESC;

The customer_id is the primary key in the customer table but a foreign key in the orders table with order_id being the primary key in the orders table. The output of this would be a table showing all the related customers and their orders which were over £100 sorted by order date, descending.

Explain the difference between ‘INNER JOIN’, ‘LEFT JOIN’, ‘RIGHT JOIN’ and ‘FULL OUTER JOIN’.

All joins in SQL combine data from different tables, based on a related column, for example, customer_id.

INNER JOIN

Only returns matching rows between both tables. Exclude rows that have no match in either table.

SELECT [customers].*, [orders].*
FROM customers
INNER JOIN orders
ON [customers].id = [orders].id;

LEFT JOIN

Returns all the records from the left table (customers) and only matching records from the right table (orders). For non-matching records, NULLs are returned for the right table records.

SELECT [customers].*, [orders].*
FROM customers
LEFT JOIN orders
ON [customers].id = [orders].id;

RIGHT JOIN

Returns all the records from the right table (orders) and only matching records from the left table (customers). For non-matching records, NULLs are returned for the left table records.

SELECT [customers].*, [orders].*
FROM customers
RIGHT JOIN orders
ON [customers].id = [orders].id;

FULL OUTER JOIN

Returns all the records that have a match in either table. Records with no match will have NULLs in the missing column values.

SELECT [customers].*, [orders].*
FROM customers
FULL OUTER JOIN orders
ON [customers].id = [orders].id;

How do you optimise a slow running query?

Depending on the database there are a number of methods to optimise a slow running query:

Index use

If the tables are small or have a low cardinality (low number of unique values) then indexes are not helpful. However, for high-cardinality and large tables they can avoid the slow process of a while table scan during a query. Here are some key indexes to use:

Primary key index (clustered index): When a primary key is created, a clustered index is created, sorts the table based in this key. Faster searches using columns such as customer_id.

CREATE TABLE customers (
    customer_id INT PRIMARY KEY, 
    name VARCHAR(100)
);

Unique index: Useful to unique values in columns, such as email addresses for customers. Prevents duplicate emails in this instance.

CREATE UNIQUE INDEX idx_email ON customers(email);

Standard index: This is a pointer to a full row. Useful for queries looking for specific numbers such as order_quantity = 100 or customer_id = 2495.

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

Composite index: This indexes multiple columns together if there are queries that filter/sort by those multiple columns.

CREATE INDEX idx_name_dob ON customers(last_name, first_name, dob);

Full-text index: Enables fuzzy searches (similar words) in large text fields (varchar/text) to speed up queries.

CREATE FULLTEXT INDEX ON products(product_review);

Data retrieval

Avoid using SELECT * as this retrieves all columns where most of the time, only a few columns are likely needed. Therefore, SELECT the columns that are relevant to the query where possible.

SELECT order_id, order_date, total_amount FROM orders WHERE customer_id = 5;

Filtering data

Use WHERE to reduce the dataset retrieved however do not filter with functions as this prevents index usage which can slow down the query

SELECT * FROM orders WHERE YEAR(order_date) = 2024;
SELECT * FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';

The second is faster as the WHERE YEAR(order_date) needs to calculate the YEAR for each row as only the order_date has been selected, stopping the use of indexes of the whole table.

Check data types

Ensure that columns that use only numbers use INTEGER instead of VARCHAR and dates use DATE/TIME instead of VARCHAR. This reduces storage and improves indexing, this speeding up queries.

ALTER TABLE orders MODIFY COLUMN customer_id INT;

Partition large tables

If we have large data tables, it is useful to separate them by date, a key or region into other tables so we are only querying specific, smaller tables.

CREATE TABLE orders_2024 PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

Use caching

If a large query is being run multiple times, we can use materialised view to store these recomputed results.

CREATE MATERIALIZED VIEW top_customers AS
SELECT customer_id, COUNT(*) AS total_orders
FROM orders
GROUP BY customer_id;

Use SET SHOWPLAN (SQL server)

This shows how the query is executed, allowing us to see if the query takes a long time and to identify any full table scans or missing indexes that could be added.

SET SHOWPLAN SELECT order_id, total_amount FROM orders WHERE customer_id = 5;

Optimise joins

Keeping indexed columns in joins can help query speed but also using INNER JOIN instead of OUTER JOIN where possible to reduce NULL values and rows.

How would you use a subquery to filter results based on a condition in another table?

Subqueries are very effective when you wish to use related tables to filter or search through the records.

Here are two examples:

HAVING

Here the SUM(total_amount) determines the total spent per customer which is compared against the average over amount in a subquery which returns only customers which have spent more than the average order amount.

SELECT customer_id, SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(total_amount) > (
    SELECT AVG(total_amount) FROM orders
);

SELECT

Here the names and customer IDs are taken and filtered using a subquery where a every customer_id primary key matched with the customer_id in the orders table is counted, which counts the number of orders each customer has made.

SELECT
    [customers].customer_id,
    [customers].name,
        (SELECT COUNT(*) FROM orders WHERE [orders].customer_id = [customers].customer_id) AS  total_orders
        FROM customers;

WHERE

Here we select all the records from the orders tables but in a subquery, only selects the customers which live in London, filtering one table by using a value from another.

SELECT * FROM orders
WHERE customer_id IN (
    SELECT customer_id FROM customers WHERE city = 'London'
);

EXISTS

(When checking a table, EXISTS is more efficient than IN as it stops checking after the first match where IN always checks the whole table).

Here all customers are selected but in a subquery checks if there is at least one instance of customer_id matching in the orders table, showing this customer has ordered before. Only shows the customers who have placed at least one order.

SELECT * FROM customers
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE [orders].customer_id = [customers].customer_id
);

Explain how you would use a window function to calculate a running total or moving average.

Window functions perform calculations across a window (subset of rows) without grouping and collapsing all the rows at the end of the calculations. Aggregate functions such as SUM() or COUNT() return a single value and close the selected rows.

Running total using SUM()

Here the SUM(total_amount) calculates the sum of a customer’s order where the OVER / PARTITION resets the running total for each customer and the ORDER BY ensures the running total increases row by row in date order.

SELECT 
    order_id,
    customer_id,
    order_date,
    total_amount,
    SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM orders;

Moving average using AVG()

Here a 3-day moving average on customer orders is created. AVG(total_amount) is the average sales of the selected rows. The PARTITION BY shows the moving average resets after each customer. The ROWS BETWEEN 2 PRECEDING AND CURRENT ROW indicates to use the 2 previous rows and the current as the window to process the moving average.

SELECT 
    order_id,
    customer_id,
    order_date,
    total_amount,
    AVG(total_amount) OVER (
        PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg
FROM orders;

What are CTEs, and how would you use them to simplify complex queries?

CTEs are common table expressions are results of a query that are temporarily stored to be used in other queries. This makes complex queries more readable and avoids repeating code if a query is repeated within another query.

Here, two CTEs are created, total_orders (which totals the orders of each customer) and ranked_customers (which uses the total_orders CTE to rank the customers based on their total spending, using a variable total_spent from the first CTE). The last line of code uses the ranked_customers CTE to find the top 3 customers by total spending.

WITH total_orders AS (
    SELECT customer_id, SUM(total_amount) AS total_spent
    FROM orders
    GROUP BY customer_id
),
ranked_customers AS (
    SELECT customer_id, total_spent,
           RANK() OVER (ORDER BY total_spent DESC) AS ranking
    FROM total_orders
)
SELECT customer_id, total_spent FROM ranked_customers WHERE ranking <= 3;