SQL Order of execution

Did you know that the SQL order of execution is not the same when you write than when the query is executed in the compiler?\

Learning the basics is essential to understand how you can optimize further your code once you start looking for optimizations.

Let’s check it with one example (imagine the result will be millions of rows)
This is how you would write a simple query:

SELECT *
FROM customers c
JOIN departments d
ON c.department_name = d.department_name
WHERE age > 16
GROUP BY last_name
ORDER BY upload_time 
LIMIT 100

However this is the order of how the compiler would execute the query:

1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. ORDER BY
7. LIMIT

What conclusions can we extract from this basic analysis if we would like to optimize the query (as querying millions of records takes its time)

Compiler do not SELECT everything first and later make filters with WHERE and HAVING, but the opposite, first it filters and later SELECTs it.
Also the importance of performing the necessary joins as well as having the bigger filter possible in the WHERE clause or in the early stages of the compiler. With this we ensure the least amount of rows in the early stages.
This also gives us an idea of how we can limit the number of rows using LIMIT clause which happens at the end of the execution and if no filter added we will make LIMIT of millions of records. If instead we filter with WHERE clause instead of using LIMIT clause, the performance of the query will be better.
In summary, the least rows we can get at each step of the execution until the end, the better.

Example to illustrate this:

SELECT with ORDER BY and LIMIT

SELECT name 
FROM customers
ORDER BY customer_id
LIMIT 100

SELECT with WHERE and ORDER BY

SELECT name 
FROM customers
WHERE customer_name = 'Jane'
ORDER BY customer_id

Let’s suppose the result of both queries is 100 records(imagine there are 100 Jane’s in our dataset).
The second query will be faster, but why?
Let’s analyze both queries.
In the second query:

Firstly the compiler filters with WHERE clause - the result is 100 rows out of 10 millions.
Later the compiler SELECT 100 rows
The compiler ORDER BY with the 100 rows we selected. However, in the first query:
The compiler does not make any filter.
The compiler SELECT 10 million rows
The compiler ORDER BY the 10 millions of rows
The compiler makes the filter with LIMIT giving us 100 rows.

Same result, but ORDER BY in the first query is a very expensive as it’s playing with 10 million rows and not with 100, thus, giving us a slower result. Now, imagine this with a more complicated query with multiple joins, subqueries and CTE’s!

Do you want to know more about SQL optimizations?
https://wetrustindata.com/sql_optimizations/