Guide to SQL Window functions: When and how to use them?
Introduction
Window functions operate on a set of rows (window) and return an aggregated result hence the name “Window Functions”. Window Functions are extremely useful when we don’t want the output table to be grouped and want to retain the total number of rows as were before applying the function.
Window functions can be classified into the following three types:
The dataset that we will be using to illustrate the examples given in this code is a simple customer transaction dataset with demographic columns (State, Age) and transaction columns (Customer ID, Transaction ID, Sales) and it can be downloaded from this link
Aggregate Functions
Let’s first start with the aggregate window functions. These functions are used to find out the maximum, minimum, mean, count and total of a particular metric. In the following code example, it has been illustrated how to aggregate the sales metric for each customer, keeping the number of rows consistent.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
SELECT Customer_ID,Transaction_ID,State, Sales, SUM(Sales) OVER (PARTITION BY Customer_ID) AS Sales_total, MAX(Sales) OVER (PARTITION BY Customer_ID) AS Sales_max, MIN(Sales) OVER (PARTITION BY Customer_ID) AS Sales_min, COUNT(Sales) OVER (PARTITION BY Customer_ID) AS Sales_count, AVG(Sales) OVER (PARTITION BY Customer_ID) AS Sales_avg FROM Customer_transaction order by Customer_ID; |
We can interpret the above output as “the customer with ID 1234 made total of 4 purchases and all were of worth10$ because of which the maximum, minimum and average of sales is 10$ for the customer with the ID 1234”
Ranking Functions
Rank
The rank function as the name suggests is used to rank the records across a particular variable. In the below example we have ranked a customer’s individual orders therefore we have written the statement “partition by Customer_ID” so that the ranking starts from one for each customer.
1 2 3 |
SELECT Customer_ID, Transaction_ID, State, Age, rank() OVER ( partition by Customer_ID order by Transaction_ID ) AS 'rank' FROM Customer_transaction; |
In the above output for the customer with ID “1234” we can see there are four transactions that have been ranked as 1,2,3,4 and for the second customer ID 1234 the two orders have been ranked as 1 and 2 exactly how we wanted the output to be.
Dense Rank
The difference between dense_rank and rank is that:
Rank gives the ranking within the partition. Ties are given the same rank, with the next ranking(s) skipped. So, if you have 2 records at rank 2, the next rank listed would be ranked 4.
Dense_rank also gives the ranking within the partition, but the ranks are consecutive. No ranks are skipped if there are ranks with multiple items.
1 2 3 |
SELECT Customer_ID,Transaction_ID, State, Age,Sales, DENSE_RANK() OVER ( partition by State order by Sales desc ) AS 'dense_rank' FROM Customer_transaction; |
In the above output it can be noticed that no ranks are skipped and the ranks are consecutive even so when the partition by column i.e. State is the same for all records
Row Number
The row number function simple adds a column to the table that contains the sequential number of rows.
1 2 3 4 5 |
SELECT Customer_ID, State, Sales, ROW_NUMBER() OVER(ORDER BY Customer_ID) RowNumber FROM Customer_transaction; |
N-tile
The n-tile function is used to divide the dataset into n number of quantiles for example the below dataset has been divided into 10 quantiles. This value of n can be passed on as an attribute in the n-tile function as can be seen below.
1 2 3 |
SELECT *, NTILE(10) OVER(ORDER BY Sales) AS ntile1 FROM Customer_transaction; |
The records that are marked as 1 in the n-tile column belong to the first quantile and the records marked as 2 belong to the second quantile. The whole dataset has been divided into 10 quantiles.
Value Functions
Lag
The lag function is used to find out the value stored in the previous record. In the below example we have calculated a Previous_Transaction_ID column using the lag function.
1 2 3 4 |
Select State, Customer_ID, Transaction_ID, lag (Transaction_ID) over (partition by Customer_ID order by Customer_ID) as 'Previous_Transaction_ID' from Customer_transaction; |
As can be seen in the above output the Previous_Transaction_ID column also contains nulls, this is because all the rows that are nulls represent the first order of the respective customer.
Lead
The lead function is very similar to the lag function, here instead of the previous record value we calculate the next record value as is illustrated in the below example.
1 2 3 |
Select State,Customer_ID, Transaction_ID, lead (Transaction_ID) over (partition by Customer_ID order by Customer_ID) as 'Next_Transaction_ID' from Customer_transaction; |
In the above output we can see that the Next Transaction ID is Null for some values, these records are null as they are the last transaction made by the respective customer.
First Value and Last Value
The first and last value aggregate functions are very similar to each other as the name infers, the first value function returns the first value of the respective record we are partitioning by and the last value function returns the last value of the record we are partitioning by.
1 2 3 4 5 6 7 8 9 10 |
Select Customer_ID, Transaction_ID, FIRST_VALUE(Transaction_ID) OVER ( partition by Customer_ID ORDER BY Transaction_ID) First_Transaction, LAST_VALUE(Transaction_ID) OVER ( partition by Customer_ID ORDER BY Transaction_ID RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) Last_Transaction FROM Customer_transaction; |
As can be seen in the above output for the Customer ID “1234” the first transaction ID is 1 and the last transaction ID is 234 as is displayed in the output.
We have now gained the knowledge about SQL window functions and we can now apply these functions to real world data and solve business problems more efficiently.