Guide to SQL Window functions: When and how to use them?

Introduction

Window functions operate on a set of rows (window) and return an aggregated result hence the name “Window Functions”. Window Functions are extremely useful when we don’t want the output table to be grouped and want to retain the total number of rows as were before applying the function.

Window functions can be classified into the following three types:

Window_Function_Types

Window_Function_Types

The dataset that we will be using to illustrate the examples given in this code is a simple customer transaction dataset with demographic columns (State, Age) and transaction columns (Customer ID, Transaction ID, Sales) and it can be downloaded from this link

Aggregate Functions

Let’s first start with the aggregate window functions. These functions are used¬† to find out the maximum, minimum, mean, count and total of a particular metric. In the following code example, it has been illustrated how to aggregate the sales metric for each customer, keeping the number of rows consistent.

Aggregate_Functions_output

Aggregate_Functions_output

We can interpret the above output as “the customer with ID 1234 made total of 4 purchases and all were of worth10$ because of which the maximum, minimum and average of sales is 10$ for the customer with the ID 1234”

Ranking Functions

Rank

The rank function as the name suggests is used to rank the records across a particular variable. In the below example we have ranked a customer’s individual orders therefore we have written the statement “partition by Customer_ID” so that the ranking starts from one for each customer.

Rank_Output

Rank_Output

In the above output for the customer with ID “1234” we can see there are four transactions that have been ranked as 1,2,3,4 and for the second customer ID 1234 the two orders have been ranked as 1 and 2 exactly how we wanted the output to be.

Dense Rank

The difference between dense_rank and rank is that:

Rank gives the ranking within the partition. Ties are given the same rank, with the next ranking(s) skipped. So, if you have 2 records at rank 2, the next rank listed would be ranked 4.

Dense_rank also gives the ranking within the partition, but the ranks are consecutive. No ranks are skipped if there are ranks with multiple items.

Denserank_Output

Dense_rank Output

In the above output it can be noticed that no ranks are skipped and the ranks are consecutive even so when the partition by column i.e. State is the same for all records

Row Number

The row number function simple adds a column to the table that contains the sequential number of rows.

Rownum_Output

Rownum_Output

N-tile

The n-tile function is used to divide the dataset into n number of quantiles for example the below dataset has been divided into 10 quantiles. This value of n can be passed on as an attribute in the n-tile function as can be seen below.

ntile_Output

ntile_Output

The records that are marked as 1 in the n-tile column belong to the first quantile and the records marked as 2 belong to the second quantile. The whole dataset has been divided into 10 quantiles.

Value Functions

Lag

The lag function is used to find out the value stored in the previous record. In the below example we have calculated a Previous_Transaction_ID column using the lag function.

Lag_Output

Lag_Output

As can be seen in the above output the Previous_Transaction_ID column also contains nulls, this is because all the rows that are nulls represent the first order of the respective customer.

Lead

The lead function is very similar to the lag function, here instead of the previous record value we calculate the next record value as is illustrated in the below example.

Lead_Output

Lead_Output

In the above output we can see that the Next Transaction ID is Null for some values, these records are null as they are the last transaction made by the respective customer.

First Value and Last Value

The first and last value aggregate functions are very similar to each other as the name infers, the first value function returns the first value of the respective record we are partitioning by and the last value function returns the last value of the record we are partitioning by.

First_Last_Output

First_Last_Output

As can be seen in the above output for the Customer ID “1234” the first transaction ID is 1 and the last transaction ID is 234 as is displayed in the output.

We have now gained the knowledge about SQL window functions and we can now apply these functions to real world data and solve business problems more efficiently.

You may also like...