SQL has evolved a lot since its widely adopted standard SQL-92. Most of the queries which developers use belong to this standard. It represents the relational database model.
Markus Winand of modern-sql.com says SQL has grown five times since that standard and SQL is not just a relational database model anymore. You can store JSON types, write recursive queries and perform Window functions etc.
This post looks into the power of Window functions.
Let us look into it through a use case.
Let’s say you want to create statistics on the total number of corona cases across three cities in India over a time of five days.
Below is the data you have:
date | case_count | city |
17-08-2020 | 100 | mumbai |
18-08-2020 | 200 | mumbai |
19-08-2020 | 350 | mumbai |
20-08-2020 | 500 | mumbai |
21-08-2020 | 900 | mumbai |
17-08-2020 | 10 | chennai |
18-08-2020 | 70 | chennai |
19-08-2020 | 150 | chennai |
20-08-2020 | 400 | chennai |
21-08-2020 | 800 | chennai |
17-08-2020 | 300 | delhi |
18-08-2020 | 670 | delhi |
19-08-2020 | 980 | delhi |
20-08-2020 | 1100 | delhi |
21-08-2020 | 1500 | delhi |
How do you find the total number of cases in each city?
You could use the aggregate function sum():
select city,sum(case_count) from corona_stats group by city
The above query returns a single row for each city:
city | sum |
mumbai | 2050 |
delhi | 4550 |
chennai | 1430 |
But what if we you want to retrieve all the rows and want the total count against each row. That is you don’t want to loose the identity of any row.
Finding total count:
Window functions come to the rescue!
There is just one more keyword needed to perform Window functions :
over()
Here is the query to find out the total cases in each city against each row :
select *,sum(case_count) over(partition by city ) from corona_stats
Here is the output:
date | case_count | city | sum |
17-08-2020 | 10 | chennai | 1430 |
21-08-2020 | 800 | chennai | 1430 |
20-08-2020 | 400 | chennai | 1430 |
19-08-2020 | 150 | chennai | 1430 |
18-08-2020 | 70 | chennai | 1430 |
21-08-2020 | 1500 | delhi | 4550 |
17-08-2020 | 300 | delhi | 4550 |
18-08-2020 | 670 | delhi | 4550 |
19-08-2020 | 980 | delhi | 4550 |
20-08-2020 | 1100 | delhi | 4550 |
17-08-2020 | 100 | mumbai | 2050 |
21-08-2020 | 900 | mumbai | 2050 |
20-08-2020 | 500 | mumbai | 2050 |
19-08-2020 | 350 | mumbai | 2050 |
18-08-2020 | 200 | mumbai | 2050 |
partition by keyword is used to partition the records based on the city and then perform the aggregate operation sum() over each partition.
The same output returned by an aggregate function is returned by the Window function but without loosing each row. Window literally means a set of rows . And Window functions means performing one or more aggregate function against each row without loosing the row data in the output.
Finding running total:
Now let’s say you want to find the running total of the number of cases for each day. That is , the total count as of a particular day.
To achieve this , just use order by clause inside over() keyword:
select *,sum(case_count) over(partition by city order by date ) from corona_stats
Here is the output:
date | date | city | sum |
17-08-2020 | 10 | chennai | 10 |
18-08-2020 | 70 | chennai | 80 |
19-08-2020 | 150 | chennai | 230 |
20-08-2020 | 400 | chennai | 630 |
21-08-2020 | 800 | chennai | 1430 |
17-08-2020 | 300 | delhi | 300 |
18-08-2020 | 670 | delhi | 970 |
19-08-2020 | 980 | delhi | 1950 |
20-08-2020 | 1100 | delhi | 3050 |
21-08-2020 | 1500 | delhi | 4550 |
17-08-2020 | 100 | mumbai | 100 |
18-08-2020 | 200 | mumbai | 300 |
19-08-2020 | 350 | mumbai | 650 |
20-08-2020 | 500 | mumbai | 1150 |
21-08-2020 | 900 | mumbai | 2050 |
As you see as the day progresses from 17-08-2020 to 21-08-2020 the running total in the last column increases giving the total count as of that particular day.
Finding count of the first day:
Now let’s say you want to find out the number of cases on the first day against each day . You need to use first_value() keyword.
The below query does that :
select *, first_value(case_count) over(partition by city order by date ) from corona_stats
This produces the below output:
date | case_count | city | first_value |
17-08-2020 | 10 | chennai | 10 |
18-08-2020 | 70 | chennai | 10 |
19-08-2020 | 150 | chennai | 10 |
20-08-2020 | 400 | chennai | 10 |
21-08-2020 | 800 | chennai | 10 |
17-08-2020 | 300 | delhi | 300 |
18-08-2020 | 670 | delhi | 300 |
19-08-2020 | 980 | delhi | 300 |
20-08-2020 | 1100 | delhi | 300 |
21-08-2020 | 1500 | delhi | 300 |
17-08-2020 | 100 | mumbai | 100 |
18-08-2020 | 200 | mumbai | 100 |
19-08-2020 | 350 | mumbai | 100 |
20-08-2020 | 500 | mumbai | 100 |
21-08-2020 | 900 | mumbai | 100 |
As you see , the number of cases on the first day is displayed along the output for each date.
Finding count of previous and next days:
Now let’s say you want the number of cases on the previous day and the next day for each date.
To find the number of cases on the previous day use lag(case_count,1).
Here 1 means one day before the current day.
To find the number of cases on the next day use lead(case_count,1).
Again here 1 means one day after the current day.
Here is the query :
select city,date,case_count, lag(case_count,1) over(partition by city order by date ) as previous_day_count , lead(case_count,1) over(partition by city order by date) as next_day_count from corona_stats
Notice that you need to use over() keyword for every aggregate function (used both in lag() and lead() functions in the above case).
Here is the output:
city | date | case_count | previous_day_count | next_day_count |
chennai | 17-08-2020 | 10 | NULL | 70 |
chennai | 18-08-2020 | 70 | 10 | 150 |
chennai | 19-08-2020 | 150 | 70 | 400 |
chennai | 20-08-2020 | 400 | 150 | 800 |
chennai | 21-08-2020 | 800 | 400 | NULL |
delhi | 17-08-2020 | 300 | NULL | 670 |
delhi | 18-08-2020 | 670 | 300 | 980 |
delhi | 19-08-2020 | 980 | 670 | 1100 |
delhi | 20-08-2020 | 1100 | 980 | 1500 |
delhi | 21-08-2020 | 1500 | 1100 | NULL |
mumbai | 17-08-2020 | 100 | NULL | 200 |
mumbai | 18-08-2020 | 200 | 100 | 350 |
mumbai | 19-08-2020 | 350 | 200 | 500 |
mumbai | 20-08-2020 | 500 | 350 | 900 |
mumbai | 21-08-2020 | 900 | 500 | NULL |
Ranking cities
Now let’s say for a given date , we want to rank the cities based on the number of cases.
We can use rank() keyword for this.
Here is the query to rank the cities based on the total number of cases on 19 the August 2020:
select city ,date,case_count, rank() over (order by case_count desc) from corona_stats where date ='2020-08-19'
Here is the output:
city | date | case_count | rank |
delhi | 19-08-2020 | 980 | 1 |
mumbai | 19-08-2020 | 350 | 2 |
chennai | 19-08-2020 | 150 | 3 |
If two cities have the same count they will be given the same rank and the next rank will be skipped (say delhi and mumbai have the same count 980 , they both will be assigned rank 1 and chennai will be assigned rank 3)
In order not to skip any rank use dense_rank() (in this case if delhi and mumbai have the same count 980 , they both will be assigned rank 1 and chennai will be assigned rank 2)
Finding row number:
Now let’s say you want to know the row number of each row.
This can be achieved through the keyword row_number().
Here is the query :
select city ,date,case_count, row_number() over (order by case_count) from corona_stats where date ='2020-08-19'
Here is the output:
city | date | case_count | row_number |
chennai | 19-08-2020 | 150 | 1 |
mumbai | 19-08-2020 | 350 | 2 |
delhi | 19-08-2020 | 980 | 3 |
Organizing cases into different groups:
Now let’s say you want to separate cases in each city into a certain number of groups.
Let’s say for each city you want to separate the cases into three different groups so that you can assign a health official to each of them.
Let’s sort the records based on the case count and then divide them into three groups for each city.
This can be achieved using ntile() keyword.
It simply separates the records into different groups and assigns them a group number.
The number passed as parameter to ntile() keyword represents the number of groups.
So ntile(3) means three groups will be created and a group number will be assigned to each of them in sequence.
Here is the query to separate the records into three different groups for each city:
select city ,date,case_count, ntile(3) over (partition by city order by case_count) from corona_stats
And here is the output:
city | date | case_count | ntile |
chennai | 17-08-2020 | 10 | 1 |
chennai | 18-08-2020 | 70 | 1 |
chennai | 19-08-2020 | 150 | 2 |
chennai | 20-08-2020 | 400 | 2 |
chennai | 21-08-2020 | 800 | 3 |
delhi | 17-08-2020 | 300 | 1 |
delhi | 18-08-2020 | 670 | 1 |
delhi | 19-08-2020 | 980 | 2 |
delhi | 20-08-2020 | 1100 | 2 |
delhi | 21-08-2020 | 1500 | 3 |
mumbai | 17-08-2020 | 100 | 1 |
mumbai | 18-08-2020 | 200 | 1 |
mumbai | 19-08-2020 | 350 | 2 |
mumbai | 20-08-2020 | 500 | 2 |
mumbai | 21-08-2020 | 900 | 3 |
As you see for each city , three different groups are created after the records are sorted based on case count. Since there are five records in total for each city, dividing them into three groups gives two groups each of two records and one group of the remaining record.
I used PostGreSQL database in the above samples. The query should be the same for other databases as well.
That’s it!
We got a glimpse of the power of Window functions in SQL!
Leave a Reply