How to use Window functions in SQL?

SQL has evolved a lot since its widely adopted standard SQL-92. Most of the queries which developers use belong to this standard. It represents the relational database model.

Markus Winand of modern-sql.com says SQL has grown five times since that standard and SQL is not just a relational database model anymore. You can store JSON types, write recursive queries and perform Window functions etc.

This post looks into the power of Window functions.

Let us look into it through a use case.

Let’s say you want to create statistics on the total number of corona cases across three cities in India over a time of five days.

Below is the data you have:

datecase_countcity
17-08-2020100mumbai
18-08-2020200mumbai
19-08-2020350mumbai
20-08-2020500mumbai
21-08-2020900mumbai
17-08-202010chennai
18-08-202070chennai
19-08-2020150chennai
20-08-2020400chennai
21-08-2020800chennai
17-08-2020300delhi
18-08-2020670delhi
19-08-2020980delhi
20-08-20201100delhi
21-08-20201500delhi

How do you find the total number of cases in each city?

You could use the aggregate function sum():

select city,sum(case_count) from corona_stats group by city

The above query returns a single row for each city:

citysum
mumbai2050
delhi4550
chennai1430

But what if we you want to retrieve all the rows and want the total count against each row. That is you don’t want to loose the identity of any row.

Finding total count:

Window functions come to the rescue!

There is just one more keyword needed to perform Window functions :

over()

Here is the query to find out the total cases in each city against each row :

select *,sum(case_count) over(partition by city ) from corona_stats

Here is the output:

datecase_countcitysum
17-08-202010chennai1430
21-08-2020800chennai1430
20-08-2020400chennai1430
19-08-2020150chennai1430
18-08-202070chennai1430
21-08-20201500delhi4550
17-08-2020300delhi4550
18-08-2020670delhi4550
19-08-2020980delhi4550
20-08-20201100delhi4550
17-08-2020100mumbai2050
21-08-2020900mumbai2050
20-08-2020500mumbai2050
19-08-2020350mumbai2050
18-08-2020200mumbai2050

partition by keyword is used to partition the records based on the city and then perform the aggregate operation sum() over each partition.

The same output returned by an aggregate function is returned by the Window function but without loosing each row. Window literally means a set of rows . And Window functions means performing one or more aggregate function against each row without loosing the row data in the output.

Finding running total:

Now let’s say you want to find the running total of the number of cases for each day. That is , the total count as of a particular day.

To achieve this , just use order by clause inside over() keyword:

select *,sum(case_count) over(partition by city order by date ) from corona_stats

Here is the output:

datedatecitysum
17-08-202010chennai10
18-08-202070chennai80
19-08-2020150chennai230
20-08-2020400chennai630
21-08-2020800chennai1430
17-08-2020300delhi300
18-08-2020670delhi970
19-08-2020980delhi1950
20-08-20201100delhi3050
21-08-20201500delhi4550
17-08-2020100mumbai100
18-08-2020200mumbai300
19-08-2020350mumbai650
20-08-2020500mumbai1150
21-08-2020900mumbai2050

As you see as the day progresses from 17-08-2020 to 21-08-2020 the running total in the last column increases giving the total count as of that particular day.

Finding count of the first day:

Now let’s say you want to find out the number of cases on the first day against each day . You need to use first_value() keyword.

The below query does that :

select *, first_value(case_count) over(partition by city order by date ) from corona_stats

This produces the below output:

datecase_countcityfirst_value
17-08-202010chennai10
18-08-202070chennai10
19-08-2020150chennai10
20-08-2020400chennai10
21-08-2020800chennai10
17-08-2020300delhi300
18-08-2020670delhi300
19-08-2020980delhi300
20-08-20201100delhi300
21-08-20201500delhi300
17-08-2020100mumbai100
18-08-2020200mumbai100
19-08-2020350mumbai100
20-08-2020500mumbai100
21-08-2020900mumbai100

As you see , the number of cases on the first day is displayed along the output for each date.

Finding count of previous and next days:

Now let’s say you want the number of cases on the previous day and the next day for each date.

To find the number of cases on the previous day use lag(case_count,1).

Here 1 means one day before the current day.

To find the number of cases on the next day use lead(case_count,1).

Again here 1 means one day after the current day.

Here is the query :

select city,date,case_count, lag(case_count,1) over(partition by city order by date ) as previous_day_count , lead(case_count,1) over(partition by city order by date) as next_day_count from corona_stats

Notice that you need to use over() keyword for every aggregate function (used both in lag() and lead() functions in the above case).

Here is the output:

citydatecase_countprevious_day_countnext_day_count
chennai17-08-202010NULL70
chennai18-08-20207010150
chennai19-08-202015070400
chennai20-08-2020400150800
chennai21-08-2020800400NULL
delhi17-08-2020300NULL670
delhi18-08-2020670300980
delhi19-08-20209806701100
delhi20-08-202011009801500
delhi21-08-202015001100NULL
mumbai17-08-2020100NULL200
mumbai18-08-2020200100350
mumbai19-08-2020350200500
mumbai20-08-2020500350900
mumbai21-08-2020900500NULL

Ranking cities

Now let’s say for a given date , we want to rank the cities based on the number of cases.

We can use rank() keyword for this.

Here is the query to rank the cities based on the total number of cases on 19 the August 2020:

select city ,date,case_count, rank() over (order by case_count desc) from corona_stats where date ='2020-08-19'

Here is the output:

citydatecase_countrank
delhi19-08-20209801
mumbai19-08-20203502
chennai19-08-20201503

If two cities have the same count they will be given the same rank and the next rank will be skipped (say delhi and mumbai have the same count 980 , they both will be assigned rank 1 and chennai will be assigned rank 3)

In order not to skip any rank use dense_rank() (in this case if delhi and mumbai have the same count 980 , they both will be assigned rank 1 and chennai will be assigned rank 2)

Finding row number:

Now let’s say you want to know the row number of each row.

This can be achieved through the keyword row_number().

Here is the query :

select city ,date,case_count, row_number() over (order by case_count) from corona_stats where date ='2020-08-19'

Here is the output:

citydatecase_countrow_number
chennai19-08-20201501
mumbai19-08-20203502
delhi19-08-20209803

Organizing cases into different groups:

Now let’s say you want to separate cases in each city into a certain number of groups.

Let’s say for each city you want to separate the cases into three different groups so that you can assign a health official to each of them.

Let’s sort the records based on the case count and then divide them into three groups for each city.

This can be achieved using ntile() keyword.

It simply separates the records into different groups and assigns them a group number.

The number passed as parameter to ntile() keyword represents the number of groups.

So ntile(3) means three groups will be created and a group number will be assigned to each of them in sequence.

Here is the query to separate the records into three different groups for each city:

select city ,date,case_count, ntile(3) over (partition by city order by case_count) from corona_stats

And here is the output:

citydatecase_countntile
chennai17-08-2020101
chennai18-08-2020701
chennai19-08-20201502
chennai20-08-20204002
chennai21-08-20208003
delhi17-08-20203001
delhi18-08-20206701
delhi19-08-20209802
delhi20-08-202011002
delhi21-08-202015003
mumbai17-08-20201001
mumbai18-08-20202001
mumbai19-08-20203502
mumbai20-08-20205002
mumbai21-08-20209003

As you see for each city , three different groups are created after the records are sorted based on case count. Since there are five records in total for each city, dividing them into three groups gives two groups each of two records and one group of the remaining record.

I used PostGreSQL database in the above samples. The query should be the same for other databases as well.

That’s it!

We got a glimpse of the power of Window functions in SQL!


Posted

in

by

Comments

Leave a Reply

Discover more from The Full Stack Developer

Subscribe now to keep reading and get access to the full archive.

Continue reading