How to use Window functions in SQL?

SQL has evolved a lot since its widely adopted standard SQL-92. Most of the queries which developers use belong to this standard. It represents the relational database model.

Markus Winand of modern-sql.com says SQL has grown five times since that standard and SQL is not just a relational database model anymore. You can store JSON types, write recursive queries and perform Window functions etc.

This post looks into the power of Window functions.

Let us look into it through a use case.

Let’s say you want to create statistics on the total number of corona cases across three cities in India over a time of five days.

Below is the data you have:

date	case_count	city
17-08-2020	100	mumbai
18-08-2020	200	mumbai
19-08-2020	350	mumbai
20-08-2020	500	mumbai
21-08-2020	900	mumbai
17-08-2020	10	chennai
18-08-2020	70	chennai
19-08-2020	150	chennai
20-08-2020	400	chennai
21-08-2020	800	chennai
17-08-2020	300	delhi
18-08-2020	670	delhi
19-08-2020	980	delhi
20-08-2020	1100	delhi
21-08-2020	1500	delhi

How do you find the total number of cases in each city?

You could use the aggregate function sum():

select city,sum(case_count) from corona_stats group by city

The above query returns a single row for each city:

city	sum
mumbai	2050
delhi	4550
chennai	1430

But what if we you want to retrieve all the rows and want the total count against each row. That is you don’t want to loose the identity of any row.

Finding total count:

Window functions come to the rescue!

There is just one more keyword needed to perform Window functions :

over()

Here is the query to find out the total cases in each city against each row :

select *,sum(case_count) over(partition by city ) from corona_stats

Here is the output:

date	case_count	city	sum
17-08-2020	10	chennai	1430
21-08-2020	800	chennai	1430
20-08-2020	400	chennai	1430
19-08-2020	150	chennai	1430
18-08-2020	70	chennai	1430
21-08-2020	1500	delhi	4550
17-08-2020	300	delhi	4550
18-08-2020	670	delhi	4550
19-08-2020	980	delhi	4550
20-08-2020	1100	delhi	4550
17-08-2020	100	mumbai	2050
21-08-2020	900	mumbai	2050
20-08-2020	500	mumbai	2050
19-08-2020	350	mumbai	2050
18-08-2020	200	mumbai	2050

partition by keyword is used to partition the records based on the city and then perform the aggregate operation sum() over each partition.

The same output returned by an aggregate function is returned by the Window function but without loosing each row. Window literally means a set of rows . And Window functions means performing one or more aggregate function against each row without loosing the row data in the output.

Finding running total:

Now let’s say you want to find the running total of the number of cases for each day. That is , the total count as of a particular day.

To achieve this , just use order by clause inside over() keyword:

select *,sum(case_count) over(partition by city order by date ) from corona_stats

Here is the output:

date	date	city	sum
17-08-2020	10	chennai	10
18-08-2020	70	chennai	80
19-08-2020	150	chennai	230
20-08-2020	400	chennai	630
21-08-2020	800	chennai	1430
17-08-2020	300	delhi	300
18-08-2020	670	delhi	970
19-08-2020	980	delhi	1950
20-08-2020	1100	delhi	3050
21-08-2020	1500	delhi	4550
17-08-2020	100	mumbai	100
18-08-2020	200	mumbai	300
19-08-2020	350	mumbai	650
20-08-2020	500	mumbai	1150
21-08-2020	900	mumbai	2050

As you see as the day progresses from 17-08-2020 to 21-08-2020 the running total in the last column increases giving the total count as of that particular day.

Finding count of the first day:

Now let’s say you want to find out the number of cases on the first day against each day . You need to use first_value() keyword.

The below query does that :

select *, first_value(case_count) over(partition by city order by date ) from corona_stats

This produces the below output:

date	case_count	city	first_value
17-08-2020	10	chennai	10
18-08-2020	70	chennai	10
19-08-2020	150	chennai	10
20-08-2020	400	chennai	10
21-08-2020	800	chennai	10
17-08-2020	300	delhi	300
18-08-2020	670	delhi	300
19-08-2020	980	delhi	300
20-08-2020	1100	delhi	300
21-08-2020	1500	delhi	300
17-08-2020	100	mumbai	100
18-08-2020	200	mumbai	100
19-08-2020	350	mumbai	100
20-08-2020	500	mumbai	100
21-08-2020	900	mumbai	100

As you see , the number of cases on the first day is displayed along the output for each date.

Finding count of previous and next days:

Now let’s say you want the number of cases on the previous day and the next day for each date.

To find the number of cases on the previous day use lag(case_count,1).

Here 1 means one day before the current day.

To find the number of cases on the next day use lead(case_count,1).

Again here 1 means one day after the current day.

Here is the query :

select city,date,case_count, lag(case_count,1) over(partition by city order by date ) as previous_day_count , lead(case_count,1) over(partition by city order by date) as next_day_count from corona_stats

Notice that you need to use over() keyword for every aggregate function (used both in lag() and lead() functions in the above case).

Here is the output:

city	date	case_count	previous_day_count	next_day_count
chennai	17-08-2020	10	NULL	70
chennai	18-08-2020	70	10	150
chennai	19-08-2020	150	70	400
chennai	20-08-2020	400	150	800
chennai	21-08-2020	800	400	NULL
delhi	17-08-2020	300	NULL	670
delhi	18-08-2020	670	300	980
delhi	19-08-2020	980	670	1100
delhi	20-08-2020	1100	980	1500
delhi	21-08-2020	1500	1100	NULL
mumbai	17-08-2020	100	NULL	200
mumbai	18-08-2020	200	100	350
mumbai	19-08-2020	350	200	500
mumbai	20-08-2020	500	350	900
mumbai	21-08-2020	900	500	NULL

Ranking cities

Now let’s say for a given date , we want to rank the cities based on the number of cases.

We can use rank() keyword for this.

Here is the query to rank the cities based on the total number of cases on 19 the August 2020:

select city ,date,case_count, rank() over (order by case_count desc) from corona_stats where date ='2020-08-19'

Here is the output:

city	date	case_count	rank
delhi	19-08-2020	980	1
mumbai	19-08-2020	350	2
chennai	19-08-2020	150	3

If two cities have the same count they will be given the same rank and the next rank will be skipped (say delhi and mumbai have the same count 980 , they both will be assigned rank 1 and chennai will be assigned rank 3)

In order not to skip any rank use dense_rank() (in this case if delhi and mumbai have the same count 980 , they both will be assigned rank 1 and chennai will be assigned rank 2)

Finding row number:

Now let’s say you want to know the row number of each row.

This can be achieved through the keyword row_number().

Here is the query :

select city ,date,case_count, row_number() over (order by case_count) from corona_stats where date ='2020-08-19'

Here is the output:

city	date	case_count	row_number
chennai	19-08-2020	150	1
mumbai	19-08-2020	350	2
delhi	19-08-2020	980	3

Organizing cases into different groups:

Now let’s say you want to separate cases in each city into a certain number of groups.

Let’s say for each city you want to separate the cases into three different groups so that you can assign a health official to each of them.

Let’s sort the records based on the case count and then divide them into three groups for each city.

This can be achieved using ntile() keyword.

It simply separates the records into different groups and assigns them a group number.

The number passed as parameter to ntile() keyword represents the number of groups.

So ntile(3) means three groups will be created and a group number will be assigned to each of them in sequence.

Here is the query to separate the records into three different groups for each city:

select city ,date,case_count, ntile(3) over (partition by city order by case_count) from corona_stats

And here is the output:

city	date	case_count	ntile
chennai	17-08-2020	10	1
chennai	18-08-2020	70	1
chennai	19-08-2020	150	2
chennai	20-08-2020	400	2
chennai	21-08-2020	800	3
delhi	17-08-2020	300	1
delhi	18-08-2020	670	1
delhi	19-08-2020	980	2
delhi	20-08-2020	1100	2
delhi	21-08-2020	1500	3
mumbai	17-08-2020	100	1
mumbai	18-08-2020	200	1
mumbai	19-08-2020	350	2
mumbai	20-08-2020	500	2
mumbai	21-08-2020	900	3

As you see for each city , three different groups are created after the records are sorted based on case count. Since there are five records in total for each city, dividing them into three groups gives two groups each of two records and one group of the remaining record.

I used PostGreSQL database in the above samples. The query should be the same for other databases as well.

That’s it!

We got a glimpse of the power of Window functions in SQL!

How to use Window functions in SQL?

Finding total count:

Finding running total:

Finding count of the first day:

Finding count of previous and next days:

Ranking cities

Finding row number:

Organizing cases into different groups:

Like this:

Comments

Leave a ReplyCancel reply

How to use Window functions in SQL?

Finding total count:

Finding running total:

Finding count of the first day:

Finding count of previous and next days:

Ranking cities

Finding row number:

Organizing cases into different groups:

Share this:

Like this:

Comments

Leave a ReplyCancel reply

Discover more from The Full Stack Developer