Count Distinct and Window Functions

In today’s example we are going to see how to solve the following error using Count Distinct and Window Functions. It is an error that we have had with some queries and we want to share it with you. Remember that if you need additional help, do not hesitate to contact us .

The error starts from the following alert message.

Preparing the example

To reach the above conclusion and solve it, let’s first build a scenario. We are going to work with the Adventure Works DB ( https://docs.microsoft.com/es-es/sql/samples/adventureworks-install-configure?view=sql-server-ver15&tabs=ssms )

Let’s use the Product and SalesOrderDetail tables, both in the SalesLT schema. Each row of order details is part of an order and is related to a product included in the order. The product has a category and a color.

We will use the following query to start:

SELECT salesorderid ,
 Count ( * ) AS ItemsPerOrder ,
 Sum ( unitprice * orderqty ) AS Total
FROM saleslt . product p
 INNER JOIN saleslt . salesorderdetail s
 ON p. productid = s . productivity
GROUP BY salesorderid 

This query returns the count of items in each order and the total value of the order.

Let’s add a few more calculations to the query, none of them challenging:

SELECT salesorderid ,
 Count ( * ) AS ItemsPerOrder ,
 Sum ( unitprice * orderqty ) AS Total ,
 Count ( DISTINCT productcategoryid ) CategoriesPerOrder ,
 Count ( DISTINCT color ) ColorPerOrder
FROM saleslt . product p
 INNER JOIN saleslt . salesorderdetail s
 ON p. productid = s . productivity
GROUP BY salesorderid 

Identifying the problem of Count Distinct and Window Functions

Now, let’s imagine that, along with this information, we would also like to know how many different colors per category there are in this order.

The group of only has SalesOrderId. Because of that, our first natural conclusion is to try a window partition, like this:

SELECT SalesOrderID ,
Count ( * ) AS ItemsPerOrder ,
Sum ( unitPrice * OrderQty ) AS total ,
Count ( DISTINCT productcategoryid ) CategoriesPerOrder ,
Count ( DISTINCT of color ) ColorPerOrder ,
Count ( DISTINCT of color )
OVER (
partition BY productcategoryid ) ColorPerCategory
DE saleslt. product p
INNER JOIN saleslt . salesorderdetail s
ON p. productid = s . productivity
GROUP BY salesorderid

Our problem starts with this query. Count Distinct does not support window partitioning , we need to find a different way to achieve the same result

Solution Planning

We are counting the rows, so we can use DENSE_RANK to achieve the same result, extracting the last value last, we can use a MAX for that. This works similarly to the distinct count in that all matched, records with the same value, receive the same rank value, so the largest value will be the same as the distinct count.

There are two ranking functions: RANK and DENSE_RANK . The difference is how they treat ties.

RANK: After a tie, the tally skips the number of tied items, leaving a gap.

DENSE_RANK: No jump after a tie, count continues sequentially

The following query is an example of the difference:

SELECT ProductID ,
 color ,
 Range ( )
 OVER (
 ORDER BY color)[rango] ,
 DENSE_RANK ( )
 OVER (
 ORDER BY color) [DENSE_RANK]
DE saleslt . product
WHERE color IS NOT NULL 
list rank Count Distinct and Window Functions
The new query using DENSE_RANK will look like this:
SELECT salesorderid ,
 Count ( * ) AS ItemsPerOrder ,
 Sum ( unitprice * orderqty ) AS Total ,
 Count ( DISTINCT productcategoryid ) CategoriesPerOrder ,
 Count ( DISTINCT color ) ColorPerOrder ,
 Dense_rank ( )
 OVER (
 partition BY productcategoryid
 ORDER BY color ) ColorPerCategory
FROM saleslt . product p
 INNER JOIN saleslt . salesorderdetail s
 ON p. productid = s . productivity
GROUP BY salesorderid 

However, the result is not what we would expect:

The group by and over clause do not work together.
The fields used in the over clause must also be included in the group, for the query to work.

Solving the Solution

The first step to solving the problem is to add more fields to the group. Of course this will affect the whole result, it will not be what we really expect. The query will be like this:

SELECT SalesOrderID ,
 productcategoryid ,
 Count ( * ) AS ItemsPerOrder ,
 Sum ( unitPrice * OrderQty ) AS total ,
 1 Color Per Order ,
 DENSE_RANK ( )
 OVER (
 partition BY SalesOrderID , productcategoryid
 ORDER BY color ) ColorPerCategory
DE saleslt . product p
 INNER JOIN saleslt .salesorderdetail s
 ON p. productid = s . productivity
GROUP BY salesorderid ,
 productcategoryid ,
 Colour 

There are two interesting changes in the calculation:

  • CategoriesPerOrder was removed because “group by” is a few levels below this calculation.
  • ColorPerOrder is a fixed value, because we are grouping by color.

We need to do more calculations on the result of this query, the best solution for this is to use CTE – Common Table Expressions .

2nd consultation level

The 2nd level of calculations will be aggregating the data by ProductCategoryId , removing one of the aggregation levels.

Count Distinct and Window Functions

;WITH ranking
 AS (SELECT salesorderid,
 productcategoryid,
 Count(*) AS ItemsPerOrder,
 Sum(unitprice * orderqty) AS Total,
 Count(DISTINCT color) ColorPerOrder,
 Dense_rank()
 OVER (
 partition BY salesorderid, productcategoryid
 ORDER BY color) ColorPerCategory
 FROM saleslt.product p
 INNER JOIN saleslt.salesorderdetail s
 ON p.productid = s.productid
 GROUP BY salesorderid,
 productcategoryid,
 Colour)
SELECT salesorderid,
 Sum(itemsperorder) ItemsPerOrder,
 Sum(total) Total,
 1 CategoriesPerOrder,
 Sum(colorperorder) ColorPerOrder,
 Max(colorpercategory) ColorPerCategory
FROM ranking
GROUP BY salesorderid,
 productcategoryid

The query calculations are defined by how the aggregations were performed in the first query:

  • ItemsPerOrder: We do a SUM of the COUNT results, this will add the different counts.
  • Total: A simple SUM over the SUM already done.
  • CategoriesPerOrder – can have a fixed number of 1, since we are still adding by category
  • ColorPerOrder: We make a SUM over the already existing COUNT of the previous query
  • ColorPerCategory: After doing DENSE_RANK , we now need to extract the MAX value to have the same effect as COUNT DISTINCT

3rd Consultation Level

In the third step we reduce the aggregation, achieving our final result, the aggregation by SalesOrderID :

;WITH constant
 AS (SELECT salesorderid,
 productcategoryid,
 Count(*) AS ItemsPerOrder,
 Sum(unitprice * orderqty) AS Total,
 Count(DISTINCT color) ColorPerOrder,
 Dense_rank()
 OVER (
 partition BY salesorderid, productcategoryid
 ORDER BY color) ColorPerCategory
 FROM saleslt.product p
 INNER JOIN saleslt.salesorderdetail s
 ON p.productid = s.productid
 GROUP BY salesorderid,
 productcategoryid,
 Colour),
 cte2
 AS (SELECT salesorderid,
 Sum(itemsperorder) ItemsPerOrder,
 Sum(total) Total,
 1 CategoriesPerOrder,
 Sum(colorperorder) ColorPerOrder,
 Max(colorpercategory) ColorPerCategory
 FROM account
 GROUP BY salesorderid,
 productcategoryid)
SELECT salesorderid,
 Sum(itemsperorder) ItemsPerOrder,
 Sum(total) Total,
 Sum(categoriesperorder) CategoriesPerOrder,
 Sum(colorperorder) ColorPerOrder,
 Sum(colorpercategory) ColorPerCategory
FROM account
GROUP BY salesorderid 

Once again, the calculations are based on the previous queries. Some of them are the same as the second query, adding more rows. However, there are some different calculations:

  • CategoriesPerOrder: Converts to SUM to achieve the result we’d like.
  • ColorPerCategory – Converts to SUM, adding all distinct count results from each category.

The execution plan

The execution plan generated by this query is not as bad as we might imagine. This query could benefit from additional indexes and improved JOIN, but other than that the plan looks pretty good.

The join is done using the ProductId field, so an index into the SalesOrderDetail table by ProductId and covering the additional fields used will help the query

We can create the index in the declaration:

CREATE INDEX inorderdetail
ON saleslt.salesorderdetail (productid)
include (orderqty, unitprice)

What is interesting to note in this query plan is the SORT, which now takes 50% of the query. This does not mean that the execution time of the SORT has changed, it means that the execution time for the entire query was reduced and the SORT became a higher percentage of the total execution time.

We hope you will be useful. If you don’t want to miss our tickets. Join our newsletter monthly, to be up to date with all our publications. You will only receive one email per month.

Greetings.

Leave a Reply

Your email address will not be published. Required fields are marked *