使用 DISTINCT 配分函数计数()可能超过

为了得到一个不同的 NumUsers 的运行总数,我尝试编写以下代码,如下所示:

NumUsers = COUNT(DISTINCT [UserAccountKey]) OVER (PARTITION BY [Mth])

管理工作室似乎对此不太高兴。当我删除 DISTINCT关键字时,这个错误就会消失,但是它不会是一个明显的计数。

DISTINCT在分区函数中似乎是不可能的。 我怎样才能找到清晰的数字呢?我是否使用更多的 传统方法,比如相关子查询?

进一步来看,这些 OVER函数的工作方式可能与 Oracle 不同,因为它们不能在 SQL-Server中用来计算运行总量。

我在 SQLfiddle上添加了一个现场示例,我试图用一个配分函数来计算运行总量。

222967 次浏览

我认为在 SQL-Server 2008R2中实现这一点的唯一方法是使用相关的子查询或外部应用程序:

SELECT  datekey,
COALESCE(RunningTotal, 0) AS RunningTotal,
COALESCE(RunningCount, 0) AS RunningCount,
COALESCE(RunningDistinctCount, 0) AS RunningDistinctCount
FROM    document
OUTER APPLY
(   SELECT  SUM(Amount) AS RunningTotal,
COUNT(1) AS RunningCount,
COUNT(DISTINCT d2.dateKey) AS RunningDistinctCount
FROM    Document d2
WHERE   d2.DateKey <= document.DateKey
) rt;

这可以在 SQL-Server2012中使用您建议的语法来完成:

SELECT  datekey,
SUM(Amount) OVER(ORDER BY DateKey) AS RunningTotal
FROM    document

但是,使用 DISTINCT仍然是不允许的,所以如果 DISTINCT 是必需的和/或如果升级不是一个选项,那么我认为 OUTER APPLY是你的最佳选择

使用 dense_rank()有一个非常简单的解决方案

dense_rank() over (partition by [Mth] order by [UserAccountKey])
+ dense_rank() over (partition by [Mth] order by [UserAccountKey] desc)
- 1

This will give you exactly what you were asking for: The number of distinct UserAccountKeys within each month.

我使用的解决方案类似于上面的 大卫,但是如果一些行应该被排除在计数之外,那么会有额外的扭曲。这假设[ UserAccountKey ]永远不为空。

-- subtract an extra 1 if null was ranked within the partition,
-- which only happens if there were rows where [Include] <> 'Y'
dense_rank() over (
partition by [Mth]
order by case when [Include] = 'Y' then [UserAccountKey] else null end asc
)
+ dense_rank() over (
partition by [Mth]
order by case when [Include] = 'Y' then [UserAccountKey] else null end desc
)
- max(case when [Include] = 'Y' then 0 else 1 end) over (partition by [Mth])
- 1

这里可以找到带有扩展示例的 SQLFiddle。

死灵法师:

It's relativiely simple to emulate a COUNT DISTINCT over PARTITION BY with MAX via DENSE_RANK:

;WITH baseTable AS
(
SELECT 'RM1' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM1' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR3' AS ADR
UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM3' AS RM, 'ADR2' AS ADR
)
,CTE AS
(
SELECT RM, ADR, DENSE_RANK() OVER(PARTITION BY RM ORDER BY ADR) AS dr
FROM baseTable
)
SELECT
RM
,ADR


,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY ADR) AS cnt1
,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM) AS cnt2
-- Not supported
--,COUNT(DISTINCT CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY CTE.ADR) AS cntDist
,MAX(CTE.dr) OVER (PARTITION BY CTE.RM ORDER BY CTE.RM) AS cntDistEmu
FROM CTE

注:
这里假设所涉及的字段是非空字段。
如果字段中有一个或多个 NULL 条目,则需要减去1。

在简单的 SQL 中有一个解决方案:

SELECT time, COUNT(DISTINCT user) OVER(ORDER BY time) AS users
FROM users

=>

SELECT time, COUNT(*) OVER(ORDER BY time) AS users
FROM (
SELECT user, MIN(time) AS time
FROM users
GROUP BY user
) t

我徘徊在这里,基本上与 whytheq相同的问题,并找到了 大卫的解决方案,但随后不得不审查我的旧自我教程笔记关于 DENSE _ RANK,因为我使用它是如此罕见: 为什么 DENSE _ RANK 而不是 RANK 或 ROW _ NUMBER,它实际上是如何工作的?在这个过程中,我更新了这个教程,包括我的版本的 大卫的解决方案,这个特殊的问题,然后认为它可能是有益的 SQL 新手(或像我这样的人谁忘记东西)。

整个教程文本可以复制/粘贴到查询编辑器中,然后每个示例查询可以(单独)取消注释并运行,以查看各自的结果。(默认情况下,此问题的解决方案在底部未注释。)或者,可以将每个示例分别复制到它们自己的查询编辑实例中,但是必须将 TBLx CTE 包含在每个示例中。

--WITH /* DB2 version */
--TBLx (Col_A, Col_B) AS (VALUES
--     (  7,     7  ),
--     (  7,     7  ),
--     (  7,     7  ),
--     (  7,     8  ))


WITH /* SQL-Server version */
TBLx    (Col_A, Col_B) AS
(SELECT  7,     7    UNION ALL
SELECT  7,     7    UNION ALL
SELECT  7,     7    UNION ALL
SELECT  7,     8)


/*** Example-A: demonstrates the difference between ROW_NUMBER, RANK and DENSE_RANK ***/


--SELECT Col_A, Col_B,
--  ROW_NUMBER() OVER(PARTITION BY Col_A ORDER BY Col_B) AS ROW_NUMBER_,
--  RANK() OVER(PARTITION BY Col_A ORDER BY Col_B)       AS RANK_,
--  DENSE_RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS DENSE_RANK_
--FROM TBLx


/* RESULTS:
Col_A  Col_B  ROW_NUMBER_  RANK_  DENSE_RANK_
7      7        1          1        1
7      7        2          1        1
7      7        3          1        1
7      8        4          4        2


ROW_NUMBER: Just increments for the three identical rows and increments again for the final unique row.
That is, it’s an order-value (based on "sort" order) but makes no other distinction.
                 

RANK: Assigns the same rank value to the three identical rows, then jumps to 4 for the fourth row,
which is *unique* with regard to the others.
That is, each identical row is ranked by the rank-order of the first row-instance of that
(identical) value-set.
                 

DENSE_RANK: Also assigns the same rank value to the three identical rows but the fourth *unique* row is
assigned a value of 2.
That is, DENSE_RANK identifies that there are (only) two *unique* row-types in the row set.
*/


/*** Example-B: to get only the distinct resulting "count-of-each-row-type" rows ***/


--  SELECT DISTINCT -- For unique returned "count-of-each-row-type" rows, the DISTINCT operator is necessary because
--                  -- the calculated DENSE_RANK value is appended to *all* rows in the data set.  Without DISTINCT,
--                  -- its value for each original-data row-type would just be replicated for each of those rows.
--
--    Col_A, Col_B,
--    DENSE_RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS DISTINCT_ROWTYPE_COUNT_
--  FROM TBLx


/* RESULTS:
Col_A  Col_B  DISTINCT_ROWTYPE_COUNT_
7      7            1
7      8            2
*/


/*** Example-C.1: demonstrates the derivation of the "count-of-all-row-types" (finalized in Example-C.2, below) ***/


--  SELECT
--    Col_A, Col_B,
--
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC) AS ROW_TYPES_COUNT_DESC_,
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC) AS ROW_TYPES_COUNT_ASC_,
--
--    -- Adding the above cases together and subtracting one gives the same total count for on each resulting row:
--
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC)
--       +
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC)
--      - 1   /* (Because DENSE_RANK values are one-based) */
--      AS ROW_TYPES_COUNT_
--  FROM TBLx


/* RESULTS:
COL_A  COL_B  ROW_TYPES_COUNT_DESC_  ROW_TYPES_COUNT_ASC_  ROW_TYPES_COUNT_
7      7            2                     1                    2
7      7            2                     1                    2
7      7            2                     1                    2
7      8            1                     2                    2
      

*/


/*** Example-C.2: uses the above technique to get a *single* resulting "count-of-all-row-types" row ***/


SELECT DISTINCT -- For a single returned "count-of-all-row-types" row, the DISTINCT operator is necessary because the
-- calculated DENSE_RANK value is appended to *all* rows in the data set.  Without DISTINCT, that
-- value would just be replicated for each original-data row.
                  

--    Col_A, Col_B, -- In order to get a *single* returned "count-of-all-row-types" row (and field), all other fields
-- must be excluded because their respective differing row-values will defeat the purpose of the
-- DISTINCT operator, above.
                   

DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC)
+
DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC)
- 1   /* (Because DENSE_RANK values are one-based) */
AS ROW_TYPES_COUNT_
FROM TBLx
  

/* RESULTS:


ROW_TYPES_COUNT_
2
*/