SQL to determine minimum sequential days of access?

The following User History table contains one record for every day a given user has accessed a website (in a 24 hour UTC period). It has many thousands of records, but only one record per day per user. If the user has not accessed the website for that day, no record will be generated.

Id      UserId   CreationDate
------  ------   ------------
750997      12   2009-07-07 18:42:20.723
750998      15   2009-07-07 18:42:20.927
751000      19   2009-07-07 18:42:22.283

What I'm looking for is a SQL query on this table with good performance, that tells me which userids have accessed the website for (n) continuous days without missing a day.

In other words, how many users have (n) records in this table with sequential (day-before, or day-after) dates? If any day is missing from the sequence, the sequence is broken and should restart again at 1; we're looking for users who have achieved a continuous number of days here with no gaps.

Any resemblance between this query and a particular Stack Overflow badge is purely coincidental, of course.. :)

12662 次浏览

If you can change the table schema, I'd suggest adding a column LongestStreak to the table which you'd set to the number of sequential days ending to the CreationDate. It's easy to update the table at login time (similar to what you are doing already, if no rows exist of the current day, you'll check if any row exists for the previous day. If true, you'll increment the LongestStreak in the new row, otherwise, you'll set it to 1.)

The query will be obvious after adding this column:

if exists(select * from table
where LongestStreak >= 30 and UserId = @UserId)
-- award the Woot badge.

Doing this with a single SQL query seems overly complicated to me. Let me break this answer down in two parts.

  1. What you should have done until now and should start doing now:
    Run a daily cron job that checks for every user wether he has logged in today and then increments a counter if he has or sets it to 0 if he hasn't.
  2. What you should do now:
    - Export this table to a server that doesn't run your website and won't be needed for a while. ;)
    - Sort it by user, then date.
    - go through it sequentially, keep a counter...

Seems like you could take advantage of the fact that to be continuous over n days would require there to be n rows.

So something like:

SELECT users.UserId, count(1) as cnt
FROM users
WHERE users.CreationDate > now() - INTERVAL 30 DAY
GROUP BY UserId
HAVING cnt = 30

This should do what you want but I don't have enough data to test efficiency. The convoluted CONVERT/FLOOR stuff is to strip the time portion off the datetime field. If you're using SQL Server 2008 then you could use CAST(x.CreationDate AS DATE).

DECLARE @Range as INT
SET @Range = 10


SELECT DISTINCT UserId, CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))
FROM tblUserLogin a
WHERE EXISTS
(SELECT 1
FROM tblUserLogin b
WHERE a.userId = b.userId
AND (SELECT COUNT(DISTINCT(CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, CreationDate)))))
FROM tblUserLogin c
WHERE c.userid = b.userid
AND CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, c.CreationDate))) BETWEEN CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate))) and CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))+@Range-1) = @Range)

Creation script

CREATE TABLE [dbo].[tblUserLogin](
[Id] [int] IDENTITY(1,1) NOT NULL,
[UserId] [int] NULL,
[CreationDate] [datetime] NULL
) ON [PRIMARY]

The answer is obviously:

SELECT DISTINCT UserId
FROM UserHistory uh1
WHERE (
SELECT COUNT(*)
FROM UserHistory uh2
WHERE uh2.CreationDate
BETWEEN uh1.CreationDate AND DATEADD(d, @days, uh1.CreationDate)
) = @days OR UserId = 52551

EDIT:

Okay here's my serious answer:

DECLARE @days int
DECLARE @seconds bigint
SET @days = 30
SET @seconds = (@days * 24 * 60 * 60) - 1
SELECT DISTINCT UserId
FROM (
SELECT uh1.UserId, Count(uh1.Id) as Conseq
FROM UserHistory uh1
INNER JOIN UserHistory uh2 ON uh2.CreationDate
BETWEEN uh1.CreationDate AND
DATEADD(s, @seconds, DATEADD(dd, DATEDIFF(dd, 0, uh1.CreationDate), 0))
AND uh1.UserId = uh2.UserId
GROUP BY uh1.Id, uh1.UserId
) as Tbl
WHERE Conseq >= @days

EDIT:

[Jeff Atwood] This is a great fast solution and deserves to be accepted, but Rob Farley's solution is also excellent and arguably even faster (!). Please check it out too!

If this is so important to you, source this event and drive a table to give you this info. No need to kill the machine with all those crazy queries.

Tweaking Bill's query a bit. You might have to truncate the date before grouping to count only one login per day...

SELECT UserId from History
WHERE CreationDate > ( now() - n )
GROUP BY UserId,
DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) AS TruncatedCreationDate
HAVING COUNT(TruncatedCreationDate) >= n

EDITED to use DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) instead of convert( char(10) , CreationDate, 101 ).

@IDisposable I was looking to use datepart earlier but i was too lazy to look up the syntax so i figured i d use convert instead. I dint know it had a significant impact Thanks! now i know.

You could use a recursive CTE (SQL Server 2005+):

WITH recur_date AS (
SELECT t.userid,
t.creationDate,
DATEADD(day, 1, t.created) 'nextDay',
1 'level'
FROM TABLE t
UNION ALL
SELECT t.userid,
t.creationDate,
DATEADD(day, 1, t.created) 'nextDay',
rd.level + 1 'level'
FROM TABLE t
JOIN recur_date rd on t.creationDate = rd.nextDay AND t.userid = rd.userid)
SELECT t.*
FROM recur_date t
WHERE t.level = @numDays
ORDER BY t.userid

Spencer almost did it, but this should be the working code:

SELECT DISTINCT UserId
FROM History h1
WHERE (
SELECT COUNT(*)
FROM History
WHERE UserId = h1.UserId AND CreationDate BETWEEN h1.CreationDate AND DATEADD(d, @n-1, h1.CreationDate)
) >= @n

I used a simple math property to identify who consecutively accessed the site. This property is that you should have the day difference between the first time access and last time equal to number of records in your access table log.

Here are SQL script that I tested in Oracle DB (it should work in other DBs as well):

-- show basic understand of the math properties
select    ceil(max (creation_date) - min (creation_date))
max_min_days_diff,
count ( * ) real_day_count
from   user_access_log
group by   user_id;




-- select all users that have consecutively accessed the site
select   user_id
from   user_access_log
group by   user_id
having       ceil(max (creation_date) - min (creation_date))
/ count ( * ) = 1;






-- get the count of all users that have consecutively accessed the site
select   count(user_id) user_count
from   user_access_log
group by   user_id
having   ceil(max (creation_date) - min (creation_date))
/ count ( * ) = 1;

Table prep script:

-- create table
create table user_access_log (id           number, user_id      number, creation_date date);




-- insert seed data
insert into user_access_log (id, user_id, creation_date)
values   (1, 12, sysdate);


insert into user_access_log (id, user_id, creation_date)
values   (2, 12, sysdate + 1);


insert into user_access_log (id, user_id, creation_date)
values   (3, 12, sysdate + 2);


insert into user_access_log (id, user_id, creation_date)
values   (4, 16, sysdate);


insert into user_access_log (id, user_id, creation_date)
values   (5, 16, sysdate + 1);


insert into user_access_log (id, user_id, creation_date)
values   (6, 16, sysdate + 5);

Something like this?

select distinct userid
from table t1, table t2
where t1.UserId = t2.UserId
AND trunc(t1.CreationDate) = trunc(t2.CreationDate) + n
AND (
select count(*)
from table t3
where t1.UserId  = t3.UserId
and CreationDate between trunc(t1.CreationDate) and trunc(t1.CreationDate)+n
) = n

Off the top of my head, MySQLish:

SELECT start.UserId
FROM UserHistory AS start
LEFT OUTER JOIN UserHistory AS pre_start ON pre_start.UserId=start.UserId
AND DATE(pre_start.CreationDate)=DATE_SUB(DATE(start.CreationDate), INTERVAL 1 DAY)
LEFT OUTER JOIN UserHistory AS subsequent ON subsequent.UserId=start.UserId
AND DATE(subsequent.CreationDate)<=DATE_ADD(DATE(start.CreationDate), INTERVAL 30 DAY)
WHERE pre_start.Id IS NULL
GROUP BY start.Id
HAVING COUNT(subsequent.Id)=30

Untested, and almost certainly needs some conversion for MSSQL, but I think that give some ideas.

How about (and please make sure the previous statement ended with a semi-colon):

WITH numberedrows
AS (SELECT ROW_NUMBER() OVER (PARTITION BY UserID
ORDER BY CreationDate)
- DATEDIFF(day,'19000101',CreationDate) AS TheOffset,
CreationDate,
UserID
FROM   tablename)
SELECT MIN(CreationDate),
MAX(CreationDate),
COUNT(*) AS NumConsecutiveDays,
UserID
FROM   numberedrows
GROUP  BY UserID,
TheOffset

The idea being that if we have list of the days (as a number), and a row_number, then missed days make the offset between these two lists slightly bigger. So we're looking for a range that has a consistent offset.

You could use "ORDER BY NumConsecutiveDays DESC" at the end of this, or say "HAVING count(*) > 14" for a threshold...

I haven't tested this though - just writing it off the top of my head. Hopefully works in SQL2005 and on.

...and would be very much helped by an index on tablename(UserID, CreationDate)

Edited: Turns out Offset is a reserved word, so I used TheOffset instead.

Edited: The suggestion to use COUNT(*) is very valid - I should've done that in the first place but wasn't really thinking. Previously it was using datediff(day, min(CreationDate), max(CreationDate)) instead.

Rob

How about one using Tally tables? It follows a more algorithmic approach, and execution plan is a breeze. Populate the tallyTable with numbers from 1 to 'MaxDaysBehind' that you want to scan the table (ie. 90 will look for 3 months behind,etc).

declare @ContinousDays int
set @ContinousDays = 30  -- select those that have 30 consecutive days


create table #tallyTable (Tally int)
insert into #tallyTable values (1)
...
insert into #tallyTable values (90) -- insert numbers for as many days behind as you want to scan


select [UserId],count(*),t.Tally from HistoryTable
join #tallyTable as t on t.Tally>0
where [CreationDate]> getdate()-@ContinousDays-t.Tally and
[CreationDate]<getdate()-t.Tally
group by [UserId],t.Tally
having count(*)>=@ContinousDays


delete #tallyTable

Joe Celko has a complete chapter on this in SQL for Smarties (calling it Runs and Sequences). I don't have that book at home, so when I get to work... I'll actually answer this. (assuming history table is called dbo.UserHistory and the number of days is @Days)

Another lead is from SQL Team's blog on runs

The other idea I've had, but don't have a SQL server handy to work on here is to use a CTE with a partitioned ROW_NUMBER like this:

WITH Runs
AS
(SELECT UserID
, CreationDate
, ROW_NUMBER() OVER(PARTITION BY UserId
ORDER BY CreationDate)
- ROW_NUMBER() OVER(PARTITION BY UserId, NoBreak
ORDER BY CreationDate) AS RunNumber
FROM
(SELECT UH.UserID
, UH.CreationDate
, ISNULL((SELECT TOP 1 1
FROM dbo.UserHistory AS Prior
WHERE Prior.UserId = UH.UserId
AND Prior.CreationDate
BETWEEN DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), -1)
AND DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), 0)), 0) AS NoBreak
FROM dbo.UserHistory AS UH) AS Consecutive
)
SELECT UserID, MIN(CreationDate) AS RunStart, MAX(CreationDate) AS RunEnd
FROM Runs
GROUP BY UserID, RunNumber
HAVING DATEDIFF(dd, MIN(CreationDate), MAX(CreationDate)) >= @Days

The above is likely WAY HARDER than it has to be, but left as an a brain tickle for when you have some other definition of "a run" than just dates.

assuming a schema that goes like:

create table dba.visits
(
id  integer not null,
user_id integer not null,
creation_date date not null
);

this will extract contiguous ranges from a date sequence with gaps.

select l.creation_date  as start_d, -- Get first date in contiguous range
(
select min(a.creation_date ) as creation_date
from "DBA"."visits" a
left outer join "DBA"."visits" b on
a.creation_date = dateadd(day, -1, b.creation_date ) and
a.user_id  = b.user_id
where b.creation_date  is null and
a.creation_date  >= l.creation_date  and
a.user_id  = l.user_id
) as end_d -- Get last date in contiguous range
from  "DBA"."visits" l
left outer join "DBA"."visits" r on
r.creation_date  = dateadd(day, -1, l.creation_date ) and
r.user_id  = l.user_id
where r.creation_date  is null
declare @startdate as datetime, @days as int
set @startdate = cast('11 Jan 2009' as datetime) -- The startdate
set @days = 5 -- The number of consecutive days


SELECT userid
,count(1) as [Number of Consecutive Days]
FROM UserHistory
WHERE creationdate >= @startdate
AND creationdate < dateadd(dd, @days, cast(convert(char(11), @startdate, 113)  as datetime))
GROUP BY userid
HAVING count(1) >= @days

The statement cast(convert(char(11), @startdate, 113) as datetime) removes the time part of the date so we start at midnight.

I would assume also that the creationdate and userid columns are indexed.

I just realized that this won't tell you all the users and their total consecutive days. But will tell you which users will have been visiting a set number of days from a date of your choosing.

Revised solution:

declare @days as int
set @days = 30
select t1.userid
from UserHistory t1
where (select count(1)
from UserHistory t3
where t3.userid = t1.userid
and t3.creationdate >= DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate), 0)
and t3.creationdate < DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate) + @days, 0)
group by t3.userid
) >= @days
group by t1.userid

I've checked this and it will query for all users and all dates. It is based on Spencer's 1st (joke?) solution, but mine works.

Update: improved the date handling in the second solution.

Some nicely expressive SQL along the lines of:

select
userId,
dbo.MaxConsecutiveDates(CreationDate) as blah
from
dbo.Logins
group by
userId

Assuming you have a user defined aggregate function something along the lines of (beware this is buggy):

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Runtime.InteropServices;


namespace SqlServerProject1
{
[StructLayout(LayoutKind.Sequential)]
[Serializable]
internal struct MaxConsecutiveState
{
public int CurrentSequentialDays;
public int MaxSequentialDays;
public SqlDateTime LastDate;
}


[Serializable]
[SqlUserDefinedAggregate(
Format.Native,
IsInvariantToNulls = true, //optimizer property
IsInvariantToDuplicates = false, //optimizer property
IsInvariantToOrder = false) //optimizer property
]
[StructLayout(LayoutKind.Sequential)]
public class MaxConsecutiveDates
{
/// <summary>
/// The variable that holds the intermediate result of the concatenation
/// </summary>
private MaxConsecutiveState _intermediateResult;


/// <summary>
/// Initialize the internal data structures
/// </summary>
public void Init()
{
_intermediateResult = new MaxConsecutiveState { LastDate = SqlDateTime.MinValue, CurrentSequentialDays = 0, MaxSequentialDays = 0 };
}


/// <summary>
/// Accumulate the next value, not if the value is null
/// </summary>
/// <param name="value"></param>
public void Accumulate(SqlDateTime value)
{
if (value.IsNull)
{
return;
}
int sequentialDays = _intermediateResult.CurrentSequentialDays;
int maxSequentialDays = _intermediateResult.MaxSequentialDays;
DateTime currentDate = value.Value.Date;
if (currentDate.AddDays(-1).Equals(new DateTime(_intermediateResult.LastDate.TimeTicks)))
sequentialDays++;
else
{
maxSequentialDays = Math.Max(sequentialDays, maxSequentialDays);
sequentialDays = 1;
}
_intermediateResult = new MaxConsecutiveState
{
CurrentSequentialDays = sequentialDays,
LastDate = currentDate,
MaxSequentialDays = maxSequentialDays
};
}


/// <summary>
/// Merge the partially computed aggregate with this aggregate.
/// </summary>
/// <param name="other"></param>
public void Merge(MaxConsecutiveDates other)
{
// add stuff for two separate calculations
}


/// <summary>
/// Called at the end of aggregation, to return the results of the aggregation.
/// </summary>
/// <returns></returns>
public SqlInt32 Terminate()
{
int max = Math.Max((int) ((sbyte) _intermediateResult.CurrentSequentialDays), (sbyte) _intermediateResult.MaxSequentialDays);
return new SqlInt32(max);
}
}
}

A couple of SQL Server 2012 options (assuming N=100 below).

;WITH T(UserID, NRowsPrevious)
AS (SELECT UserID,
DATEDIFF(DAY,
LAG(CreationDate, 100)
OVER
(PARTITION BY UserID
ORDER BY CreationDate),
CreationDate)
FROM   UserHistory)
SELECT DISTINCT UserID
FROM   T
WHERE  NRowsPrevious = 100

Though with my sample data the following worked out more efficient

;WITH U
AS (SELECT DISTINCT UserId
FROM   UserHistory) /*Ideally replace with Users table*/
SELECT UserId
FROM   U
CROSS APPLY (SELECT TOP 1 *
FROM   (SELECT
DATEDIFF(DAY,
LAG(CreationDate, 100)
OVER
(ORDER BY CreationDate),
CreationDate)
FROM   UserHistory UH
WHERE  U.UserId = UH.UserID) T(NRowsPrevious)
WHERE  NRowsPrevious = 100) O

Both rely on the constraint stated in the question that there is at most one record per day per user.