Survival modeling tools often require target variables to be specified in duration table format. A duration table lists each unit of a study (e.g. a website user), whether or not they experienced some outcome of interest (e.g. a purchase), and how long it took for that outcome to happen, if known.
Although duration tables are relatively straightforward conceptually, they are a bit tricky to construct from raw data. In my last article, I showed how to build a duration table in Python from an event log, using the example of web browser activity. Check out the first half of that article for a more detailed explanation of duration tables and event logs.
The thing is, event logs are usually stored in a database or data warehouse that we query with SQL, not Python. So in this article, I show how to build a duration table from an event log in SQL.1
The event log
Event logs are a type of fact table. Each row represents an event, defined by the unit (e.g. user), timestamp, and type of event. Our demo data is from e-commerce site Retail Rocket, downloaded from Kaggle. For this demo, suppose we’ve already copied the data into a table called event_log
.
SELECT * FROM event_log LIMIT 5;
visitorid | event_type | itemid | transactionid | event_at
-----------+------------+--------+---------------+----------------------------
257597 | view | 355908 | | 2015-06-01 22:02:12.117-07
992329 | view | 248676 | | 2015-06-01 22:50:14.164-07
111016 | view | 318965 | | 2015-06-01 22:13:19.827-07
483717 | view | 253185 | | 2015-06-01 22:12:35.914-07
951259 | view | 367447 | | 2015-06-01 22:02:17.106-07
The fields we’re interested in are:
visitorid
: ID of a Retail Rocket user, presumably. Event logs are long-form tables, with records from all of the units.event_type
event_at
Most rows in the log are view
events. There are alsoaddtocart
and transaction
(i.e. purchase) events, but they are much rarer. Most visitors in this dataset do not have any transactions at all.
SELECT
event_type,count(1) as num_observations
FROM event_log
GROUP BY 1
ORDER BY 2 DESC;
event_type | num_observations
-------------+------------------
view | 2664312
addtocart | 69332
transaction | 22457
The sequence of events for a visitor who does complete a transaction looks like this:
SELECT
*
FROM event_log
WHERE visitorid='1050575'
ORDER BY event_at ASC;
visitorid | event_type | itemid | transactionid | event_at
-----------+-------------+--------+---------------+----------------------------
1050575 | view | 116493 | | 2015-07-31 10:27:21.908-07
1050575 | view | 31640 | | 2015-07-31 14:08:45.248-07
1050575 | view | 273877 | | 2015-07-31 14:09:21.716-07
1050575 | view | 31640 | | 2015-07-31 14:10:18.947-07
1050575 | addtocart | 31640 | | 2015-07-31 14:11:12.772-07
1050575 | transaction | 31640 | 8354 | 2015-07-31 14:12:56.57-07
The full query and result
To give you a sense of what the final duration table looks like, here’s the full query and the first five rows of the output. In the next section, we’ll break it down and walk through each part in sequence.
-- Find the entry time for each unit.
WITH entry_times AS (
SELECT
visitorid,min(event_at) AS event_at
FROM event_log
GROUP BY 1
),
-- Get the earliest endpoint event for units that have an endpoint.
AS (
endpoint_events SELECT *
FROM event_log
WHERE event_type IN ('transaction')
),
AS (
first_endpoint_events SELECT
*
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY visitorid ORDER BY event_at ASC) AS row_num
FROM endpoint_events
AS _
) WHERE row_num = 1
),
-- Define the censoring time to be the latest timestamp in the whole event log.
AS (
censoring SELECT max(event_at) AS event_at FROM event_log
)
-- Put all the pieces together as a *duration table*.
SELECT
entry_times.visitorid,as entry_at,
entry_times.event_at AS endpoint_type,
endpt.event_type AS endpoint_at,
endpt.event_at COALESCE(endpt.event_at, censoring.event_at) as final_obs_at,
COALESCE(endpt.event_at, censoring.event_at) - entry_times.event_at as duration
FROM censoring, entry_times
LEFT JOIN first_endpoint_events AS endpt
USING(visitorid)
visitorid | entry_at | endpoint_type | endpoint_at | final_obs_at | duration
-----------+----------------------------+---------------+-------------+----------------------------+----------------------
0 | 2015-09-11 13:49:49.439-07 | | | 2015-09-17 19:59:47.788-07 | 6 days 06:09:58.349
1 | 2015-08-13 10:46:06.444-07 | | | 2015-09-17 19:59:47.788-07 | 35 days 09:13:41.344
2 | 2015-08-07 10:51:44.567-07 | | | 2015-09-17 19:59:47.788-07 | 41 days 09:08:03.221
3 | 2015-08-01 00:10:35.296-07 | | | 2015-09-17 19:59:47.788-07 | 47 days 19:49:12.492
4 | 2015-09-15 14:24:27.167-07 | | | 2015-09-17 19:59:47.788-07 | 2 days 05:35:20.621
Breaking it down
Let’s look at each of the Common Table Expressions (CTEs) as though they were standalone queries, then we’ll parse the final statement that pulls everything together.
Find the entry time for each unit
The first thing we need to do is to find when each visitor first entered the system. The simplest way is to use the earliest logged timestamp. As with the Python version of this article, the GROUP BY
functionality is our workhorse because we need to operate on each unit separately.
SELECT
visitorid,min(event_at) AS event_at
FROM event_log
GROUP BY 1
visitorid | event_at
-----------+----------------------------
0 | 2015-09-11 13:49:49.439-07
1 | 2015-08-13 10:46:06.444-07
2 | 2015-08-07 10:51:44.567-07
3 | 2015-08-01 00:10:35.296-07
4 | 2015-09-15 14:24:27.167-07
Find each unit’s endpoint time, if it exists
Finding the endpoint time for each visitor is trickier. First things first, we define our endpoint of interest to be the transaction
event. Some visitors don’t have an observed transaction, though, while others have multiple transactions.
We first filter the raw data down to just the endpoint events, i.e. transactions, then use a window function to take the earliest endpoint event for each visitor. Keep in mind that only visitors who have an observed endpoint are represented in this result; we’ll need that fact in two more paragraphs.
WITH endpoint_events AS (
SELECT *
FROM event_log
WHERE event_type IN ('transaction')
)
SELECT
*
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY visitorid ORDER BY event_at ASC) AS row_num
FROM endpoint_events
AS _
) WHERE row_num = 1
visitorid | event_type | itemid | transactionid | event_at | row_num
-----------+-------------+--------+---------------+----------------------------+---------
172 | transaction | 465522 | 9725 | 2015-08-14 18:29:01.23-07 | 1
186 | transaction | 49029 | 8726 | 2015-08-12 09:34:57.04-07 | 1
264 | transaction | 459835 | 8445 | 2015-09-07 10:34:45.614-07 | 1
419 | transaction | 19278 | 16455 | 2015-07-28 22:03:12.695-07 | 1
539 | transaction | 94371 | 14778 | 2015-06-15 22:39:38.673-07 | 1
Define the censoring time
Our target variable is the time it took for each visitor to make a transaction or reach the censoring time, which is the end of our observation period. In practice, it’s usually best to choose either the current time or the latest timestamp in the event log.
SELECT max(event_at) AS event_at FROM event_log
event_at
----------------------------
2015-09-17 19:59:47.788-07
Putting it all together
Last but not least, we pull all the pieces together (this snippet doesn’t run on its own, it requires the CTEs defined in the full query listed above).
SELECT
entry_times.visitorid,as entry_at,
entry_times.event_at AS endpoint_type,
endpt.event_type AS endpoint_at,
endpt.event_at COALESCE(endpt.event_at, censoring.event_at) as final_obs_at,
COALESCE(endpt.event_at, censoring.event_at) - entry_times.event_at as duration
FROM censoring, entry_times
LEFT JOIN first_endpoint_events AS endpt
USING(visitorid)
There’s a lot going on here:
We start with the
entry_times
CTE because we know it includes all of the visitors, one row for each. This is how we want the duration table to be structured.The
censoring
CTE does not need to be joined because it’s a single value. It is automatically broadcast to every row.We use a
LEFT JOIN
to bring in fields from thefirst_endpoint_events
CTE becausefirst_endpoint_events
only has rows for visitors who had an observed transaction event.The final observation timestamp (
final_obs_at
) is the time of the first transaction if a visitor has one, or it defaults to the censoring time.Lastly, the
duration
is the time interval between each visitor’s entry time and the final observation time. For PostgreSQL, this is easily computed with the minus operator.
Wrapping up
Now we have a complete duration table, with one row for each unit. Each row has an entry time, a final observation time, and the duration between those two timestamps. We know whether or not a unit has an observed endpoint based on the dpoint_type
and endpoint_time
columns; missing values indicate the unit has not experienced an endpoint.
With this table, we should be able to use most survival analysis software with only minimal additional tweaks. Switch back to the Python version of this article to see how to use this table with the Scikit-survival and Lifelines packages.
References
- Listing image by Aron Visuals on Unsplash.
Footnotes
Specifically, I’m using PostgreSQL 12.7 on Ubuntu 20.04. I follow GitLab’s SQL style guide as much as possible.↩︎