Survival models describe how much time it takes for some event to occur. This is a natural way to think about many applications, but setting up the data can be tricky. In this article, we use Python to turn an event log into a duration table, which is the input format for many survival analysis tools.
Categories: code, python, survival analysis
Author: Brian Kent
Published: June 7, 2021
Tip: See also our article on creating duration tables from event logs in SQL.
In many applications, it is most natural to think about how much time it takes for some outcome to occur. In medicine, for example, we might ask whether a given treatment increases patients’ life span. For predictive maintenance, we want to know which pieces of hardware have the shortest time until failure. For customer care, how long until support tickets are resolved?
Survival analysis techniques directly address these kinds of questions, but my sense is that survival modeling is relatively rare in industry data science. One reason might be that setting up the data for survival modeling can be tricky, but most survival analysis literature skips the data prep part. I’ll try to fill the gap by showing how to create a duration table from an event log, using the example of web browser events.
What is an event log?
Event logs are a type of transactional fact table. They have a long-form schema where each entry represents an event, defined by the unit (e.g. user), timestamp, and type of event.
Mixpanel’s documentation shows an example log of user actions in a web browser. The key columns are USERID (the unit of interest), TIMESTAMP, and EVENT.
For event logs, the event is a categorical variable. Contrast this with sensor output, which is typically numeric.
What’s a duration table?
A duration table is a wide-form table that shows for each unit whether the event of interest occurred, and if so, how long it took to happen. It is a common input data format for survival analysis software.
A side note: the term duration table is based on the Python Lifelines package’s term duration matrix.[1] Most of the literature I have read assumes we have this kind of data in hand and doesn’t bother to give it an explicit name.[2]
With the Mixpanel example, suppose we define conversion to be the Checkout event. Then the first part of the duration table for that event log would be:
| UserID | Entry time | Conversion time | Conversion event |
| --- | --- | --- | --- |
| qrt-345 | 2016-12-15T08:02 | NA | NA |
| pnm-321 | 2016-12-15T08:07 | NA | NA |
| Janet | 2016-12-15T08:10 | 2016-12-15T08:15 | Checkout |
| Vijay | 2016-12-15T08:10 | NA | NA |
The first two columns are straightforward. Each row corresponds to a user, indicated by the user ID column. Entry time is the first recorded timestamp for each user.
The missing values in the Conversion time and Conversion event columns are where things get interesting. Here’s the core insight of survival analysis: those users may still convert in the future; we only know that they had not yet converted when the event log was closed. This is called censoring and it’s something that survival models handle gracefully, but we still need to represent it appropriately in our duration table.
Suppose the Mixpanel log was closed at 2016-12-15T08:20. The final duration table would be:
| UserID | Entry time | Conversion time | Conversion event | Final observation time | Duration |
| --- | --- | --- | --- | --- | --- |
| qrt-345 | 2016-12-15T08:02 | NA | NA | 2016-12-15T08:20 | 18 minutes |
| pnm-321 | 2016-12-15T08:07 | NA | NA | 2016-12-15T08:20 | 13 minutes |
| Janet | 2016-12-15T08:10 | 2016-12-15T08:15 | Checkout | 2016-12-15T08:15 | 5 minutes |
| Vijay | 2016-12-15T08:10 | NA | NA | 2016-12-15T08:20 | 10 minutes |
The final observation time is the conversion time if it’s known, or the timestamp of the last moment of observation (i.e. when the log was closed). The duration is the time delta between the entry time and the final observation time. It will be our target variable in downstream modeling and is the field that is right-censored.[3]
The Metamorphosis
Input data
We’ll stick with the web browser example and use data from e-commerce site Retail Rocket, available on Kaggle. The dataset has 2.8 million rows and occupies about 250MB in memory.
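We load the log with Pandas and convert the raw millisecond timestamps into proper datetimes:

```python
import pandas as pd

event_log = pd.read_csv("retailrocket_events.csv", dtype={"transactionid": "Int64"})
event_log.index.name = "row"
event_log["timestamp"] = pd.to_datetime(event_log["timestamp"], unit="ms")
event_log.head()
```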
Schema
The first thing we need to do is identify the schema of our event log. For unit of interest, let’s go with visitor (visitorid), so we can answer the questions: What fraction of users make a purchase, and how long does it take them to do so? Item would also be an interesting unit of observation, and in real web browser data, we might also be interested in sessions.
There are only three kinds of events: view, addtocart, and transaction:
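```python
event_log["event"].value_counts()
```

The sequence of events for a visitor who completes a transaction looks like this:

```python
event_log.query("visitorid==1050575").sort_values("timestamp")
```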
The visitor views a bunch of items, eventually adds a few to their cart, then purchases the items in the cart in a transaction. Let’s define the transaction event to be our endpoint of interest. Endpoints are the events corresponding to outcomes we’re interested in; in clinical research, this might be death, disease recurrence, or recovery. In predictive maintenance, it could be hardware failure. In customer lifetime value, it would be reaching a revenue threshold. In principle, we could have multiple endpoints, so we specify this in a list.
Our schema is:
```python
unit = "visitorid"
timestamp = "timestamp"
event = "event"
endpoints = ["transaction"]
```
Find the entry time for each unit
The DataFrame groupby method is our workhorse because we need to treat each visitor separately. The simplest way to define entry time for each visitor is to take the earliest logged timestamp:
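```python
grp = event_log.groupby(unit)
durations = pd.DataFrame(grp[timestamp].min())
durations.rename(columns={timestamp: "entry_time"}, inplace=True)
durations.head()
```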
Find each unit's endpoint time, if it exists
Finding the endpoint time for each visitor is trickier because some visitors have not yet completed a transaction but might in the future. Furthermore, some visitors have completed multiple transactions, in which case we want the timestamp of the earliest endpoint event.
We first filter the raw data down to just the endpoint events:
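```python
endpoint_events = event_log.loc[event_log[event].isin(endpoints)]
```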
then find the index (in the original DataFrame) of the earliest endpoint event for each visitor and extract just those rows. Remember that only the visitors who have an observed endpoint are represented in the endpoint_events DataFrame.
```python
grp = endpoint_events.groupby(unit)

# These are index labels in the original DataFrame; with the default RangeIndex
# they match positions, so .iloc retrieves the right rows
idx_endpoint_events = grp[timestamp].idxmin()

endpoint_events = event_log.iloc[idx_endpoint_events].set_index(unit)
print(endpoint_events.head())
```
Because Pandas DataFrames have well-crafted indexing, we can merge the endpoints and endpoint times back into our primary output table with simple column assignments. If you’re translating this to another language, be very careful about indexing here—remember that these columns are not defined for every user.
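```python
# Pandas aligns on the index (visitorid); visitors with no endpoint get NaN
durations["endpoint"] = endpoint_events[event]
durations["endpoint_time"] = endpoint_events[timestamp]
durations.head()
```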
Compute the target variable
Our target variable is the time it took for each unit to reach an endpoint or the censoring time, which is the end of our observation period. In the Mixpanel example above we chose an arbitrary timestamp, but in practice it’s usually easier to choose the current time (e.g. import datetime as dt; dt.datetime.now()) or the latest timestamp in the log.
We don’t want to lose track of which users have an observed endpoint, so we’ll create another column that defaults to the censoring time unless a visitor has an endpoint event.
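```python
# Default to the censoring time; visitors with an observed endpoint keep their endpoint time
censoring_time = event_log["timestamp"].max()
durations["final_obs_time"] = durations["endpoint_time"].fillna(censoring_time)
```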
The last step is to compute the elapsed time between each visitor’s entry into the system and their final observed time. Once again Pandas makes this easy, returning a TimeDelta-type column by subtracting two timestamps.
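```python
durations["duration"] = durations["final_obs_time"] - durations["entry_time"]
durations.head()
```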
Each row of the durations DataFrame represents a visitor. Each unit has an entry time, a final observation time, and the duration between those two timestamps. We know whether or not a unit has reached an endpoint based on the endpoint and endpoint_time columns.
Sanity checks
We can sanity check the output using a few simple properties that should always be true, regardless of the input data. To start, we know that there should be one row in the output duration table for every unique visitorid in the event log.
```python
assert len(durations) == event_log[unit].nunique()
```
We can bound the duration variable by virtue of the way we constructed it: using the earliest observation in the data for each visitor’s entry time and the latest timestamp overall as the censoring time.
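```python
assert durations["duration"].max() <= (
    event_log[timestamp].max() - event_log[timestamp].min()
)
```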
And finally, we know that the number of visitors who have converted so far must be no bigger than the total number of endpoint events in the input log.
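```python
assert (
    durations["endpoint_time"].notnull().sum()
    <= event_log[event].isin(endpoints).sum()
)
```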
Next steps
Now we should be able to use any survival analysis software, with minimal additional tweaks.
Example 1
Let’s fit a cumulative hazard function to our data with the Lifelines package, probably the most popular Python survival analysis package. To keep things simple, we first convert the duration column from TimeDelta format to floating-point days. The notnull method of a Pandas series gives us a boolean flag that indicates whether each unit has an observed endpoint or a censored duration.
```python
durations["endpoint_observed"] = durations["endpoint"].notnull()
durations["duration_days"] = (
    durations["duration"].dt.total_seconds() / (60 * 60 * 24)  # denom is number of seconds in a day
)

import lifelines

model = lifelines.NelsonAalenFitter()
model.fit(durations["duration_days"], durations["endpoint_observed"])
model.plot()
```
(Plot: the fitted Nelson-Aalen cumulative hazard estimate, over the timeline in days.)
Example 2
Or, we could use the Scikit-survival package to fit a nonparametric survival curve.
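```python
from sksurv.util import Surv
from sksurv.nonparametric import SurvivalFunctionEstimator

# Surv.from_dataframe takes the event indicator column first, then the duration column
target = Surv.from_dataframe("endpoint_observed", "duration_days", durations)

model = SurvivalFunctionEstimator()
model.fit(target)
model.predict_proba([0, 20, 40, 60, 80, 100, 120])
```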
Let’s plot the result, to confirm that something meaningful has happened. For our demo use case, it’s more natural to think of the conversion rate, which is just 1 minus the survival function.[4]
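```python
import numpy as np
import plotly.express as px

time_grid = np.linspace(0, 120, 1000)
proba_survival = model.predict_proba(time_grid)
conversion_rate = 1 - proba_survival

fig = px.line(x=time_grid, y=conversion_rate * 100, template="plotly_white")
fig.update_traces(line=dict(width=7))
fig.update_layout(xaxis_title="Days elapsed", yaxis_title="Conversion rate (%)")
fig.show()
```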
In upcoming articles, I’ll show how to convert other types of raw data into duration tables and how to interpret survival analysis results. Stay tuned! In the meantime, please leave a comment or question below. You can also contact me privately via the Crosstab Kite contact form.
Footnotes
[1] The Lifelines API documentation says that a duration matrix need not include all units, presumably omitting those with no observed endpoint. My concept of a duration table is different in that it should contain one row for each unit.
[2] An exhaustive literature search is beyond the scope of this article, but here are some examples of works that avoid naming the input to survival models.

- The Wikipedia entry for survival analysis shows an example of a duration table but only refers to it generically as a “survival data set”.
- The R package survival uses the class Surv to define a “survival object” target variable, with both the duration and censoring indicator. The survfit.formula function has a data parameter but defines it only by saying it’s a data frame that includes the variables named in the Surv object.
- Frank Harrell’s Regression Modeling Strategies does not appear to use an explicit term for the duration table. The survival analysis examples use datasets already in duration table format, named for their subject matter (e.g. prostate, p. 521).
- Rich, et al.’s A practical guide to understanding Kaplan-Meier curves explicitly describes survival inputs as a table but never gives that table a name:

  > The first step in preparation for [Kaplan-Meier] analysis involves the construction of a table…containing the three key elements required for input…The table is then sorted…Once this initial table is constructed…

- Even the Lifelines package, from which I get the term duration matrix, is not consistent. It also uses the terms survival events and survival data for this concept.

Naming is hard and the field of survival analysis seems especially bad at it. I have more to say about this in a future post, but in the meantime, if you know of an existing, widespread, standard name for what I call duration table, please let me know.
[3] Conversion data can also be left- or interval-censored. This article is only about right-censored data.
[4] Technically, this is the cumulative incidence function.