Survival analysis should be a standard part of every data scientist’s tool belt. Unless you work in clinical research, though, there’s a good chance it’s not part of yours.[1] That’s a shame because survival analysis is super interesting and powerful.
One reason why survival analysis isn’t more popular could be that data scientists just don’t realize it’s a good fit for their projects, so in this article let’s talk about applications. Specifically:
- When and why you should consider survival analysis for a project
- What you get out of a survival analysis
- Examples of survival analysis applications other than clinical research
When should I use survival analysis?
Kleinbaum and Klein’s introductory textbook defines survival analysis (page 4) as
…a collection of statistical procedures for data analysis for which the outcome variable of interest is time until an event occurs.
Wikipedia’s definition is similar and lists biological death and mechanical failure as example target events.
These definitions miss the point. Here’s the main question to ask when considering survival analysis:
Do we need to act before we observe all of the data?
If yes, use survival analysis.
When we can’t afford to wait for all the data to roll in, some of our data will be censored. We have to close our observation window at some point to finalize our dataset and make decisions, but at that point, some subjects will not yet have experienced the target event. For these subjects, we know only that the time-to-event is at least some duration.
For example, suppose we make kitchen appliances and we need to decide whether a new manufacturing process decreases the lifespan of our toasters. We need to make the decision in a few months at most, but toasters wear out over the course of years.[2] We can’t just ignore the toasters that haven’t broken yet; that would lead us to underestimate toaster lifespan. So how do we incorporate that data? Survival analysis!
You, keen reader, probably noticed that I didn’t mention time-to-event in the survival analysis criterion above. Usually we do care about durations, it’s true, but not always.
Let’s stick with the toaster example. We might say (reasonably) that we don’t care how long toasters survive, we just want to make sure they last at least three years. Even at this threshold, we still have a censoring problem, because we need to make decisions with only a few months of data. Survival models take this censored data into account.
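To make the toaster example concrete, here is a minimal sketch of the Kaplan-Meier estimator, the standard nonparametric estimate of a survival curve, in plain Python. The toaster lifetimes and the six-month observation window are invented for illustration, and `kaplan_meier` is a hypothetical helper written for this sketch, not a library function.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate: list of (time, survival probability) steps.

    durations: how long each subject was followed.
    observed:  True if the event happened at that time, False if censored.
    """
    exits = Counter(durations)  # subjects leaving the risk set at each time
    events = Counter(t for t, e in zip(durations, observed) if e)
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if events[t]:
            # Multiply by the fraction of at-risk subjects that survive time t.
            survival *= 1 - events[t] / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # failures and censored subjects both leave
    return curve

# Hypothetical data: 10 toasters followed for at most 6 months.
months = [2, 3, 4, 6, 6, 6, 6, 6, 6, 6]
failed = [True, True, True, False, False, False, False, False, False, False]

# Naive approach: average only the toasters that already broke.
naive_mean = sum(m for m, f in zip(months, failed) if f) / sum(failed)

curve = kaplan_meier(months, failed)
print(f"naive mean lifespan: {naive_mean:.1f} months")
for t, s in curve:
    print(f"estimated P(lifespan > {t} months) = {s:.2f}")
```

With these made-up numbers, the naive average says toasters last about 3 months, while the Kaplan-Meier curve says roughly 70% of toasters are still working after 4 months. The seven censored toasters stay in the denominator for as long as we observed them, which is what corrects the underestimate.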
Survival analysis isn’t magic; it only works if we have enough non-censored observations. We can’t stop our toaster manufacturing study after just a few weeks if none of the toasters have failed yet.
Non-considerations
The terminology of survival analysis (survival, hazard, failure, etc.) implies the target event must be a bad thing to be delayed as long as possible. This is emphatically not the case; in many applications, the outcome event is a good thing and we want it to happen as soon as possible. Lots of examples below.
What do I get out of a survival analysis?
Survival models accomplish the same things as other supervised models: prediction and causal inference, including experiment analysis. Survival models tend to output entire distributions and let the data scientist decide how to summarize them, whereas “mainstream” regression methods tend to assume the regression function is the expectation of the target variable given a feature vector.
Here are the three uses of survival analysis that I know of (please comment if you know of more):
Quantify and visualize the distribution of durations for intuitive understanding. The shape of a survival (or conversion) curve can help to set expectations, baselines, and time thresholds for qualitative program analysis or other quantitative modeling studies.
Moving away from the toaster example (finally!), let’s say we want to help our Customer Care team to improve their performance. One aspect of this is the time it takes to resolve support tickets, so we compute and plot the Kaplan-Meier survival curve for these durations.
If the plot looks like curve A below, the support team is doing a relatively good job, but I would ask why a small fraction of tickets stays open much longer than the rest. If the plot looks like curve C, the team isn’t doing a great job; most tickets are open for 5 weeks. However, most tickets do get resolved quickly after 5 weeks, so there could be a systematic cause for the delay that we can fix.
Predict the survival curve for individual subjects. Survival curve predictions can be used to prioritize interventions, even if we don’t understand the underlying causes.
As with any regression, we often summarize prediction distributions with a single number, especially to order subjects. For survival analysis, median survival time has historically been more popular than mean survival time, but that doesn’t mean it’s better. For pairwise comparisons, a popular alternative is restricted mean survival time (RMST).
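For reference, RMST is the area under the survival curve up to a chosen time horizon. Here is a minimal sketch, assuming the curve is a step function stored as (time, survival) pairs. The function name, the two curves, and the 3-week horizon are all made up for illustration.

```python
def restricted_mean_survival_time(curve, tau):
    """Area under a step survival curve from time 0 to tau.

    curve: list of (time, survival) pairs sorted by time. Survival is 1.0
           before the first time and constant between steps.
    """
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break  # steps at or after the horizon do not add area
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)

# Two hypothetical predicted curves, with time measured in weeks.
curve_a = [(1, 0.5), (2, 0.25)]
curve_b = [(2, 0.5)]

rmst_a = restricted_mean_survival_time(curve_a, tau=3)
rmst_b = restricted_mean_survival_time(curve_b, tau=3)
print(rmst_a, rmst_b, rmst_b - rmst_a)
```

Here subject B is expected to survive 0.75 weeks longer than subject A within the first 3 weeks (2.5 vs. 1.75). Unlike the mean, RMST is always finite because the horizon cuts off the tail of the curve.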
There are two wrinkles with survival prediction to keep in mind. First, the predicted median survival time may be infinite (or undefined). This happens when the predicted survival probability doesn’t drop to 50% even at the longest observed duration.
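A small sketch of the first wrinkle, assuming a predicted survival curve stored as (time, survival) steps. The helper name and both curves are hypothetical.

```python
import math

def median_survival_time(curve):
    """First time at which survival is 0.5 or lower, else infinity.

    curve: list of (time, survival) pairs sorted by time.
    """
    for t, s in curve:
        if s <= 0.5:
            return t
    return math.inf

# Survival drops below 50% at time 2, so the median is well defined.
median_fast = median_survival_time([(1, 0.8), (2, 0.4)])

# Survival never reaches 50% within the observed range.
median_slow = median_survival_time([(2, 0.9), (3, 0.8), (4, 0.7)])

print(median_fast, median_slow)
```

An infinite median can't be used to rank subjects against each other, which is one practical reason to consider a summary like RMST instead.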
Second, we have to specify whether the prediction subject is new to the system, i.e. their duration starts at 0, or they are an existing subject, censored in the training set. In the latter case, we want to predict time remaining until the target event, not total time.
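For an existing subject, the prediction we want is the conditional probability P(T > t0 + s | T > t0) = S(t0 + s) / S(t0), where t0 is the time already survived. Here is a minimal sketch, assuming a step-function survival curve. The helper names and the curve are invented for illustration.

```python
def survival_at(curve, t):
    """Value of a step survival curve at time t (1.0 before the first step)."""
    s = 1.0
    for step_time, step_survival in curve:
        if step_time > t:
            break
        s = step_survival
    return s

def remaining_survival(curve, elapsed, extra):
    """P(T > elapsed + extra | T > elapsed) for a subject still at risk."""
    s_elapsed = survival_at(curve, elapsed)
    if s_elapsed == 0:
        raise ValueError("survival at the elapsed time is zero")
    return survival_at(curve, elapsed + extra) / s_elapsed

curve = [(1, 0.8), (2, 0.6), (3, 0.3)]  # hypothetical predicted curve

# New subject: chance of lasting more than 2 time units from the start.
new_subject = survival_at(curve, 2)

# Existing subject who has already lasted 1 unit: chance of 2 more units.
existing_subject = remaining_survival(curve, elapsed=1, extra=2)

print(new_subject, existing_subject)
```

The two answers differ (0.6 vs. 0.375) even though both ask about "two more time units", because the existing subject has already made it past the first one.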
Estimate causal relationships to optimize duration. This bucket includes both field experiment analysis and observational causal inference. Don’t pay any heed to statistics texts that claim we just want to estimate associations; when the goal is to optimize time-to-event (in either direction), we need to know what levers we can pull and what their effects will be.[3]
Examples of survival analysis applications (that aren’t clinical research)
I think all data scientists should know at least a bit about survival analysis, but if your work touches any of the following application areas, you really should consider adding survival analysis to your professional tool belt.
To be transparent, I don’t have practical experience in all these areas. Some are applications I’ve read about, others are ones where I can see a good fit, but don’t know of any actual attempts to apply survival analysis.
1. Hardware failure
This is a natural place to start because hardware failure is directly analogous to clinical research. In this case, we’re interested in predicting and preventing mechanical equipment failure instead of human medical outcomes.
2. Customer analytics
Understanding customer behavior is essential for B2C businesses. In this application, the subjects are individual customers and there are both good and bad events.
The good outcomes—which we want to accelerate—include marketing and sales conversions, especially for long transactions like making travel arrangements or applying for a mortgage. They also include up-selling to premium tiers and reaching critical levels of engagement.
The classic example of a bad event (from the company’s point of view) that we want to delay is customer churn. A recent Netflix paper describes this use case.
3. Product analytics: time to adoption
Suppose we want to understand how long it takes for customers to upgrade to the latest version of some product, like a mobile phone. We might use the survival curve to make operational plans or detect when adoption is slower or faster than expected. Causal survival models might help us to accelerate product adoption.
From a modeling perspective, this is the same as the customer analytics use cases, but here the target event is about the product instead of revenue.
4. Unit economics: time to break-even revenue
This one falls under customer analytics, but it’s more general. The target event here is earning revenue equal to cost. For example, in marketing, how long does it take for a new customer to generate revenue equal to their acquisition cost? For capital assets (let’s say rental cars, to be concrete), how long until each unit generates revenue equal to its cost? More usefully, how can we reduce break-even time to the point where we feel comfortable scaling the business?
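Setting this up as a survival problem mostly means turning each revenue history into a duration plus an "event observed" flag. Here is a minimal sketch, assuming monthly revenue totals per unit. The function name and all of the numbers are hypothetical.

```python
def break_even_observation(monthly_revenue, cost):
    """Return (duration_in_months, observed) for one customer or asset.

    observed is True if cumulative revenue reached the cost within the data,
    and False if the unit is still below break-even (a censored duration).
    """
    total = 0.0
    for month, revenue in enumerate(monthly_revenue, start=1):
        total += revenue
        if total >= cost:
            return month, True
    return len(monthly_revenue), False

# A rental car that paid for itself in its third month.
paid_off = break_even_observation([20, 30, 60], cost=100)

# A newer car with only two months of history: censored at 2 months.
still_waiting = break_even_observation([10, 10], cost=100)

print(paid_off, still_waiting)
```

The resulting (duration, observed) pairs are the standard input for a survival model. The second car is not discarded; it contributes the information that break-even takes longer than 2 months.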
5. Human Resources
Opportunities abound for survival analysis in HR. Modeling employee tenure is a good place to start; the People team needs to decide which policies work to boost morale or retain talent, but employee resignations (typically) happen over durations of years, so most subjects have censored durations.
Other interesting things to model in HR are the time it takes to fill open positions and the time it takes for promotions.
6. Engineering and support tickets
Engineering and Customer Care teams often track work with ticket systems. Managers sometimes track the time it takes to resolve tickets and pressure their teams to reduce that time (hopefully as one objective of several). Survival analysis can be a great fit for this task if some tickets stay open long enough to cause data censoring.
Ticket durations are a good use case in particular for competing risks models, where there are multiple outcome types. For example, we might be interested in the time until a successful support ticket resolution, but a ticket might not reach that outcome either because it isn’t closed yet (ordinary censoring) or because the customer never responded to follow-up questions (a competing outcome).
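One common nonparametric tool for competing risks is the cumulative incidence function, in the style of the Aalen-Johansen estimator. Here is a minimal sketch under the assumption that each ticket ends in exactly one labeled outcome or is still open. The function name, the outcome labels, and the data are invented for illustration.

```python
from collections import Counter

def cumulative_incidence(durations, outcomes, target):
    """Estimated P(target outcome has happened by time t), as (t, p) steps.

    outcomes: None for censored subjects, otherwise an outcome label.
    """
    pairs = list(zip(durations, outcomes))
    exits = Counter(durations)
    any_event = Counter(t for t, o in pairs if o is not None)
    target_event = Counter(t for t, o in pairs if o == target)
    at_risk = len(durations)
    event_free = 1.0  # P(no outcome of any type yet), just before time t
    incidence = 0.0
    curve = []
    for t in sorted(exits):
        if target_event[t]:
            incidence += event_free * target_event[t] / at_risk
            curve.append((t, incidence))
        if any_event[t]:
            event_free *= 1 - any_event[t] / at_risk
        at_risk -= exits[t]
    return curve

# Hypothetical tickets: days open and how each one ended (None = still open).
days = [1, 2, 2, 3, 4, 5]
ended = ["resolved", "abandoned", "resolved", None, "resolved", None]

resolved_curve = cumulative_incidence(days, ended, target="resolved")
for t, p in resolved_curve:
    print(f"P(resolved by day {t}) = {p:.3f}")
```

In this toy data the estimated probability of a successful resolution by day 4 is about 0.58. Abandoned tickets are treated as a real outcome that removes the chance of resolution, not as tickets that might still be resolved later.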
This area is also fertile ground for cure models where some fraction of the subjects never experience the target event. In engineering, tickets dumped in the P4 or wontfix buckets stay open but will never be addressed. The Convoys Python package uses an example of Manhattan building code complaints to illustrate this use case.
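To show the shape a cure model implies, here is a minimal sketch of a mixture cure survival function with an exponential time-to-event for the subjects that do eventually experience the event. This is an illustrative assumption, not the Convoys API, and the parameter values are made up.

```python
import math

def cure_model_survival(t, event_fraction, rate):
    """P(event has not happened by time t) under a simple mixture cure model.

    event_fraction: share of subjects that will eventually have the event.
    rate: exponential event rate among those subjects.
    """
    return (1 - event_fraction) + event_fraction * math.exp(-rate * t)

# Hypothetical: 80% of tickets eventually get addressed, at rate 0.5 per week.
for weeks in (0, 1, 4, 52):
    s = cure_model_survival(weeks, event_fraction=0.8, rate=0.5)
    print(f"week {weeks}: P(still open) = {s:.3f}")
```

Instead of decaying to zero, the curve levels off at 1 minus the event fraction (0.2 here), which corresponds to the tickets that will never be addressed.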
7. Loan repayment
Here’s the point in the list where I start to feel the ice getting thinner under my skates. This use case and the next one seem like good candidates for survival analysis, but I don’t know of any specific examples in practice. Please leave a comment below or drop me a note if you’ve worked with survival analysis in these areas.
Lenders are certainly interested in loan repayment, although I’ve always seen this modeled as a binary outcome: the client either repays the loan in full or they default. To me, it would seem more useful to model the duration of time to full repayment. Or, even better, the amount repaid over time. This kind of model would tell us not just the probability of repayment, but also how much we would be repaid and by when.
8. Inventory management
Forecasting when product supply is going to run out can have a lot of business value, but this is an application where we need to keep in mind the primary survival analysis question about durations vs. decision-making cadence. Survival analysis is only a good fit for products that sell slowly, over the course of months.
Houses are a good example; we might want to track time-to-sale on a monthly basis, but houses often take more than a month to sell.
Final thoughts
Despite its narrowly focused name, survival analysis is a general and powerful framework for modeling outcomes that occur more slowly than our decision-making cadence. If it’s not already in your data science tool belt, consider adding it.
I’m sure I’ve missed lots of good applications in this back-of-the-envelope list. Please leave a comment below if you’ve applied survival analysis in other domains.
Footnotes
[1] The claim that survival analysis is rare outside of clinical research is a personal impression, not an empirical fact. Here’s one bit of evidence in support, though: the 2021 AAAI Symposium on Survival Prediction accepted five applications papers (of 22 total), all of which were about human health. Only one paper in the entire symposium even mentioned an application other than human health. Here’s another anecdote: Kleinbaum and Klein’s intro textbook uses five example datasets to illustrate concepts; all but one are about human health.
[2] Another way to make decisions quickly is to simulate the process in a lab. Consumer Reports, for example, touts its mattress buying guide by saying
We subject each mattress to a battery of tests, including running a nearly 310-pound roller over each one 30,000 times to simulate eight to 10 years of use.
Uh huh. Better than nothing, I guess, but hardly generalizable to real-world performance.
[3] Traditional statistics discussions of survival analysis goals are frustratingly milquetoast. Kleinbaum and Klein (2012, page 16), for example, say there are three goals of survival analysis:
Goal 1: To estimate and interpret survivor and/or hazard functions…
Well, yeah, of course the goal is to estimate the curves—that’s the definition of the thing. Interpretation is good, but for what purpose?
Goal 2: To compare survivor and/or hazard functions.
Why? Just ‘cause?
Goal 3: To assess the relationship of explanatory variables to survival time.
Look at the verbal tap dancing in the phrase “assess the relationship”. That could mean anything! What Kleinbaum and Klein are trying not to say is that our goal is to learn what causes durations to be longer or shorter, so we can improve the survival time in the direction we prefer.