

For a time-to-event model people usually just define a hazard function. What does it mean? If $T$ is the non-negative random variable representing the time when the event occurs for the first time, what does a given hazard function tell us about the distribution of $T$? How noisy will the data be if I generate data of $T$ with a given hazard function?
For comparing different time-to-event models, the concordance index (C-index) is a common choice. What is the theoretical maximum C-index for an oracle model that knows the true hazard exactly?
The definition of a hazard function is
$$
h(t)
= \lim_{\Delta t \to 0}
\frac{P\big(t \le T < t + \Delta t \,\big|\, T \ge t\big)}{\Delta t}
$$
so a hazard function encodes the assumptions about the distribution of $T$. But
how does it relate to, for example, the probability density function $f_T(t)$
of $T$? Well, we see that
$$
\begin{align}
h(t)
&= \lim_{\Delta t \to 0}
\frac{P\big(t \le T < t + \Delta t\big)}{P(T\geq t) \Delta t} = \frac{1}{P(T\geq t)}\lim_{\Delta t \to 0}
\frac{P\big(t \le T < t + \Delta t\big)}{ \Delta t} = \frac{f_T(t)}{P(T\geq t)}
\end{align}
$$
where we used the definition of conditional probability (the event $t \le T < t + \Delta t$ already implies $T \ge t$) and the definition of a probability density. So the key identity is
$$
h(t) = \frac{\frac{\text{d}}{\text{d}t}F_T(t)}{1-F_T(t)}
$$
where $F_T(t) = P(T \leq t)$ is the cumulative distribution function (CDF) of $T$. Using
the survival function notation $S(t) = 1 - F_T(t) = P(T > t)$, which equals $P(T \geq t)$ for a continuous $T$, we get
$$
h(t) = \frac{\frac{\text{d}}{\text{d}t}(1-S(t))}{S(t)} = \frac{- \frac{\text{d}}{\text{d}t} S(t)}{S(t)} = - \frac{\text{d}}{\text{d}t } \log S(t)
$$
which leads to
$$
S(t) = \exp \left(-\int_0^t h(u) \text{d}u \right).
$$
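As a quick numerical sanity check of this identity, the sketch below integrates a Weibull hazard and compares the result with the Weibull survival function that base R provides through `pweibull`. The Weibull choice, the shape and scale values, and the helper names are assumptions made only for illustration.

```r
# Weibull hazard with shape k and scale s (illustrative values, not from the post)
weibull_hazard <- function(t, k = 1.5, s = 10) {
  (k / s) * (t / s)^(k - 1)
}

# Survival function obtained from the hazard: S(t) = exp(-int_0^t h(u) du)
surv_from_hazard <- function(t, hazard_fun) {
  cum_hazard <- stats::integrate(hazard_fun, lower = 0, upper = t)$value
  exp(-cum_hazard)
}

t_grid <- c(1, 5, 10, 20)
s_numeric <- sapply(t_grid, surv_from_hazard, hazard_fun = weibull_hazard)
s_exact <- stats::pweibull(t_grid, shape = 1.5, scale = 10, lower.tail = FALSE)

# The two columns should agree up to numerical integration error
print(data.frame(t = t_grid, s_numeric, s_exact))
```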
An additional feature of time-to-event data is censoring. We typically observe right-censored data, meaning that we either observe the event time $T = t$ exactly, or we only learn that its value has to be larger than the end of follow-up $t_{\text{max}}$.
This still isn't very concrete, so let us take an example. If we have a constant hazard $h(t) = h_0$, we get
$$
S(t) = \exp(-h_0 t)
$$
and
$$
F_T(t) = 1 - \exp(-h_0 t)
$$
which happens to be the CDF of the exponential distribution with rate $h_0$.
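One way to see this connection concretely is inverse transform sampling: solving $S(t) = u$ for $t$ gives $t = -\log(u) / h_0$. The sketch below draws event times this way and compares the empirical survival fractions with $\exp(-h_0 t)$. The rate `h0 = 0.02`, the sample size, and the time grid are arbitrary illustrative choices.

```r
set.seed(1)
h0 <- 0.02  # constant hazard (illustrative value)
n <- 1e5

# Inverse transform sampling: if U ~ Uniform(0, 1), then T = -log(U) / h0
# satisfies P(T > t) = P(U < exp(-h0 * t)) = exp(-h0 * t) = S(t)
u <- stats::runif(n)
event_time <- -log(u) / h0

# Compare the empirical survival fraction with S(t) = exp(-h0 * t)
t_grid <- c(10, 50, 100)
empirical <- sapply(t_grid, function(t) mean(event_time > t))
theoretical <- exp(-h0 * t_grid)
print(data.frame(t = t_grid, empirical, theoretical))

# The mean event time should be close to 1 / h0
print(c(sample_mean = mean(event_time), expected = 1 / h0))
```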
Now let us assume that we have subjects of different age, and the hazard is
$$
h(t) = h_x = \exp(\log \lambda_0 + \beta x)
$$
where $x$ is the normalized subject age. The R functions below can be used to simulate event data so that the original age is drawn uniformly from $[30, 90]$.
library(dplyr)
library(survival)

# True hazard rate
hazard <- function(log_h0, x, beta) {
  exp(log_h0 + beta * x)
}

# Simulate data
simulate_data <- function(N, log_h0, beta) {
  age <- stats::runif(N, min = 30, max = 90)
  norm_age <- (age - mean(age)) / sd(age)
  rate <- hazard(log_h0, norm_age, beta)
  # A constant hazard per subject means exponentially distributed event times
  t <- rexp(N, rate = rate)
  data.frame(subject = 1:N, event_time = t, age, norm_age)
}

# Censor event time
apply_censor <- function(df, tmax, log_h0, beta) {
  df$time <- pmin(df$event_time, tmax)
  # status is 1 if the event was observed and 0 if censored at tmax
  df$status <- as.numeric(df$time != tmax)
  df$surv_time <- survival::Surv(df$time, df$status)
  # True probability of surviving until tmax
  df$surv_prob <- exp(-tmax * hazard(log_h0, df$norm_age, beta))
  df
}
I use these to generate event times for 2000 subjects with $\beta=2$ and $\log \lambda_0 = -4$. In the upper row (a) of the plot below, I visualize the distribution of event times over subjects, as well as the true probability of surviving until time $t_{\text{max}}=100$, which is $S(100) = \exp(-100 h_x)$.

In survival analysis, the C-index of a prediction can be interpreted as the probability that, for a randomly chosen pair of comparable subjects, the one with the earlier event time has a higher predicted risk (lower survival probability). A pair is comparable if we can unambiguously determine which subject had the earlier event: the subject with the shorter observed time must have an observed event, so a pair in which both event times are censored is never comparable. The maximum value is 1 (subjects ordered by survival probability are in the same order as the observed event times), and a value of 0.5 corresponds to random guessing.
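To make the definition concrete, here is a minimal brute-force sketch of Harrell's C on a toy data set. The function name `harrell_c` and the toy values are made up for illustration, and tied event times are simply skipped, which is a simplification compared with what `survival::concordance` does. Note also that the function takes a risk score (higher means an earlier event is expected), which is the opposite direction from a survival probability.

```r
# Harrell's C by brute force over all ordered pairs.
# time:   observed (possibly censored) time
# status: 1 if the event was observed, 0 if censored
# risk:   predicted risk score (higher means an earlier event is expected)
harrell_c <- function(time, status, risk) {
  n <- length(time)
  score <- 0
  comparable <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Pair (i, j) is comparable when subject i has an observed event
      # strictly before the observed time of subject j
      if (status[i] == 1 && time[i] < time[j]) {
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          score <- score + 1      # concordant pair
        } else if (risk[i] == risk[j]) {
          score <- score + 0.5    # tied prediction counts as half
        }
      }
    }
  }
  score / comparable
}

time   <- c(2, 4, 5, 7)
status <- c(1, 0, 1, 1)
risk   <- c(0.5, 0.9, 0.3, 0.3)
c_toy <- harrell_c(time, status, risk)
print(c_toy)
```

In this toy example there are four comparable pairs: subject 1 against each of the others, and subject 3 against subject 4. The pair formed by subjects 2 and 3 is not comparable, because subject 2 is censored before subject 3 has the event.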
The bottom row (b) of the previous plot shows the event time as a function of the true survival probability. I have also computed the C-index, which tells how concordant the predictions (survival probabilities until $t=100$) are with the fully observed event times (event_time) or the censored versions (surv_time). This is the oracle model, which knows the true event probability, yet it doesn't achieve a perfect C-index of 1. This is because the event time data is "noisy": a subject A who has a higher hazard than subject B can still get a later event time realization than subject B. So what is the theoretical maximum that we can achieve, and how does it depend on the factors $\lambda_0$, $t_{\text{max}}$ and $\beta$?
The R function below can be used to simulate data and compute the concordance index with a given number of subjects $N$ and given values of $\log(\lambda_0)$, $t_{\text{max}}$ and $\beta$.
library(dplyr)
library(survival)

compute_ci <- function(N, log_base_rate, tmax, beta) {
  df <- simulate_data(N, log_base_rate, beta)
  df_cens <- apply_censor(df, tmax, log_base_rate, beta)
  # C-index against the fully observed event times
  ci1 <- survival::concordance(event_time ~ surv_prob, data = df_cens)
  # C-index against the right-censored event times
  ci2 <- survival::concordance(surv_time ~ surv_prob, data = df_cens)
  data.frame(
    N = N, log_base_rate = log_base_rate, tmax = tmax,
    beta = beta,
    Noncensored = as.numeric(ci1$concordance),
    Censored = as.numeric(ci2$concordance)
  )
}

Note that in this case I could just as well rank by age instead of survival probability, because age determines the ordering of the survival probabilities. In fact, any strictly decreasing function of age (for $\beta > 0$) gives the same C-index as surv_prob. A strictly increasing function, such as age itself in a formula like event_time ~ age, reverses the ordering, so the direction has to be flipped (for example with the reverse = TRUE argument of survival::concordance) to get the same value.
Now, how does the C-index of the oracle depend on $t_{\text{max}}$, $\log(\lambda_0)$ and $\beta$? This question is answered by panels a, b, and c of the plot below. We go through what we see in the panels one by one.

In panel a), we see that the Noncensored C-index (event times are not censored for the observed ordering) does not depend on $t_{\text{max}}$. This is as expected, because $t_{\text{max}}$ does not change the ordering of the survival probabilities. The Censored version is higher with small $t_{\text{max}}$, and it approaches the Noncensored version as $t_{\text{max}}$ grows. This makes sense because when we increase $t_{\text{max}}$, fewer and fewer subjects get censored. The C-index looks only at comparable pairs of subjects. The reason why the Censored version is higher with small $t_{\text{max}}$ could be that the subject pairs that are left to be compared are easier to rank.
In panel b), we see the same kind of behaviour as a function of $\lambda_0$. The Censored version is generally higher but approaches the Noncensored version as $\lambda_0$ increases. However, the changes in the concordance index are very small in both a) and b), even though the studied ranges span drastic changes in overall hazard rate and follow-up time.
In panel c), we finally see big changes. The C-index depends heavily on the $\beta$ parameter, going from 0.5 to close to 1 over the range $\beta \in [0, 5]$. This parameter defines the maximum possible difference in log hazard between subjects. Because the normalized age is approximately uniform on the interval $[-\sqrt{3}, \sqrt{3}]$, the maximum difference in log hazard is approximately $2 \beta \sqrt{3}$.
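The dominant role of $\beta$ can also be seen analytically, at least under idealized assumptions. For two independent exponential event times with rates $h_i$ and $h_j$, we have $P(T_i < T_j) = h_i / (h_i + h_j)$, so an uncensored pair with normalized ages $x_i > x_j$ (and $\beta \geq 0$) is ordered correctly by the oracle with probability $1 / (1 + \exp(-\beta (x_i - x_j)))$, in which $\lambda_0$ cancels. The sketch below averages this probability over pairs, assuming that the normalized age is exactly uniform on $[-\sqrt{3}, \sqrt{3}]$ and ignoring finite-sample effects. The function name `oracle_c_uniform` is my own.

```r
# Expected uncensored oracle C-index when x ~ Uniform(-sqrt(3), sqrt(3)).
# The absolute difference d = |x_i - x_j| of two such independent draws has
# the triangular density (2 / L) * (1 - d / L) on [0, L], with L = 2 * sqrt(3).
oracle_c_uniform <- function(beta) {
  L <- 2 * sqrt(3)
  integrand <- function(d) stats::plogis(beta * d) * (2 / L) * (1 - d / L)
  stats::integrate(integrand, lower = 0, upper = L)$value
}

betas <- c(0, 0.5, 1, 2, 5)
oracle_c <- sapply(betas, oracle_c_uniform)
print(data.frame(beta = betas, oracle_c))
```

At $\beta = 0$ the integral is exactly 0.5, and it increases with $\beta$. Neither $\lambda_0$ nor $t_{\text{max}}$ appears in the expression, which is consistent with the Noncensored curves being flat in panels a) and b).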
Something that was not yet discussed is how to handle ties in the C-index, because with a unique age for each subject and $\beta>0$, there cannot be ties in the survival probability. This would change, and affect the results, if we discretized age to, for example, whole years, in which case subjects with the same age have the same predicted survival probability. Also, in the special case $\beta=0$ all subjects have the same predicted survival probability. We see in panel c) that in this case the C-index is 0.5, because a tied pair is counted as half-concordant by the underlying implementation (Harrell's C).
More ties will occur if, instead of years, we discretize the data into a few age groups. The number of possible unique survival probability predictions is the same as the number of groups. Now, assume the Harrell's C definition of the C-index, no censoring, just two groups, and equal group sizes. A random pair of subjects is from the same group with probability 0.5 (approximately, for large groups). Therefore, as the contribution of a tied pair is 0.5, the C-index of the oracle model is (proof left as an exercise for the reader)
$$
C = 0.5 \cdot 0.5 + 0.5 \frac{h_2}{h_1 + h_2} = 0.25 + 0.5 \frac{\frac{h_2}{h_1}}{1 + \frac{h_2}{h_1}}
$$
where $h_1$ and $h_2$ are the hazards of groups 1 and 2, respectively, with $h_2 \geq h_1$. Now, even if the hazard ratio $\frac{h_2}{h_1} \rightarrow \infty$, we have $C \rightarrow 0.75 < 1$.
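The two-group formula can be checked with a small simulation. The sketch below uses base R only and counts pairs directly instead of calling `survival::concordance`. The hazard values, the group size, and the variable names are illustrative assumptions.

```r
set.seed(1)
n_per_group <- 1000
h1 <- 0.01  # hazard in group 1 (illustrative value)
h2 <- 0.05  # hazard in group 2 (illustrative value)

# Oracle risk score = true hazard; event times are exponential, no censoring
risk <- rep(c(h1, h2), each = n_per_group)
event_time <- stats::rexp(2 * n_per_group, rate = risk)

# Pairwise comparison matrices; only pairs with i < j are counted
earlier <- outer(event_time, event_time, "<")  # subject i fails before j
higher  <- outer(risk, risk, ">")              # subject i has higher risk
tied    <- outer(risk, risk, "==")             # same group, same prediction
upper   <- upper.tri(earlier)

# Concordant pairs score 1, tied predictions score 0.5, discordant pairs 0
pair_score <- ifelse(tied, 0.5, as.numeric(earlier == higher))
c_simulated <- mean(pair_score[upper])
c_formula <- 0.25 + 0.5 * h2 / (h1 + h2)

print(c(simulated = c_simulated, formula = c_formula))
```

The simulated value should be close to the formula but not identical: besides Monte Carlo noise, the probability that a random pair comes from the same group is slightly below 0.5 for a finite sample.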
When thinking about whether the C-index of your model is good, it helps to think about what the C-index of the true data-generating hazard model would be if you generated similar data from it. The larger the differences in true hazards between the subjects are, the higher the achievable C-index.



