Article | WTW Research Network Newsletter

Rethinking catastrophe model evaluation by getting out of Model Land

By Dr Erica Thompson | April 17, 2024

For effective risk management, insurers require a catastrophe model evaluation framework that focuses on past performance, expert judgements, consistency in confidence levels, and a well-defined attitude to model risk.

We want to assess future risk. But because the future hasn’t happened yet, our only way of accessing knowledge about the future is by making models. Unfortunately, because models are simplifications of reality, they are inherently limited and uncertain, so we don’t just need to know what the model says, we also need some indication of how far we should trust it.

If we are interested in low-probability events or tail risks, as is often the case with natural catastrophe models used by insurers, this is going to be a difficult ask. First, we don’t have very much data, because low-probability events by definition don’t happen very often.  Second, the tail risks are unlikely to be stationary – societies are evolving; businesses are transforming; the climate is changing – and so past data are not necessarily reflective even of current risk levels, let alone future levels. Third, even if we can assess risk levels in Model Land [1], inside our computer, what we really want to know is the risk in the real world, where we live and do business.

Going beyond: “The model was wrong!”

It’s a common story in insurance. We have a model generated using past data, telling us how often to expect a certain event. Then, a very high return period loss happens the following year. How do we respond to that? If we had a lot of a priori confidence in the model, we would say “oh dear, we were extremely unlucky!” and carry on as normal in the expectation that it will be a long time before the event happens again. If we had very little a priori confidence in the model, we would say “the model was wrong!” and either update it or throw it out in favour of some other way of determining risk in the future.
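
One way to make this choice precise is a simple Bayesian reading of the sketch above. The sketch below is illustrative only: the 1-in-200 return period, the 1-in-20 frequency assumed for a badly mis-specified model, and the two prior confidence levels are all invented figures, not from the article.

```python
# Illustrative sketch only: how prior confidence in a model shapes the
# response to a supposedly very rare loss. All numbers are assumptions.

def posterior_confidence(prior, p_event_if_adequate, p_event_if_flawed):
    """Bayes' rule: probability the model is adequate, given the event occurred."""
    p_event = prior * p_event_if_adequate + (1 - prior) * p_event_if_flawed
    return prior * p_event_if_adequate / p_event

# Assume the model puts the loss at a 1-in-200-year event, while a badly
# mis-specified model would correspond to a true frequency nearer 1-in-20.
p_adequate, p_flawed = 1 / 200, 1 / 20

for prior in (0.95, 0.50):  # strong vs weak a priori confidence in the model
    post = posterior_confidence(prior, p_adequate, p_flawed)
    print(f"prior confidence {prior:.2f} -> posterior {post:.2f}")

# With a 0.95 prior the posterior stays around 0.66: "we were unlucky".
# With a 0.50 prior it collapses to roughly 0.09: "the model was wrong".
```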

How do we decide which position to take before we get blind-sided in this way? Only by a comprehensive process of model evaluation.

The art and the science of evaluating models

Let’s start with the science. Model evaluation with respect to existing data typically consists of asking whether the model fits the data that we have. Can a forecast (or hindcast) using the model successfully predict observed values? If it can, that’s a great start – and we can quantify the success of that model in various ways.
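
As a minimal sketch of what "quantifying the success" of a hindcast might look like, the snippet below compares modelled and observed annual losses with a few standard error statistics. The loss figures are invented for illustration; real evaluation would use metrics chosen for the decision at hand.

```python
import numpy as np

# Hypothetical hindcast check: modelled annual losses versus the losses
# actually observed over the same period. All figures are invented.
modelled = np.array([12.0,  8.5, 30.2,  5.1, 18.7,  9.9, 44.0,  7.3])
observed = np.array([10.5,  9.0, 41.0,  4.8, 15.2, 11.1, 39.5,  6.9])

bias = np.mean(modelled - observed)           # systematic over- or under-estimation
mae = np.mean(np.abs(modelled - observed))    # typical size of the error
corr = np.corrcoef(modelled, observed)[0, 1]  # does the model rank the bad years correctly?

print(f"bias {bias:+.2f}, mean absolute error {mae:.2f}, correlation {corr:.2f}")
```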

But nobody cares how good your model is at predicting last year’s events or past data; the bottom line is that we have to develop some level of confidence in its ability to predict future events and as-yet-unseen data.

In order to do so, we don’t just need to know that the model fitted past data. We also need to demonstrate that no other plausible model could have fitted the past data equally well. And we don’t just need to know that past performance was good, we need some compelling underlying reason to believe that past performance is an indicator of future success.
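
The point about other plausible models can be made concrete. The sketch below fits three different candidate distributions to the same invented loss history; the distributions, sample size, and random seed are all assumptions chosen for illustration, not a recommended workflow.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Invented "historical" annual losses (30 years), purely for illustration.
losses = rng.lognormal(mean=3.0, sigma=0.8, size=30)

candidates = {"lognormal": stats.lognorm,
              "gamma": stats.gamma,
              "weibull": stats.weibull_min}

for name, dist in candidates.items():
    params = dist.fit(losses, floc=0)            # fit with location fixed at zero
    loglik = np.sum(dist.logpdf(losses, *params))
    loss_200 = dist.ppf(1 - 1 / 200, *params)    # implied 1-in-200-year loss
    print(f"{name:9s} log-likelihood {loglik:8.2f}   1-in-200 loss {loss_200:8.1f}")

# The in-sample fits are typically hard to tell apart, yet the implied
# 1-in-200-year losses can differ markedly: the past data alone do not
# tell us which tail to believe.
```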

That’s where the art comes in, and it’s a lot more difficult than the science. The science can tell us how good the model was in the past. The art is needed to make judgements about how good it will be in future. And those judgements can only come from our real-world knowledge and experience, not from within Model Land.[1]

That’s not to say that this is an entirely subjective process. You and I probably agree with very high confidence that the laws of physics will be the same tomorrow as they are today, and so we can agree that the past performance of a weather forecast gives us good reason to have confidence in tomorrow’s weather forecast. But we might have less agreement about the confidence we should have in statistical fits to hurricane numbers, or in expectations about future interest rates or regulatory regimes. This disagreement might be related to our levels of understanding, to our political stances, or to our risk attitudes.

Evaluating the performance of AI models

That art of judgement implies limitations on the use of fully data-driven models to create believable estimates of future risk. Several teams, including Google DeepMind and the European Centre for Medium-Range Weather Forecasts, are working on AI weather models, which on some metrics now match the performance of current state-of-the-art physics-based models. This kind of forecast problem is one that we can expect AI to be extremely good at: lots of data, constant opportunity for out-of-sample testing, and (crucially) the vast majority of the time, tomorrow’s weather falls within the range of weather previously experienced and used for model calibration.

But on longer climate timescales, there is a problem: relatively little data, little opportunity for out-of-sample testing, and the climate is rapidly evolving into states which are outside anything even in the experience of the human species, let alone historical measurement.

Large Language Models have the same issue; they are pretty good at giving a neat answer to questions for which the correct answer exists in the training set, but if you ask them to come up with something outside that training set, they are liable to “hallucinate” an answer which sounds plausible but turns out to be unfounded speculation in a confidence-building format. How can we tell that models of future risk, whether statistically or physically constructed, are giving us a probable answer and not just a plausible hallucination?

To be clear, it's not that data-driven or AI models can’t make these judgements. The problem is more that they do make these judgements, but in a way that isn’t necessarily transparent and is set by the objective function, by reinforcement training, or by the constraints on the available data.

Model evaluation frameworks

If we want to manage risks effectively using models, therefore, we need a model evaluation framework which takes into account the following aspects:

  • First and most obviously, we need to evaluate past performance with quantitative skill metrics suitably chosen for the relevant task: for example, information-based metrics relating to the value of the decision or to the relevant information gained (see the sketch after this list);
  • In addition to that, we also need to make expert judgements about the degree to which past performance is reflective of future performance, based on a more holistic and experiential understanding of model quality which does not only come from the skill metrics above;
  • As a check that models are being used appropriately, we need some understanding of consistency in confidence levels and how the model informs different decisions. For example, in the “model was wrong” sketch above, the rejection of the model following an extreme event should be consistent with the confidence level attached to the initial pricing decision;
  • To connect the model through to decisions and institutional policies, we need a well-defined attitude to model risk, including the potential for model inadequacy and unexpected outcomes – where is this a threat, and where might it be framed as a business opportunity?
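
As one concrete reading of the "information-based metrics" in the first point above, the sketch below scores a set of probabilistic exceedance forecasts by their average information gain over a climatological baseline, measured in bits. The forecast probabilities, the outcomes, and the choice of baseline are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical probabilistic hindcast: modelled annual probability of
# exceeding a loss threshold, and whether the exceedance occurred (invented).
p_model = np.array([0.10, 0.05, 0.25, 0.02, 0.15, 0.08, 0.35, 0.05])
occurred = np.array([0, 0, 1, 0, 0, 0, 1, 0])

def ignorance(p, outcome):
    """Average ignorance (negative log2) score in bits; lower is better."""
    return -np.mean(np.log2(np.where(outcome == 1, p, 1 - p)))

# Baseline "climatology" forecast: always predict the historical exceedance rate.
p_clim = np.full_like(p_model, occurred.mean())

gain = ignorance(p_clim, occurred) - ignorance(p_model, occurred)
print(f"information gained over climatology: {gain:.2f} bits per forecast")
```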

Clearly, perfect models are not required for insurance to be a profitable business, but in a competitive market, realistic and effective model evaluation can extract better insights and more decision-relevant information from the imperfect models that are available. These are the aims of a new research collaboration between WTW and the London Mathematical Laboratory (LML), looking at the calibration and evaluation of catastrophe models for insurance decision-making. This partnership will allow WTW to present clients with more sophisticated insights into natural catastrophe models, enhancing their ability to make informed and strategic decisions in the face of uncertainties.

Footnotes

  1. Thompson, E. Escape from Model Land: How Mathematical Models Can Lead Us Astray and What We Can Do About It. Basic Books, 2023.

Contact


Dr Erica Thompson
Fellow of the London Mathematical Laboratory, Associate Professor at UCL, and author of Escape From Model Land
