AI and weather forecasting: the first wave of AI-based weather models

Fri 10 May 2024


In our previous article on AI-based weather forecasting, we sketched the technologies underlying these so-called MLWP models. We saw that many innovations in the AI landscape of the last decade have trickled down into modelling how the atmosphere evolves. According to ECMWF, some of these models may be legitimate rivals to its NWP system, IFS (ECMWF, 2023). Today, the focus lies on the first wave of AI-based weather models. Firstly, we’ll explore the approaches that have led to these models – namely, those that model at a one-degree resolution. This is followed by a look at Pangu-Weather, FourCastNet and GraphCast – and we especially zoom in on their performance compared to IFS. Finally, we’ll discuss how our weather experts are already using these models experimentally, and what their experiences are so far.

Early approaches with limited results

After 2012, neural networks emerged as the go-to models for many computer vision problems (Infoplaza, 2024). It is therefore unsurprising that early pioneers started experimenting with neural-network-based weather modelling. For example, as early as 2018, Dueben and Bauer attempted to use these networks for making global weather forecasts. The results were fairly limited in scope and performance: their model only forecast the geopotential height at around 5.5 kilometers, and errors accumulated rapidly.

Still, these were the first steps that provided some answers to fundamental questions like “what dataset do we need?” and “is modelling the weather using neural networks in line with physics?”. Only one year later, Weyn et al. (2019) introduced a new approach that would eventually lead to better performance than a coarse-resolution NWP model while predicting more variables (Weyn et al., 2020). Finally, in 2021, Rasp & Thuerey took a different approach: they attempted to build climatological understanding first, before fine-tuning with weather data (Rasp & Thuerey, 2021). Still, these models were fairly limited: they typically had a coarse resolution, making them unusable for operational forecasting, they covered only a few variables, or their errors were larger than those of NWP systems.

Pangu-Weather, FourCastNet and GraphCast: quarter-degree models

This changed in 2022, when various results were published that significantly improved on the early approaches (Ben Bouallègue et al., 2024). In the previous article, we learned about graph neural networks, and Keisler (2022) relied heavily on this approach, refining the forecast resolution to 1 degree. This was rapidly followed by a further refinement to a quarter degree with the introduction of Pangu-Weather, FourCastNet and GraphCast.

| Model name | Authors | How it works |
|---|---|---|
| Pangu-Weather | Bi et al. (2022) | Pangu-Weather considers the atmospheric variables to be images containing weather patterns, and learns to capture these. The model is fed many variables at time T to predict time T + X, where X is 1, 3, 6 or 24 hours (effectively four different models, used together in an intelligent way). All inputs are treated as images and cut into patches, which are fed through the model, after which the output patches are recombined into a weather forecast for the next time step. |
| FourCastNet v1 and v2 | Pathak et al. (2022); Bonev et al. (2023) | FourCastNet works similarly to Pangu-Weather, but uses neural operators, effectively attempting to learn the functions that bring forward these patterns rather than the patterns themselves. |
| GraphCast | Lam et al. (2022) | GraphCast represents the atmosphere as a graph, on which it learns to pass messages between grid points at various resolutions. Effectively, this is analogous to one point informing the other that weather is coming its way. |
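
The Pangu-Weather entry mentions four models, with lead times of 1, 3, 6 and 24 hours, that are used together. As far as we understand Bi et al. (2022), the combination is greedy: the largest step that still fits is taken first, so that as few model calls as possible are chained and less error accumulates. The sketch below only plans the sequence of steps and does not call any real model; the function name `plan_steps` is hypothetical.

```python
def plan_steps(lead_time_hours, available_steps=(24, 6, 3, 1)):
    """Greedily split a lead time (in hours) into the fewest model steps.

    Illustrative only: `available_steps` mirrors the 1, 3, 6 and 24 hour
    models mentioned in the text; no forecast model is run here.
    """
    if lead_time_hours < 0:
        raise ValueError("lead time must be non-negative")

    remaining = lead_time_hours
    plan = []
    # Try the longest step first, and repeat it as long as it still fits.
    for step in sorted(available_steps, reverse=True):
        while remaining >= step:
            plan.append(step)
            remaining -= step

    if remaining != 0:
        raise ValueError("lead time cannot be reached with the given steps")
    return plan


print(plan_steps(56))  # [24, 24, 6, 1, 1]
```

For a 56-hour forecast, this gives two 24-hour steps, one 6-hour step and two 1-hour steps: five model calls instead of 56 calls of the 1-hour model.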

Comparing their performance vs ECMWF IFS

To assess how well these models are performing, it’s key to benchmark them. We will discuss benchmarking AI weather models in more detail in a forthcoming article, but WeatherBench 2 is a good choice for this. It “is a framework for evaluating and comparing data-driven and traditional numerical weather forecasting models” and contains evaluation datasets and evaluation code (WeatherBench, n.d.). This helps ensure that all performance analyses are done in the same way and that apples are compared to apples. 

Comparing apples to apples matters not only between AI-based weather models, but also when comparing these models with traditional NWP models. In particular, models can only be compared fairly when they are initialized in the same way, so that differences in behavior cannot be attributed to differences in initial conditions. In WeatherBench, this is accounted for by so-called ERA5 forecasts, which “provide a like-for-like baseline for an AI model initialized from and evaluated against ERA5” (WeatherBench, 2024). These forecasts are produced by running the version of the IFS model that was used to create the ERA5 dataset (i.e., the dataset used for training many AI weather models), initialized with ERA5 data.

For this reason, in what follows, besides showing the performance of the IFS against actual analyses, we also show the ERA5 forecasts (the IFS initialized with ERA5 data), to allow for better AI/NWP comparisons. We use the highest available resolution, as this is most relevant for operational weather forecasting. Unfortunately, the FourCastNet series is currently not present within WeatherBench.

Finally, the goal of what follows is to provide an initial feeling as to how well these models are performing. There are more works, among them Ben Bouallègue et al. (2024), which study their performance in more detail. What’s more, all the evaluated models can also be observed via WeatherBench.

Global performance

To get an initial idea of model performance, let’s look at the Root Mean Square Error (RMSE) for three near-surface variables: wind speed, temperature and air pressure. This score is computed by taking many forecasts and, for each, taking the difference between the predicted value and the observed value (the label “vs ERA5” in the charts indicates that the observed value is taken from the corresponding ERA5 data). These differences are squared and averaged, and the square root of that average is then taken. Because of the squaring, the metric penalizes large errors disproportionately, so extreme errors show up as higher scores. In other words, the lower the score, the better.
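
To make the RMSE computation concrete, here is a minimal Python sketch using NumPy: the differences are squared, averaged, and the square root of that average is taken. The numbers are made up for illustration. Benchmarks on a global grid typically also weight grid points by area, which is left out here for simplicity.

```python
import numpy as np


def rmse(forecast, truth):
    """Root Mean Square Error between forecast and reference values."""
    forecast = np.asarray(forecast, dtype=float)
    truth = np.asarray(truth, dtype=float)
    errors = forecast - truth
    # Square the errors, average them, then take the square root.
    return float(np.sqrt(np.mean(errors ** 2)))


# Made-up reference values and two made-up forecasts.
truth = [10.0, 10.0, 10.0, 10.0]
small_everywhere = [11.0, 9.0, 11.0, 9.0]  # four errors of size 1
one_large = [10.0, 10.0, 10.0, 14.0]       # a single error of size 4

print(rmse(small_everywhere, truth))  # 1.0
print(rmse(one_large, truth))         # 2.0
```

Both forecasts have the same mean absolute error (1.0), but the forecast with one large miss has twice the RMSE, which is what is meant by penalizing large errors.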

10-meter wind speed, global RMSE in m/s (WeatherBench, 2024b).

2-meter temperature, global RMSE in Kelvin (WeatherBench, 2024b).

Sea-level pressure, global RMSE in Pascal (WeatherBench, 2024b).

When looking at these initial scores, we can make a few observations:
•    Models are on par with each other in the first 24 hours. For some variables, however, NWP models have (significantly) lower errors, meaning that they perform better.
•    In the medium term, significant differences can be observed between NWP (notably, the ERA5 forecasts and IFS HRES forecasts) and AI-based methods. Typically, AI-based methods have lower errors than NWP methods.
•    Differences between AI-based methods and NWP methods tend to become larger over time.

Unfortunately, no single metric tells the entire story. If we compute forecast bias, i.e. whether values were too low or too high on average, we see something different: NWP methods tend to have relatively low biases compared to AI-based methods. This suggests that, even though AI-based methods have lower errors in most cases, they tend to miss extreme situations. The literature shows that this is a limitation of many first-generation AI-based weather models. For example, Bi et al. (2022) already indicated in their work on Pangu-Weather that the model was competitive with NWP models in computing the position of storm systems, while struggling with extreme wind speeds. Here, the added value of a meteorologist and their professional advice is crystal clear.

10-meter wind speed, global bias in m/s (WeatherBench, 2024b).

2-meter temperature, global bias in Kelvin (WeatherBench, 2024b).

Sea-level pressure, global bias in Pascal (WeatherBench, 2024b).
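
Forecast bias, as used in the charts above, is simply the average of forecast minus reference, with the sign kept. The minimal sketch below uses made-up wind speeds to show two things: a forecast that damps an extreme gets a negative bias, and a bias of zero does not mean a forecast is error-free, because positive and negative errors cancel.

```python
import numpy as np


def bias(forecast, truth):
    """Mean error: negative means too low on average, positive too high."""
    forecast = np.asarray(forecast, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.mean(forecast - truth))


# Made-up 10-meter wind speeds in m/s; the last value is a storm.
truth = [5.0, 6.0, 7.0, 25.0]
damped_extreme = [5.0, 6.0, 7.0, 19.0]  # storm underestimated by 6 m/s
cancelling = [7.0, 4.0, 9.0, 23.0]      # errors of +2, -2, +2, -2

print(bias(damped_extreme, truth))  # -1.5
print(bias(cancelling, truth))      # 0.0, although every value is off by 2
```

This is why bias and RMSE are best read together: each hides something the other reveals.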

Varying performance by region and time

The previous charts show global performance averaged over the year. However, no model performs equally well everywhere and at all times. There can be regional differences: for example, the performance of various weather models diminishes towards the poles (Schroeter, 2024). Similarly, there can be differences between various periods of the year, as indicated by WeatherBench (2024c). It is important to take these into account when assessing performance, for example to develop intuition about which model to trust when making a forecast. More scores can be explored through the WeatherBench website.

The temporal performance scores of the models when analyzed for 2020 (WeatherBench, 2024c).
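
To illustrate how a single global score can hide regional differences, the sketch below splits made-up errors into a polar and a non-polar group and computes the RMSE per group. The 60-degree threshold, the function name `rmse_by_region` and all values are assumptions for illustration, not WeatherBench definitions.

```python
import numpy as np


def rmse_by_region(forecast, truth, latitudes, polar_from=60.0):
    """RMSE for grid points poleward of `polar_from` degrees and for the rest."""
    forecast = np.asarray(forecast, dtype=float)
    truth = np.asarray(truth, dtype=float)
    latitudes = np.asarray(latitudes, dtype=float)

    squared_errors = (forecast - truth) ** 2
    is_polar = np.abs(latitudes) >= polar_from
    return {
        "polar": float(np.sqrt(squared_errors[is_polar].mean())),
        "non_polar": float(np.sqrt(squared_errors[~is_polar].mean())),
    }


# Made-up values at five latitudes; errors are larger near the poles.
latitudes = [-75.0, -30.0, 0.0, 30.0, 75.0]
truth = [0.0, 0.0, 0.0, 0.0, 0.0]
forecast = [3.0, 1.0, 1.0, 1.0, 3.0]

scores = rmse_by_region(forecast, truth, latitudes)
print(scores)  # {'polar': 3.0, 'non_polar': 1.0}
```

The same grouping idea applies to time: splitting forecasts by month or season instead of latitude gives the kind of temporal scores shown above.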

Using AI weather models: experiences from our weather experts

A crucial algorithm that we meteorologists at Infoplaza use in forecasting is the Smart Automaster (SAM). This is an in-house developed algorithm that averages new data from various models with the latest issued forecast to filter out extreme volatility between different model runs. Additionally, utilizing SAM has another significant advantage. Different models excel in different weather situations. For instance, on average, ECMWF is the best model and is often used as a standard. However, verification shows that, for example, GFS performs better than ECMWF in periods with strong winds and high waves. Moreover, a high-resolution model such as Harmonie is useful for predicting mist and convection. All these models are combined in SAM to generate the best possible forecast for our clients. The output of SAM is always checked and adjusted by the duty meteorologist if necessary.
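
The idea of averaging new model data with the latest issued forecast can be sketched as follows. This is not the SAM implementation, whose details are not covered in this article: the function name `blend_forecast`, the equal treatment of all models and the `weight_new` parameter are assumptions made purely for illustration.

```python
import numpy as np


def blend_forecast(previous_forecast, new_model_runs, weight_new=0.5):
    """Blend the mean of new model runs with the previously issued forecast.

    previous_forecast: values per forecast time, shape (n_times,)
    new_model_runs:    one row per model, shape (n_models, n_times)
    weight_new:        hypothetical weight of the new data, between 0 and 1
    """
    if not 0.0 <= weight_new <= 1.0:
        raise ValueError("weight_new must be between 0 and 1")

    previous = np.asarray(previous_forecast, dtype=float)
    runs = np.asarray(new_model_runs, dtype=float)

    # Average the new runs across models, then mix with the old forecast.
    model_mean = runs.mean(axis=0)
    return weight_new * model_mean + (1.0 - weight_new) * previous


# Made-up wind speeds in m/s for three forecast times.
previous = [10.0, 12.0, 14.0]
new_runs = [
    [10.0, 12.0, 20.0],  # model A jumps at the last time step
    [10.0, 12.0, 16.0],  # model B is more moderate
]

blended = blend_forecast(previous, new_runs)
print(blended)  # last value moves from 14 to 16 rather than jumping to 18 or 20
```

In this sketch, a sudden jump in one model run is damped twice: once by averaging across models and once by mixing with the previous forecast. Weighting models by weather situation, as described above, would require weights per model rather than a plain mean.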

Furthermore, meteorologists are increasingly using AI models in operational settings. We observe these weather models are rapidly improving, and it's important to start working with them now to gain experience. Operationally, we notice that the resolution, especially for nearshore operations, often needs improvement. High-resolution models like Harmonie and SWAN provide more accurate results in such cases. Additionally, the meteorologist on duty often observes that extremes are underestimated by these models. However, over the longer term (2-7 days ahead), these models perform well. Moreover, the output of AI models seems to be more consistent across different runs.

Despite AI models currently being mainly in an experimental phase of use, it's crucial to start working with them now. The strength of a meteorologist lies in understanding how models operate. Thus, a meteorologist can combine the strengths of different models to create a reliable forecast. Gaining experience with AI models is therefore crucial to further improving these models now and potentially using them operationally in the future.

Next articles in our series

These are the forthcoming articles in our series about AI-based weather forecasting.

Next generation AI-based weather models    
Introducing ECMWF’s AIFS model, FuXi and the FengWu series, which are newer generation AI-based weather models that attempt to overcome certain limitations. 

Evaluating AI-based weather models with WeatherBench
Introduces the WeatherBench benchmarking suite in more detail, which can be used to consistently evaluate AI-based weather models.

Going beyond analyses - using observations more directly
Today’s AI-based weather models are reliant on NWP analyses – but new approaches are trying to work around this limitation.

References



Ben Bouallègue, Z., Clare, M. C., Magnusson, L., Gascon, E., Maier-Gerber, M., Janoušek, M., ... & Pappenberger, F. (2024). The rise of data-driven weather forecasting: A first statistical assessment of machine learning-based weather forecasts in an operational-like context. Bulletin of the American Meteorological Society.

Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., & Tian, Q. (2022). Pangu-Weather: A 3D high-resolution model for fast and accurate global weather forecast. arXiv preprint arXiv:2211.02556.

Bonev, B., Kurth, T., Hundt, C., Pathak, J., Baust, M., Kashinath, K., & Anandkumar, A. (2023, July). Spherical Fourier neural operators: Learning stable dynamics on the sphere. In International Conference on Machine Learning (pp. 2806-2823). PMLR.

Dueben, P. D., & Bauer, P. (2018). Challenges and design choices for global weather and climate models based on machine learning. Geoscientific Model Development, 11(10), 3999-4009.

ECMWF. (2023). The rise of machine learning in weather forecasting.

Infoplaza. (2024, April 25). AI and weather forecasting: A deep dive into MLWP technology. Infoplaza - Guiding you to the decision point.

Keisler, R. (2022). Forecasting global weather with graph neural networks. arXiv preprint arXiv:2202.07575.

Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Alet, F., ... & Battaglia, P. (2022). GraphCast: Learning skillful medium-range global weather forecasting. arXiv preprint arXiv:2212.12794.

Pathak, J., Subramanian, S., Harrington, P., Raja, S., Chattopadhyay, A., Mardani, M., ... & Anandkumar, A. (2022). FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214.

Rasp, S., & Thuerey, N. (2021). Data-driven medium-range weather prediction with a ResNet pretrained on climate simulations: A new model for WeatherBench. Journal of Advances in Modeling Earth Systems, 13(2), e2020MS002405.

Schroeter, B. (2024). Towards improved modelling of the high southern latitudes (Doctoral dissertation, University of Tasmania).

WeatherBench. (n.d.). Why WeatherBench? — WeatherBench 2 documentation.

WeatherBench. (2024). FAQ.

WeatherBench. (2024b). Deterministic scores – WeatherBench2.

WeatherBench. (2024c). Temporal scores.

Weyn, J. A., Durran, D. R., & Caruana, R. (2019). Can machines learn to predict weather? Using deep learning to predict gridded 500‐hPa geopotential height from historical weather data. Journal of Advances in Modeling Earth Systems, 11(8), 2680-2693.

Weyn, J. A., Durran, D. R., & Caruana, R. (2020). Improving data‐driven global weather prediction using deep convolutional neural networks on a cubed sphere. Journal of Advances in Modeling Earth Systems, 12(9), e2020MS002109.
