Geospatial Foundation Models (GeoFMs) are large-scale AI models trained on diverse Earth observation data to support multiple geospatial tasks. They are transforming how we understand and manage our planet. But as these models grow in complexity and capability, one question becomes critical:
Current efforts face limitations in scope, licensing, reproducibility and usability, hindering adoption and comparison. Today, ahead of the full release, we unveil the first results on GEO-Bench-2 — a benchmark designed to evaluate GeoFMs across diverse tasks and datasets. With GEO-Bench-2 we aim to provide a shared framework that accelerates innovation and collaboration in geospatial AI.
GEO-Bench-2 is more than a dataset collection. It’s a comprehensive evaluation ecosystem built to answer open research questions and guide the next generation of GeoFMs.
We’ve aggregated 19 datasets, organized into 9 subsets, each targeting specific capabilities aligned with critical research challenges. Datasets have been carefully reviewed to have a permissive license and high data quality as well as sub-sampled, if necessary, to decrease computational resources needed for evaluation.
Our protocol ensures fair, comparable, and meaningful evaluations, regardless of model architecture or pre-training strategy. It accounts for both Hyper Parameter Tuning (HPO) and repeated experiments to allow models to adapt to the specific task and account for variability across different runs. This flexibility fosters innovation while maintaining scientific rigor. To ensure an apple-to-apple comparison when aggregating across datasets, the protocol also recommends dataset-wise min-max normalization, as well as the use of bootstrapped interquartile mean to exclude outliers and account for uncertainty.
Following the GEO-Bench-2 protocol, we ran large-scale experiments using TerraTorch with more than 15,000 runs across popular open GeoFMs and compared to latest open Computer Vision approaches (See Table 2) to uncover:
Table 3 shows results using a full finetuning approach. Clay V1, which uses a smaller patch size to retain more information at a higher computational cost, takes the lead in the Core capability, followed by ConvNext-based models. As expected, general computer vision models and those focusing on RGB-only data in pretraining (i.e. DINO V3 and ConvNext) top the <10m resolution and RGB/NIR capabilities. In contrast, EO-specific models such as TerraMind and Prithvi-EO-2.0, pretrained for this type of data, lead in multispectral-dependent tasks. These are critical for applications like agriculture, environmental monitoring, and disaster response, where RGB data alone is often insufficient.
Progress thrives on transparency. Our leaderboard tracks model performance on GEO-Bench-2, enabling the community to benchmark, compare, and improve. The leaderboard allows to dive into specific capabilities and provide settings to explore both full finetuning and frozen encoder approaches.
GEO-Bench-2 comes as its own python package, available on Github with all datasets being stored in TACO format that will be available soon on the AI Alliance Hugging Face page. We built also a strong integration with TerraTorch, the most popular tool to finetune and experiment with GeoFMs. This allows to launch the benchmark, including HPO and repeated experiments, over GEO-Bench-2 datasets with just a simple config. Although this will make experimentation easier, this is not a hard requirement to submit scores to the leaderboard and users can use their favorite tools.
GEO-Bench-2 is an open initiative led by IBM and ServiceNow in collaboration with MBZUAI, NASA, ESA Φ-lab, Technical University Munich, Arizona State University, and Clark University. This is done under the AI Alliance Climate & Sustainability Group. Full release of datasets and leaderboard on Hugging Face will be provided soon, alongside a comprehensive technical report. This will explain how we obtained our results, a number of ablation studies, and considerations around model capabilities against number of parameters and pretraining dataset size. Stay tuned and help us shape the future of geospatial AI!
“By providing a shared yardstick, GEO-Bench-2 helps researchers, companies, and society navigate toward a smarter, more sustainable planet.”
Paolo Fraccaro, Alexander Lacoste, Naomi Simumba, Nils Lehmann