26 February 2026

OpenEuroLLM: First year progress and next steps


One year has passed since the start of the OpenEuroLLM project. This ambitious project, carried out by a consortium of 20 leading European research institutions, companies and EuroHPC centres, coordinated by Jan Hajič (Charles University, Czechia) and co-led by AMD Silo AI, has taken the first steps in developing next-generation open-source language models to advance European AI capabilities.


The project's main goal requires extensive research, access to high-performance computing resources, and strategic collaboration with other prominent European initiatives. During its inaugural year, the project has achieved significant milestones in advancing regional AI sovereignty through targeted efforts in digital infrastructure development, data practices, model development, and evaluation tools.

“Creating an open source multilingual LLM in the public space and within a large consortium is a challenging task. I am proud that thanks to the expertise, enthusiasm, commitment and hard work of especially the core partners the project has achieved its first-year goals. However, significant challenges, especially in securing more compute for creating the final models, still remain,” says Jan Hajič, Charles University.

Infrastructure

OpenEuroLLM is developing the digital infrastructure needed to lower the barriers to AI product development in Europe. This includes infrastructure for conducting large-scale distributed training, running model evaluations seamlessly across different European clusters, and building robust software stacks for experiments. In the first year of the project, these were essential steps to avoid dependence on a single cluster and to make the most of the current configurations of European HPC systems.

Data

In collaboration with Open-Sci, reference models for dataset selection and scaling trends have been developed. These reference models provide baselines for any other method trained on the same open reference datasets, making it easier to relate a new training procedure to existing, proven baselines.

MixtureVitae, another significant open web-scale pretraining dataset, has been developed together with LAION, Ontocord, and Open-Sci. It is the first permissively licensed dataset to match or outperform strong non-permissive datasets such as FineWeb-Edu and DCLM, and it is particularly strong on reasoning problems related to mathematics and code.

Together with EuroLLM, the project has tackled the data scarcity that most European languages face. Because current data collection cannot adequately address this scarcity, which limits proper representation of many languages in multilingual models, the first comprehensive multilingual synthetic pre-training dataset has been created.

In parallel, the project has established the basis of the OpenEuroLLM catalogue of LLM training data: a structured catalogue providing a uniform, collectively curated, and well-documented collection of candidate LLM training datasets. Datasets in the catalogue have been made publicly available (read-only) on multiple EuroHPC systems such as LUMI, Leonardo and MareNostrum to avoid duplicative efforts and redundant storage.

Models and evaluation

In collaboration with HPLT, 2B/100B reference models for various languages have been released. These transparent and easily reproducible reference models enable cross-lingual comparison, inspection of monolingual performance, and a better understanding of popular evaluation tasks for different languages.

In addition, a range of 2B/4TT models have been trained for studying multilingual data mixes to determine the optimal proportion of each language within a training dataset for producing high-performing multilingual LLMs.
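The project's exact mixing method is not described here, but a widely used baseline for setting per-language proportions in multilingual pretraining is temperature-scaled sampling, which upsamples low-resource languages relative to their raw data share. The sketch below uses made-up token counts purely for illustration; the function name and the counts are assumptions, not project artifacts.

```python
def mix_proportions(token_counts, temperature=3.0):
    """Temperature-scaled sampling proportions for a multilingual data mix.

    With temperature=1 languages are sampled proportionally to their data
    size; higher temperatures flatten the distribution, giving low-resource
    languages a larger share of the training mix.
    """
    total = sum(token_counts.values())
    # Raise each language's raw share to the power 1/temperature ...
    weights = {lang: (n / total) ** (1.0 / temperature)
               for lang, n in token_counts.items()}
    # ... then renormalize so the proportions sum to 1.
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical token counts (in billions), for illustration only.
counts = {"en": 1000, "de": 100, "fi": 10}
proportional = mix_proportions(counts, temperature=1.0)  # mirrors data sizes
flattened = mix_proportions(counts, temperature=5.0)     # boosts "fi"
```

Sweeping the temperature (or, more generally, the per-language weights) and training small models on each resulting mix is one common way such 2B-scale studies compare candidate mixes before scaling up.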

The results of both the 2B/100B and the 2B/4TT models inform future decisions as model sizes are scaled up.

Looking ahead

As the project enters its second year, transparency, openness and community collaboration remain its guiding values, and the work proceeds with high ambitions.

OpenEuroLLM has secured access to EuroHPC strategic compute resources, guaranteeing a substantial amount of compute on four major EuroHPC supercomputers for the remainder of the project. Additional compute resources will, however, be required to complement the strategic allocations.

The project aims to release an 8B model by next summer, followed by a larger model trained with the compute secured through the strategic allocation. Additionally, new iterations of the Poro model family will be released.


About the ELLIS Institute Tübingen gGmbH

Founded in 2023, the ELLIS Institute Tübingen has quickly established itself as a leading center for foundational AI research, attracting top machine learning talent and offering state-of-the-art facilities. Its mission is to advance pioneering AI research while contributing to the broader ELLIS initiative. Research areas include AI for science (in particular, physics), AI mechanisms, AI safety, probabilistic intelligence, cooperative machine intelligence, societal impacts, AutoML, computational and applied mathematics, deep models and optimization, robust machine learning, and innovative, resource-efficient AI. By integrating fundamental research with societal relevance, the institute is shaping the future of AI in Europe and globally, fostering innovation, interdisciplinary collaboration, and talent development. Almost all PIs at the ELLIS Institute Tübingen have co-affiliations at either the Max Planck Institute for Intelligent Systems or the Tübingen AI Center, and the ELLIS Institute thus connects to the broader Tübingen AI ecosystem.