TL;DR
Italy’s Minerva project, a European sovereign LLM trained from scratch, underperformed on an Italian school-exam benchmark despite significant investment. The result challenges assumptions about how much scale a country-specific language model needs.
Minerva-3B, trained entirely from scratch on 2.5 trillion tokens with approximately 50% Italian content, scored only 4.9% on the INVALSI Italian school-exam benchmark. The score shows how hard country-specific language understanding remains at this parameter scale.
Minerva was developed by Sapienza University’s NLP group, led by Roberto Navigli, using Italy’s national supercomputing infrastructure CINECA and funded through the country’s PNRR initiative. The project trained models ranging from 350 million to 7 billion parameters, with the 3B version being publicly released along with training data and code.
Despite the large investment and extensive Italian data, Minerva-3B’s performance on the INVALSI test was near chance, at just 4.9%. Researchers concluded that while dataset composition matters, overall dataset size and parameter count matter more for complex language tasks, so scaling up both appears necessary for country-specific language understanding. The result suggests that even substantial native-language data may not be enough, at current parameter scales, to produce deep country-specific knowledge.
Minerva. The opposite path.
Italy spent years building a European sovereign LLM from scratch. Then Minerva-3B scored 4.9% on the INVALSI Italian school exam.
Where AMÁLIA layered Portuguese specialization onto a multilingual foundation, Minerva trained from scratch on 2.5 trillion tokens with approximately 50% Italian content. Where AMÁLIA’s weights are not yet public, Minerva published weights, training data, and code, making it truly open from day one. By every institutional measure, the Italian approach worked. But the empirical results contain a finding the press coverage has been quiet about, and its implications extend well beyond Italy.
Same problem. Opposite path.
European sovereign-LLM development has two primary architectural approaches. Italy chose to train from scratch on a substantial native-language foundation. Portugal chose continued pre-training of a multilingual model. The structural comparison surfaces what each commitment actually requires operationally.
The comparison is not “Italy did it better than Portugal.” Both projects respond to the same structural problem with different architectural strategies under different institutional and economic constraints. Italy’s national-AI investment is structurally larger by an order of magnitude — and Minerva is the visible artifact of that scale.

4.9% on INVALSI. The bitter lesson surfaces.
In June 2024, researchers evaluated Minerva-3B on the INVALSI Italian school-exam benchmark. The result was unambiguous: 4.9%. This is not a critique of Minerva; it is a critique of the public discourse around what Minerva’s empirical results actually demonstrate.
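Whether a 4.9% score counts as near chance depends on how many answer options each question offers. The sketch below is a minimal illustration, not the evaluators' method: `chance_accuracy` and `margin_over_chance` are hypothetical helper names, and the 100 four-option questions are an assumed layout, since the text does not describe the INVALSI question format.

```python
from statistics import mean


def chance_accuracy(option_counts):
    """Expected accuracy of uniform random guessing.

    option_counts: number of answer options for each question.
    """
    if not option_counts or any(k < 1 for k in option_counts):
        raise ValueError("every question needs at least one answer option")
    return mean(1.0 / k for k in option_counts)


def margin_over_chance(observed_accuracy, option_counts):
    """Observed accuracy minus the random-guess baseline (both in [0, 1])."""
    return observed_accuracy - chance_accuracy(option_counts)


# ASSUMPTION: 100 four-option questions. This is NOT the real INVALSI layout.
hypothetical_options = [4] * 100
baseline = chance_accuracy(hypothetical_options)
margin = margin_over_chance(0.049, hypothetical_options)  # 4.9% from the text
print(f"chance baseline: {baseline:.3f}, margin over chance: {margin:+.3f}")
```

Under that assumed layout, random guessing would land at 25%, so a 4.9% score would sit below the guessing baseline rather than near it. With open-ended or many-option items the baseline drops and the comparison changes, so the benchmark's real item mix would need to be plugged in before drawing conclusions.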

350M to 7B. Four parameter scales, one architecture.
The Minerva model family covers four parameter tiers, each with specific training corpora. Each scale level reveals what the from-scratch path actually requires at different operating points.
[Table not recovered. Surviving corpus notes, not matched to tiers: Italian + English; 100B English; ~50% English; + 200B code.]
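A rough sense of what each tier requires can be had from the widely used rule of thumb that training a dense transformer costs about 6 x N x D floating-point operations, for N parameters and D training tokens. The sketch below applies it to the parameter counts named in the text. The function name is hypothetical, and using the 2.5 trillion token figure for every tier is an assumption made purely for illustration, since per-tier token budgets are not given here.

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute for a dense transformer: ~6 * N * D FLOPs."""
    if n_params <= 0 or n_tokens <= 0:
        raise ValueError("n_params and n_tokens must be positive")
    return 6.0 * n_params * n_tokens


# Parameter counts named in the text (the fourth tier is not named there).
tiers = {"350M": 350e6, "3B": 3e9, "7B": 7e9}

# ASSUMPTION: the same 2.5T-token budget for every tier, for illustration only.
ASSUMED_TOKENS = 2.5e12

for name, n_params in tiers.items():
    flops = approx_training_flops(n_params, ASSUMED_TOKENS)
    print(f"{name}: ~{flops:.2e} training FLOPs")
```

Because the estimate is linear in both N and D, moving from the smallest to the largest tier at a fixed token budget multiplies compute by the ratio of parameter counts, here 20x. Real budgets also depend on hardware utilization and on the actual tokens seen per tier, which this approximation ignores.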

Three answers. Same question.
Minerva, AMÁLIA, and OpenEuroLLM represent the three operational answers to the European sovereign-LLM question. Each makes different architectural and institutional bets. The strategic discourse benefits from treating all three as data points in the same empirical experiment.

Three standards the movement should adopt.
The structural critique generalizes beyond Minerva. The European sovereign-LLM movement benefits from internalizing these lessons across every subsequent national project. Italy modeled the openness standard; the movement should adopt it as the norm.
Minerva is one valid answer to the European sovereign-LLM question. AMÁLIA is another. OpenEuroLLM is potentially a third. The strategic discourse benefits from treating all three as data points in the same empirical experiment rather than as competing national-prestige projects. More analysis like this is needed. Not less.
Implications for European Sovereign-Language Models
This development indicates that large-scale investment in native-language data alone may not guarantee the desired level of country-specific language understanding in LLMs. The results imply that scaling up model size and training data is essential, challenging the assumption that smaller, country-focused models can achieve comparable performance without significant resource commitments. The findings underscore the need for European projects to consider the importance of scale in their strategic planning for sovereign AI infrastructure.
Background on European Sovereign-Language Model Strategies
Italy’s Minerva project represents a contrasting approach within the European sovereign-LLM debate, which also includes Portugal’s AMÁLIA model. While AMÁLIA continued pre-training a multilingual foundation with a small proportion of European Portuguese data, Minerva trained from scratch on a large corpus of which roughly half was Italian. Despite these differing strategies, both projects reveal that achieving deep, country-specific language understanding remains a challenge at current scales. Italy’s extensive investment and open-data approach produced a technically impressive model, yet its performance on academic benchmarks was unexpectedly low, raising questions about the effectiveness of from-scratch, native-language-heavy training at existing parameter levels.
“While dataset composition is important, the overall size of the dataset and the number of parameters are more crucial for handling complex language tasks.”
— Research team evaluating Minerva
Unresolved Questions About Model Scaling and Performance
It remains unclear whether increasing model size beyond 7 billion parameters, or further expanding the training dataset, would significantly improve Minerva’s performance on complex language tasks. The exact thresholds at which native-language models can reliably perform at academic or professional levels are still unknown. Additionally, the broader implications for other European languages and the generalizability of these findings are yet to be determined.
Next Steps in European Sovereign-Language AI Development
The Minerva team plans to continue iterative research, including ongoing experiments with larger models and varied training strategies. Future evaluations will focus on whether scaling up model size and training data can close the performance gap observed on academic benchmarks. Policymakers and researchers will need to reassess resource allocations and strategic priorities, potentially emphasizing larger-scale investments to achieve country-specific language understanding. Public disclosure of further results and methodological refinements is expected in the coming months.
Key Questions
Why did Minerva perform so poorly on the Italian exam?
Despite extensive native-language data and large-scale training, Minerva’s performance was limited by the model size and possibly the complexity of the tasks. The empirical evidence suggests that current parameter scales may be insufficient for deep language understanding in specialized contexts.
Does this mean smaller models are useless for country-specific tasks?
Not necessarily. Smaller models can still be useful for many applications, but achieving deep, country-specific knowledge comparable to native speakers may require larger models and more resources than currently allocated.
What does this imply for European AI sovereignty strategies?
It suggests that European projects should consider scaling up model size and training data significantly, rather than relying solely on native-language data or smaller models, to meet their strategic goals.
Will increasing model size solve the performance issues?
It is not yet certain. While scaling is likely necessary, the exact scale needed to achieve desired benchmarks remains unknown, and other factors such as training methodology and data quality also play roles.
Are there plans to improve Minerva further?
Yes, the team is continuing research with larger models and different training approaches, aiming to enhance performance on complex language tasks.
Source: ThorstenMeyerAI.com