Salamandra Model Card

This repository contains the model described in Salamandra Technical Report.

Salamandra is a highly multilingual model pre-trained from scratch that comes in three different sizes — 2B, 7B and 40B parameters — with their respective base and instruction-tuned variants. This model card corresponds to the 7B instructed version.

To visit the model cards of other Salamandra versions, please refer to the Model Index.

The entire Salamandra family is released under a permissive Apache 2.0 license. Along with the open weights, all training scripts and configuration files are made publicly available in this GitHub repository.

DISCLAIMER: This model is a first proof-of-concept designed to demonstrate the instruction-following capabilities of recently released base models. It has been optimized to engage in conversation but has NOT been aligned through RLHF to filter or avoid sensitive topics. As a result, it may generate harmful or inappropriate content. The team is actively working to enhance its performance through further instruction and alignment with RL techniques.

Model Details

Description

Transformer-based decoder-only language model that has been pre-trained from scratch on 12.875 trillion tokens of highly curated data. The pre-training corpus contains text in 35 European languages and code.

Hyperparameters

The full list of hyperparameters for each model can be found here.

Architecture


Total Parameters	7,768,117,248
Embedding Parameters	1,048,576,000
Layers	32
Hidden size	4,096
Attention heads	32
Context length	8,192
Vocabulary size	256,000
Precision	bfloat16
Embedding type	RoPE
Activation Function	SwiGLU
Layer normalization	RMS Norm
Flash attention	✅
Grouped Query Attention	✅
Num. query groups	8
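Some figures in the table can be cross-checked with simple arithmetic. The sketch below recomputes the embedding parameter count and estimates the key/value cache size for one sequence at full context. It rests on two assumptions not stated in the table: that the per-head dimension equals the hidden size divided by the number of attention heads, and that "query groups" is the number of key/value heads.

```python
# Values copied from the architecture table.
VOCAB_SIZE = 256_000
HIDDEN_SIZE = 4_096
LAYERS = 32
ATTENTION_HEADS = 32
QUERY_GROUPS = 8        # assumed to be the number of key/value heads
CONTEXT_LENGTH = 8_192
BYTES_PER_VALUE = 2     # bfloat16

# One vector of size HIDDEN_SIZE per vocabulary entry.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

# Assumption: head dimension = hidden size / number of attention heads.
head_dim = HIDDEN_SIZE // ATTENTION_HEADS
queries_per_group = ATTENTION_HEADS // QUERY_GROUPS

# Keys and values (factor 2) for every layer, KV head and position.
kv_cache_bytes = 2 * LAYERS * QUERY_GROUPS * head_dim * CONTEXT_LENGTH * BYTES_PER_VALUE

print(f"embedding parameters: {embedding_params:,}")
print(f"head dimension: {head_dim}, query heads per group: {queries_per_group}")
print(f"KV cache at full context: {kv_cache_bytes / 2**30} GiB")
```

Under these assumptions the embedding count matches the table, and grouped query attention keeps the cache at a quarter of what 32 key/value heads would need.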

Intended Use

Direct Use

The models are intended for both research and commercial use in any of the languages included in the training data. The base models are intended either for language generation or to be further fine-tuned for specific use-cases. The instruction-tuned variants can be used as general-purpose assistants, as long as the user is fully aware of the model’s limitations.

Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.

Hardware and Software

Training Framework

Pre-training was conducted using NVIDIA’s NeMo Framework, which leverages PyTorch Lightning for efficient model training in highly distributed settings.

The instruction-tuned versions were produced with FastChat.

Compute Infrastructure

All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:

4x NVIDIA Hopper GPUs with 64 GB HBM2 memory
2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz and 32c each (64 cores)
4x NDR200 (BW per node 800 Gb/s)
512 GB of main memory (DDR5)
460 GB of NVMe storage

Model	Nodes	GPUs
2B	64	256
7B	128	512
40B	256 / 512	1,024 / 2,048

How to use

The instruction-following models use the commonly adopted ChatML template:

{%- if messages[0]['role'] == 'system' %}{%- set system_message = messages[0]['content'] %}{%- set loop_messages = messages[1:] %}{%- else %}{%- set system_message = 'SYSTEM MESSAGE' %}{%- set loop_messages = messages %}{%- endif %}{%- if not date_string is defined %}{%- set date_string = '2024-09-30' %}{%- endif %}{{ '<|im_start|>system\n' + system_message + '<|im_end|>\n' }}{% for message in loop_messages %}{%- if (message['role'] != 'user') and (message['role'] != 'assistant')%}{{ raise_exception('Only user and assitant roles are suported after the initial optional system message.') }}{% endif %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}

Here, system_message guides the model during generation, and date_string can be set so that the model is able to respond with the current date.

The exact same chat template should be used for an enhanced conversational experience. The easiest way to apply it is by using the tokenizer's built-in functions, as shown in the following snippet.

from datetime import datetime

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-7b-instruct"

text = "At what temperature does water boil?"

# Load the tokenizer and the model weights in bfloat16 on the available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# A conversation is a list of turns; here it is a single user message.
message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime("%Y-%m-%d")

# Render the ChatML prompt; date_string is forwarded to the chat template.
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string,
)

# The rendered prompt already contains the special tokens, so none are added here.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Using this template, each turn is preceded by a <|im_start|> delimiter and the role of the entity (either user, for content supplied by the user, or assistant for LLM responses), and finished with the <|im_end|> token.
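To make the layout concrete, here is a minimal pure-Python sketch of what the template renders. The function name render_chatml is hypothetical and is not part of the tokenizer API; the tokenizer's built-in chat template remains the reference. The sketch leaves out date_string because the template as printed above does not insert it into the rendered text.

```python
def render_chatml(messages, system_message="SYSTEM MESSAGE", add_generation_prompt=True):
    """Render a conversation in the ChatML layout described above (illustrative only)."""
    # An optional leading system message replaces the default one.
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        messages = messages[1:]

    parts = [f"<|im_start|>system\n{system_message}<|im_end|>\n"]
    for index, message in enumerate(messages):
        role = message["role"]
        if role not in ("user", "assistant"):
            raise ValueError("Only user and assistant roles are supported after the system message.")
        # Turns must alternate, starting with the user.
        if (role == "user") != (index % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant/user/assistant/...")
        parts.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>\n")

    # Open an assistant turn so that the model continues from there.
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "At what temperature does water boil?"}])
print(prompt)
```

Writing the rules out this way shows why a conversation that starts with an assistant turn, or repeats a role twice in a row, is rejected.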

Data

Pretraining Data

The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below. The initial three training epochs used 2.4 trillion tokens each, obtained by manually adjusting data proportions to balance the representation and give more importance to Spain’s co-official languages (Spanish, Catalan, Galician, and Basque). Specifically, code and English data were downsampled to half, Spain’s co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions. During the following epochs, the English part of the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset. This adjustment resulted in a total of 2.68 trillion tokens per epoch, distributed as outlined below:

[Figure: language distribution of the pre-training corpus]
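The resampling rule described above can be sketched as a set of per-source multipliers. The raw counts in this example are hypothetical placeholders in arbitrary units, not the real corpus statistics, and the names MULTIPLIERS and resample are illustrative only.

```python
# Multipliers described in the text: code and English halved,
# Spain's co-official languages doubled, everything else unchanged.
MULTIPLIERS = {"code": 0.5, "en": 0.5, "es": 2.0, "ca": 2.0, "gl": 2.0, "eu": 2.0}


def resample(token_counts):
    """Apply the per-source multiplier (1.0 by default) and return the resulting shares."""
    weighted = {name: count * MULTIPLIERS.get(name, 1.0) for name, count in token_counts.items()}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


# Hypothetical raw counts in arbitrary units, NOT the real corpus statistics.
raw = {"en": 400, "code": 200, "es": 100, "ca": 20, "fr": 80}
shares = resample(raw)
for name, share in shares.items():
    print(f"{name}: {share:.3f}")
```

With these placeholder counts, Spanish ends up with the same share as English even though its raw count is four times smaller.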

The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes 53.05% of the total tokens. Following this, Starcoder provides 13.67%, and FineWeb-Edu (350BT subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%. Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing between 1.41% and 1.72%. These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model. The remaining 10% comes from smaller sources in various languages.


The model was trained for 3 pre-training epochs with 2.4T tokens per epoch, followed by 2 additional pre-training epochs in which the English part of the Colossal OSCAR dataset was replaced with FineWeb-Edu (350BT subset), resulting in 2.68T tokens per epoch, and 1 final epoch of 0.315T higher-quality tokens. The total number of tokens seen during pre-training is therefore approximately 12.875 trillion.
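The total follows directly from the per-epoch figures. As a quick check, using only the numbers stated above:

```python
# (number of epochs, trillions of tokens per epoch), as stated in the text.
PHASES = [
    (3, 2.4),    # initial epochs
    (2, 2.68),   # epochs with FineWeb-Edu replacing the English part of Colossal OSCAR
    (1, 0.315),  # final epoch of higher-quality tokens
]

total_trillions = sum(epochs * tokens for epochs, tokens in PHASES)
print(f"{total_trillions:.3f}T tokens")
```

That is, 7.2T + 5.36T + 0.315T = 12.875T tokens.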

We provide an extensive Datasheet section following the best practices defined by Gebru et al. (2021).

Finetuning Data

This instruction-tuned variant has been fine-tuned with a collection of 273k instructions, focusing on performance in Catalan, English and Spanish. However, instruction data for other closely related Iberian languages has also been included, since it had a positive impact on the languages of interest. That said, performance in these additional languages is not guaranteed due to the limited amount of available data and the lack of resources for thorough testing.

Dataset	ca	en	es	eu	gl	pt	Total
alpaca-cleaned	--	49,950	--	--	--	--	49,950
aya-dataset	--	3,941	3,851	939	--	8,995	17,726
coqcat	4,797	--	--	--	--	--	4,797
databricks-dolly-15k	--	15,011	--	--	--	--	15,011
dolly-ca	3,232	--	--	--	--	--	3,232
flores-dev	986	1,037	1,964	493	505	--	4,985
mentor-ca	7,119	--	--	--	--	--	7,119
mentor-es	--	--	7,122	--	--	--	7,122
no-robots	--	9,485	--	--	--	--	9,485
oasst-ca	2,517	--	--	--	--	--	2,517
oasst2	750	31,086	15,438	190	197	1,203	48,864
open-orca	--	49,996	--	--	--	--	49,996
rag-multilingual	16,043	14,997	11,263	--	--	--	42,303
tower-blocks	--	7,762	1,000	--	--	1,000	9,762
Total	35,444	183,265	40,638	1,622	702	11,198	272,869

Evaluation

Gold-standard benchmarks

Evaluation is done using the Language Model Evaluation Harness (Gao et al., 2024). We evaluate on a set of tasks taken from SpanishBench, CatalanBench, BasqueBench and GalicianBench. These benchmarks include both new and existing tasks and datasets. Given that this is an instructed model, we enable LM Evaluation Harness's native chat-template feature in the setup. In the tables below, we include the results for a selection of evaluation datasets that represent the model's performance across a variety of tasks within these benchmarks.

We only use tasks that are either human generated, human translated, or with a strong human-in-the-loop (i.e., machine translation followed by professional revision or machine generation followed by human revision and annotation). This is the reason behind the variety in number of tasks reported across languages. As more tasks that fulfill these requirements are published, we will update the presented results. We also intend to expand the evaluation to other languages, as long as the datasets meet our quality standards.

During the implementation of the evaluation we observed a series of issues worth considering when replicating and interpreting the results presented. These issues include ≈1.5% variances in performance in some tasks depending on the version of the transformers library used, and depending on the use (or lack of use) of tensor parallelism when loading a model. When implementing existing tasks, we carry out a comprehensive quality evaluation of the dataset, the Harness task itself, and what kind of input models see during evaluation. Our implementation (see links above) addresses multiple existing problems such as errors in datasets and prompts, and lack of pre-processing. All this means that results will vary if using other Harness implementations, and may slightly vary depending on the replication setup.

It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the model's capabilities and potential. We thus advise caution when reading and interpreting the results.

A full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving with task implementation are available in the technical report.

All results reported below are in a 0-shot setting.

Spanish

Category	Task	Metric	Result
Commonsense Reasoning	xstorycloze_es	acc	68.17
NLI	wnli_es	acc	56.34
NLI	xnli_es	acc	46.95
Paraphrasing	paws_es	acc	64.25
QA	xquad_es	acc	36.22
Translation	flores_es	bleu	19.29

Catalan

Category	Task	Metric	Result
Commonsense Reasoning	copa_ca	acc	82.20
Commonsense Reasoning	xstorycloze_ca	acc	70.75
NLI	wnli_ca	acc	60.56
NLI	xnli_ca	acc	57.04
Paraphrasing	parafraseja	acc	66.25
Paraphrasing	paws_ca	acc	67.55
QA	arc_ca_easy	acc	68.77
QA	arc_ca_challenge	acc	42.49
QA	openbookqa_ca	acc	37.00
QA	piqa_ca	acc	71.22
QA	siqa_ca	acc	47.85
Translation	flores_ca	bleu	24.13

Basque

Category	Task	Metric	Result
Commonsense Reasoning	xcopa_eu	acc	62.80
Commonsense Reasoning	xstorycloze_eu	acc	62.14
NLI	wnli_eu	acc	50.70
NLI	xnli_eu	acc	49.83
QA	eus_exams	acc	39.13
QA	eus_proficiency	acc	38.94
QA	eus_trivia	acc	47.81
Reading Comprehension	eus_reading	acc	47.16
Translation	flores_eu	bleu	13.02

Galician

Category	Task	Metric	Result
Paraphrasing	parafrases_gl	acc	56.46
Paraphrasing	paws_gl	acc	61.50
QA	openbookqa_gl	acc	32.40
Translation	flores_gl	bleu	22.35

LLM-as-a-judge

We use Prometheus-2 8x7B as a judge to evaluate the responses of the model. Tasks are created from existing multilingual evaluation datasets covering the same categories as the ones measured in our gold-standard benchmarks. We randomly select a subset of 250 instances per language from the test set of each source dataset. To evaluate the responses of our model, we use task-specific criteria developed in-house for the LLM-judge to use. Each criterion is measured either as a 5-point Likert scale or as a binary task depending on the idiosyncrasy of the task and criterion.

Prompts for each task are created in various ways to score the model's robustness in addition to these criteria. This is done by presenting the same source instance within three different prompts. We then calculate the variance between the scores assigned by the LLM-judge to our model's responses to the three prompt styles and average it across all instances. Prompts are human translated to all languages measured. We do not provide the LLM-judge with a reference answer.
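The robustness computation described above can be sketched as follows. The scores are hypothetical, and the use of population variance is an assumption, since the text does not say which variance estimator is applied.

```python
from statistics import mean, pvariance


def robustness(scores_per_instance):
    """Average, over instances, of the variance of the judge scores across prompt styles."""
    # Assumption: population variance; the estimator is not specified in the text.
    return mean([pvariance(scores) for scores in scores_per_instance])


# Hypothetical judge scores: one row per instance, one value per prompt style.
scores = [
    (4.0, 4.0, 4.0),  # identical scores: variance 0
    (3.0, 4.0, 5.0),  # variance 2/3
    (2.0, 2.0, 5.0),  # variance 2
]
print(round(robustness(scores), 3))
```

A value of 0 means the judge assigned the same score to all three prompt styles for every instance; larger values indicate more sensitivity to prompt wording.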

The judge prompt we use during evaluation is the same one used to fine-tune the Prometheus-2 family. We keep the judge prompt and criteria used to present the LLM-judge with the task prompts and model responses in English for evaluation across languages. The judge prompt used is:

"You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance.

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between {a} and {b}. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between {a} and {b})\"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{input}

###Response to evaluate:
{prediction}

###Score Rubrics:
{criteria}

###Feedback:"
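Because the judge is asked to end its answer with "[RESULT]" followed by an integer, the score can be recovered with a small parser. The following is a minimal sketch; the function name and the example judge output are hypothetical, and it is not the parsing code used in the evaluation.

```python
import re

# Output format requested in the judge prompt: "Feedback: ... [RESULT] <integer>".
RESULT_PATTERN = re.compile(r"\[RESULT\]\s*(\d+)\s*$")


def parse_judge_output(text, low, high):
    """Return (feedback, score), or None if the output does not follow the requested format."""
    text = text.strip()
    match = RESULT_PATTERN.search(text)
    if match is None:
        return None
    score = int(match.group(1))
    # Reject scores outside the range given by {a} and {b} in the prompt.
    if not low <= score <= high:
        return None
    feedback = text[: match.start()].strip()
    # The judge may or may not repeat the "Feedback:" label.
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    return feedback, score


# Hypothetical judge output for a 5-point Likert criterion.
example = "Feedback: The steps are logical but the last one is not justified. [RESULT] 4"
print(parse_judge_output(example, low=1, high=5))
```

For a binary criterion, the same function would be called with low=0 and high=1.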

As an example, prompts for the Math task in English are based on instances from MGSM, and each instance is presented within these prompts:

"en": [
 ("I need help with this math problem: \"", "\" Give me the answer step by step and also the final result separately."),
 ("Can you please help me answer this? \"", "\" Explain the answer and give me the final result as well. Thanks."),
 ("Help me with this problem: \"", "\" I need the answer explained and the final result separately.")
]
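Each pair above is a prefix and a suffix placed around the source instance. A minimal sketch of how the three prompts for one instance could be assembled follows; the names MATH_PROMPTS and build_prompts and the instance text are hypothetical.

```python
# Prompt frames copied from the example above: (prefix, suffix) pairs around the instance.
MATH_PROMPTS = {
    "en": [
        ("I need help with this math problem: \"", "\" Give me the answer step by step and also the final result separately."),
        ("Can you please help me answer this? \"", "\" Explain the answer and give me the final result as well. Thanks."),
        ("Help me with this problem: \"", "\" I need the answer explained and the final result separately."),
    ]
}


def build_prompts(instance, language="en"):
    """Wrap one source instance in every prompt frame available for the language."""
    return [prefix + instance + suffix for prefix, suffix in MATH_PROMPTS[language]]


# Hypothetical instance text, used only for illustration.
prompts = build_prompts("What is 12 plus 30?")
for prompt in prompts:
    print(prompt)
```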

This task is then evaluated by the LLM-judge using two criteria, reasoning capability (5-point Likert) and mathematical correctness (binary):

reasoning_capability_criteria = {
 "reasoning_capability": """
[Does the model's answer demonstrate reasoning capability?]
Score 1: The answer demonstrates poor reasoning, with illogical arguments or conclusions that do not follow from the provided information.
Score 2: The answer shows weak reasoning, with some logical connections but also contains significant flaws or gaps in the argumentation.
Score 3: The answer demonstrates adequate reasoning, with generally logical arguments, but may have minor flaws or a lack of depth in the reasoning process.
Score 4: The answer shows strong reasoning, with well-structured arguments and conclusions that logically follow from the information provided.
Score 5: The answer demonstrates exceptional reasoning, with clear, coherent, and insightful arguments that are logically sound and well-supported by the information provided."""
}

mathematical_correctness_binary_criteria = {
 "mathematical_correctness_binary": """
[Is the model's answer mathematically correct?]
Score 0: The answer contains mathematical errors that render the solution incorrect or unreliable.
Score 1: The answer is mathematically correct, with accurate calculations and appropriate use of mathematical concepts."""
}

Multilingual results

Here, we present results for seven categories of tasks in Spanish, Catalan, Basque, Galician, and English. Results are presented for each task, criterion and language. Criteria with a (B) after their name are binary criteria (i.e., numbers go from 0 to 1, where 1 is best). The rest of the criteria are measured using a 5-point Likert scale, where 5 is best. The first number of the pair of numbers separated by / shows the average score for the criterion (and language). The second number of each pair is the robustness score, where numbers closer to 0 mean that the model generates similar responses when comparing the three prompt varieties for a single instance.

Further details on all tasks and criteria, a full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving with task implementation are available in the technical report.

Category	Dataset	Criteria	es	ca	gl	eu	en
Commonsense Reasoning	XStoryCloze	Ending coherence	3.24/0.63	3.12/0.51	2.87/0.59	2.16/0.52	3.71/0.50
Paraphrasing	PAWS	Completeness `(B)`	0.86/0.07	0.82/0.09	0.78/0.10	-- / --	0.92/0.05
Paraphrasing	PAWS	Paraphrase generation	3.81/0.54	3.67/0.55	3.56/0.57	-- / --	3.98/0.37
Paraphrasing	PAWS	Grammatical correctness `(B)`	0.93/0.03	0.92/0.05	0.89/0.06	-- / --	0.96/0.03
Reading Comprehension	Belebele	Passage comprehension	3.43/0.43	3.28/0.50	3.02/0.56	2.61/0.43	3.43/0.58
Reading Comprehension	Belebele	Answer relevance `(B)`	0.86/0.05	0.84/0.05	0.75/0.08	0.65/0.11	0.83/0.06
Extreme Summarization	XLSum & caBreu & summarization_gl	Informativeness	3.37/0.34	3.57/0.31	3.40/0.31	-- / --	3.32/0.26
Extreme Summarization	XLSum & caBreu & summarization_gl	Conciseness	3.06/0.34	2.88/0.50	3.09/0.38	-- / --	3.32/0.22
Math	MGSM	Reasoning capability	3.29/0.72	3.16/0.65	3.33/0.60	2.56/0.52	3.35/0.65
Math	MGSM	Mathematical correctness `(B)`	0.68/0.12	0.65/0.13	0.73/0.11	0.59/0.13	0.67/0.12
Translation from Language	FLORES-200	Fluency	3.95/0.11	3.88/0.15	-- / --	-- / --	3.92/0.14
Translation from Language	FLORES-200	Accuracy	4.22/0.15	4.25/0.21	-- / --	-- / --	4.25/0.23
Translation to Language	FLORES-200	Fluency	3.92/0.11	3.84/0.14	-- / --	-- / --	4.19/0.14
Translation to Language	FLORES-200	Accuracy	4.31/0.16	4.18/0.20	-- / --	-- / --	4.63/0.15

Ethical Considerations and Limitations

We examine undesired societal and cognitive biases in this model using different benchmarks. For societal biases, we test performance using the BBQ dataset (Parrish et al., 2022) in the original English and the Regard dataset (Sheng et al., 2019). We report that while performance is high (accuracies around 0.8 depending on the social category) in disambiguated settings, the model performs very poorly in ambiguous settings, which indicates the presence of societal biases that need to be further addressed in post-training phases.

Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings. For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe significant but relatively weak primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers. We measure majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects, with a small effect size. This suggests that the model is relatively robust against the examined cognitive biases.

We highlight that our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources in all languages present in the training data. We aim to gradually extend and expand our analyses in future work.

These results can be expected from a model that has undergone only a preliminary instruction tuning. These tests are performed in order to show the biases the model may contain. We urge developers to take them into account and perform safety testing and tuning tailored to their specific applications of the model.

Additional information

Author

The Language Technologies Unit from Barcelona Supercomputing Center.

Contact

For further information, please send an email to langtech@bsc.es.

Copyright

Funding

This work has been promoted and financed by the Government of Catalonia through the Aina Project.

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of ILENIA Project with reference 2022/TL22/00215337.

Acknowledgements

This project has benefited from the contributions of numerous teams and institutions, mainly through data contributions, knowledge transfer or technical support.

In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.

At the national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.

At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration. We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially to: Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipes Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.

Their valuable efforts have been instrumental in the development of this work.

Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

Citation

@misc{gonzalezagirre2025salamandratechnicalreport,
 title={Salamandra Technical Report}, 
 author={Aitor Gonzalez-Agirre and Marc Pàmies and Joan Llop and Irene Baucells and Severino Da Dalt and Daniel Tamayo and José Javier Saiz and Ferran Espuña and Jaume Prats and Javier Aula-Blasco and Mario Mina and Adrián Rubio and Alexander Shvets and Anna Sallés and Iñaki Lacunza and Iñigo Pikabea and Jorge Palomar and Júlia Falcão and Lucía Tormo and Luis Vasquez-Reina and Montserrat Marimon and Valle Ruíz-Fernández and Marta Villegas},
 year={2025},
 eprint={2502.08489},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2502.08489}, 
}