Evaluating Large Language Models With Human Feedback: Establishing A Swedish Benchmark
Abstract
In the rapidly evolving field of artificial intelligence, large language models
(LLMs) have demonstrated significant capabilities across numerous applications.
However, the performance of these models in languages with fewer resources, such
as Swedish, remains under-explored. This study introduces a comprehensive human
benchmark that uses forced-choice ranking to assess the efficacy of prominent LLMs
in understanding and generating Swedish-language text. We employ a modified version
of the Chatbot Arena benchmark, incorporating human feedback to evaluate twelve
different models, including GPT-4, GPT-3.5, various Claude and Llama models, and
bespoke models such as Dolphin-2.9-llama3b-8b-flashback and BeagleCatMunin. These
models were chosen based on their performance on the LMSYS Chatbot Arena and
ScandEval benchmarks. We release the chatbotarena.se benchmark as a tool to improve
our understanding of language model performance in Swedish, in the hope that it will
be widely used. We aim to create a leaderboard once sufficient data has been collected
and analysed.
1 Introduction
Large language models (LLMs) have demonstrated exceptional capabilities across
various applications, significantly advancing the field of natural language processing.
However, their effectiveness in low-resource languages remains underexplored. The
majority of these models are optimized for English or other high-resource languages,
leading to a notable performance disparity when applied to less commonly used
languages. This gap in model performance has significant implications:
• Accessibility: Individuals who are native speakers of low-resource languages
but not proficient in English are less able to benefit from the advancements in
AI and machine learning. This creates a barrier to accessing technology that
could otherwise support education, business, and communication.
• Potential model improvements: Since current models are not primarily tailored
to these languages, a better understanding of their behaviour, combined with
fine-tuning techniques, has the potential to significantly improve performance
in the target language.
• Inclusivity: A public human-feedback benchmark allows models to be evaluated
by a broad group of speakers rather than a select few. This inclusivity helps
prevent biases that may otherwise emerge if only a limited demographic is
considered.
Swedish Chatbot Arena uses forced ranking between pairs of models to build a bench-
mark of model performance based on Elo ratings. This enables a fair evaluation of
models while allowing a broad section of the public to take part in democratically
evaluating LLM performance.
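To make the rating mechanism concrete, the sketch below shows how a single forced-choice vote could translate into an Elo update. The K-factor and starting rating are illustrative assumptions, not the parameters used by the live benchmark.

```python
# Minimal sketch of how one forced-choice vote could update Elo ratings.
# The K-factor and starting rating are illustrative assumptions, not the
# exact parameters used by chatbotarena.se.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one pairwise vote (forced choice, no draws)."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (s_a - e_a)
    r_b_new = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: two models start at 1000; model A wins one vote.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = update_elo(
    ratings["model_a"], ratings["model_b"], a_wins=True
)
print(ratings)  # model_a gains ~16 points, model_b loses ~16
```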
Automatic evaluation tools offer complementary advantages:
• Efficiency: Automatic tools can evaluate models much more quickly than
human-based assessments. This speed is crucial for iterating over model designs,
allowing developers to refine and test models continuously without significant
delays.
• Consistency: These tools apply the same standards and methodologies across
different models, ensuring that the evaluations are consistent. This uniformity
is essential for fair comparisons between models, facilitating a clearer
understanding of each model's strengths and weaknesses.
The use of automated evaluation tools is crucial in the LLM development lifecycle
as they provide rapid feedback that is essential for timely and effective model training
and refinement. By leveraging these tools, developers can accelerate the development
process and enhance the overall quality and effectiveness of their models.
2 Method
2.1 Evaluation platform
Our evaluation platform is a fork of the LMSYS project, which currently runs
on an RTX-4090 GPU with enough memory to serve two open-source models.
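For orientation, the sketch below shows one way the upstream FastChat (LMSYS) serving stack is typically started for two locally hosted models: a controller, one model worker per model, and the arena web UI. The module names and flags follow upstream FastChat; the model paths, ports, and exact configuration of our fork are placeholders and may differ.

```python
# Illustrative launcher for the upstream FastChat serving stack. Module names
# and flags follow upstream FastChat; model paths and ports are placeholders
# and do not necessarily match the chatbotarena.se deployment.
import subprocess
import time

CONTROLLER = "http://localhost:21001"

# Central controller that keeps track of registered model workers.
controller = subprocess.Popen(["python3", "-m", "fastchat.serve.controller"])
time.sleep(10)  # give the controller time to start before workers register

# One worker per locally served model; both share the single RTX-4090.
workers = [
    subprocess.Popen([
        "python3", "-m", "fastchat.serve.model_worker",
        "--model-path", path,                      # placeholder model paths
        "--controller-address", CONTROLLER,
        "--port", str(port),
        "--worker-address", f"http://localhost:{port}",
    ])
    for path, port in [
        ("path/to/dolphin-2.9-llama3-8b-flashback", 31000),
        ("path/to/BeagleCatMunin", 31001),
    ]
]

# Gradio front end with the side-by-side (arena) battle mode.
ui = subprocess.Popen(["python3", "-m", "fastchat.serve.gradio_web_server_multi",
                       "--controller-url", CONTROLLER])
ui.wait()
```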
2.2 Model Selection
Twelve language models were initially selected for inclusion based on their
Swedish-language performance on LMSYS and ScandEval. Constraints such as availability
in the EU and availability through an API limited the selection. Ten models were
served through an API, while two models, Dolphin-2.9-llama3b-8b-flashback and
BeagleCatMunin, were served on a local GPU.
The models currently included in the benchmark can be viewed in Table 1.
Anthropic [Anthropic(2023)] used a technique called Constitutional AI, which relies
on RLAIF for alignment feedback [Bai et al.(2022)]. Claude 3 Opus overtook GPT-4
to become the best-performing model on the LMSYS benchmark before the release of
GPT-4.5. The models are highly capable and, although they are not evaluated on the
ScandEval benchmark, they deserve to be evaluated further in this benchmark.
2.5.1 Dolphin-2.9-llama3b-8b-flashback
Several fine-tunes and merges of Llama 3 models have been made for Swedish.
Dolphin-2.9-llama3b-8b-flashback is a high-performing example of a Llama-3 fine-tune.
The model is a merge of timpal0l/Llama-3-8B-flashback-v1 and cognitivecomputations/dolphin-
2.9-llama3-8b, which in turn are fine-tunes on Swedish-language data (Flashback) and
English-language data, respectively. Notably, the underlying Dolphin model is uncensored
to remove alignment and bias. This model is included in the benchmark for research
purposes but can generate unethical responses.
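The exact merge recipe is not described here; as a rough illustration of what merging two fine-tunes of the same base model involves, the sketch below interpolates the weights of the two parent checkpoints parameter by parameter. This plain linear interpolation is an assumption for illustration only, not the actual recipe behind Dolphin-2.9-llama3b-8b-flashback, which may use SLERP, TIES-merging [Yadav et al.(2023)], or another mergekit method.

```python
# Rough illustration of weight-space merging between two fine-tunes that share
# the same base architecture. This is a plain linear interpolation, NOT the
# actual recipe behind Dolphin-2.9-llama3b-8b-flashback.
import torch
from transformers import AutoModelForCausalLM

alpha = 0.5  # illustrative mixing weight

model_a = AutoModelForCausalLM.from_pretrained(
    "timpal0l/Llama-3-8B-flashback-v1", torch_dtype=torch.bfloat16)
model_b = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/dolphin-2.9-llama3-8b", torch_dtype=torch.bfloat16)

state_b = model_b.state_dict()
merged_state = {}
for name, tensor_a in model_a.state_dict().items():
    # Both checkpoints are Llama-3-8B fine-tunes, so parameter names match.
    merged_state[name] = alpha * tensor_a + (1.0 - alpha) * state_b[name]

model_a.load_state_dict(merged_state)
model_a.save_pretrained("merged-llama3-8b-sketch")
```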
2.6.2 AI-Sweden-20B
AI-Sweden-20B is a model created by AI Sweden and trained on the Nordic Pile
dataset.
3 Types of models evaluated
Since the benchmark contains both open-source and closed-source models, not all
information about how the models were trained is available. However, based on the
information available for the open models, some conclusions about the models can be drawn.
4 Next steps
We hope that everyone will help out in our work on improving Swedish language
models by evaluating models on chatbotarena.se, the home of our benchmark (see also
https://github.com/BirgerMoell/SwedishLLMBenchmark).
Once the evaluations are done and we have gathered enough data, we will present a
leaderboard of the most performant models for Swedish.
4.1 Data
Our benchmark relies on forced binary ranking between two models, which yields a
preference for one answer over the other. This output is useful both as an assessment
of model performance and as a way to create human preference data that can be used
to improve models.
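As an illustration of how such data could be stored, the sketch below shows a hypothetical record for one vote; the field names and example values are ours, not the platform's actual schema. Each forced choice yields a (chosen, rejected) pair that doubles as human preference data for reward modelling or DPO-style fine-tuning.

```python
# Hypothetical shape of one stored vote; field names and values are illustrative,
# not the actual chatbotarena.se schema.
from dataclasses import dataclass
import json

@dataclass
class PreferenceRecord:
    prompt: str       # the user's (Swedish) prompt
    model_a: str      # model identities are anonymised until after the vote
    model_b: str
    response_a: str
    response_b: str
    winner: str       # "model_a" or "model_b" (forced choice, no ties)

record = PreferenceRecord(
    prompt="Förklara skillnaden mellan ett lodjur och en katt.",
    model_a="model_a",
    model_b="model_b",
    response_a="...",
    response_b="...",
    winner="model_a",
)

# Re-shaped into the chosen/rejected convention used by preference-tuning tools.
pair = {
    "prompt": record.prompt,
    "chosen": record.response_a if record.winner == "model_a" else record.response_b,
    "rejected": record.response_b if record.winner == "model_a" else record.response_a,
}
print(json.dumps(pair, ensure_ascii=False, indent=2))
```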
References
[Anthropic(2023)] Anthropic. Model card for Claude 3.
https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card,
2023. Accessed: 2024-05-02.
[Bai et al.(2022)] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson
Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron
McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez,
Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie
Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile
Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi
Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott
Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham,
Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R.
Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam
McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: Harmlessness
from AI feedback, 2022.
[Chiang et al.(2024)] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas
Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan,
Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating
LLMs by human preference, 2024.
[Meta(2023)] Meta. Llama 3 release blog. https://ai.meta.com/blog/meta-llama-3/,
2023. Accessed: 2024-05-02.
[Yadav et al.(2023)] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and
Mohit Bansal. TIES-Merging: Resolving interference when merging models, 2023.