Evaluation and Comparison of Open-Source LLMs Using Natural Language Generation Quality Metrics

Dzenan Hamzic, Markus Wurzenberger, Florian Skopik, Max Landauer, Andreas Rauber
15. 12. 2024.

The rapid advancement of Large Language Models (LLMs) has transformed natural language processing, yet comprehensive evaluation methods are necessary to ensure their reliability, particularly in Retrieval-Augmented Generation (RAG) tasks. This study aims to evaluate and compare the performance of open-source LLMs by introducing a rigorous evaluation framework. We benchmark 20 LLMs using a combination of established metrics such as BLEU, ROUGE, and BERTScore, along with a novel metric, RAGAS. The models were tested across two distinct datasets to assess their text generation quality. Our findings reveal that models like nous-hermes-2-solar-10.7b and mistral-7b-instruct-v0.1 consistently excel in tasks requiring strict instruction adherence and effective use of large contexts, while other models show areas for improvement. This research contributes to the field by offering a comprehensive evaluation framework that aids in selecting the most suitable LLMs for complex RAG applications, with implications for future developments in natural language processing and big data analysis.
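To illustrate how the reference-based metrics named in the abstract (BLEU, ROUGE, BERTScore) are typically computed for a generated answer against a reference answer, the following is a minimal sketch using the Hugging Face evaluate library. The example sentences and the choice of this particular library are illustrative assumptions, not the authors' actual benchmark setup or data.

```python
# Hypothetical comparison of one model-generated answer against a reference answer.
# Assumes: pip install evaluate bert_score rouge_score nltk
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# Placeholder texts; in a real RAG evaluation these would come from the
# benchmarked LLM outputs and the dataset's gold reference answers.
predictions = ["Retrieval-augmented generation grounds answers in retrieved documents."]
references = ["Retrieval-augmented generation grounds model answers in the retrieved context documents."]

bleu_result = bleu.compute(predictions=predictions, references=references)
rouge_result = rouge.compute(predictions=predictions, references=references)
bert_result = bertscore.compute(predictions=predictions, references=references, lang="en")

print("BLEU:", bleu_result["bleu"])            # n-gram precision with brevity penalty
print("ROUGE-L:", rouge_result["rougeL"])      # longest-common-subsequence overlap
print("BERTScore F1:", bert_result["f1"][0])   # embedding-based semantic similarity
```

BLEU and ROUGE reward surface n-gram overlap, while BERTScore compares contextual embeddings and so credits paraphrases; combining them with a RAG-specific metric such as RAGAS is what allows the kind of multi-faceted comparison across models that the study describes.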

