Ismar Kovacevic, Becir Isakovic
24 February 2026

Evaluating LLM-Generated Synthetic Data for Fine-Tuning RoBERTa-Base on SST-2 and MRPC

This paper benchmarks LLM-generated synthetic data for fine-tuning RoBERTa-base on two GLUE tasks, SST-2 sentiment classification and MRPC paraphrase detection, in a low-resource setting with 1,000 real training examples per task. Three regimes are compared: real-only, synthetic-only, and hybrid (1,000 real + 1,000 synthetic), using synthetic data generated by eleven contemporary LLMs. Synthetic-only training stays below the real-only baseline, but hybrid training consistently improves performance: on SST-2, the best hybrid configuration nearly matches the effect of doubling the real data, while on MRPC the gains are smaller but still positive. The results indicate that LLM-generated text is most effective as a supplement to, rather than a replacement for, human-labeled data.
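The three training regimes described above can be sketched as dataset assembly. This is a minimal illustration, not the paper's actual pipeline: examples are assumed to be (text, label) pairs, and the function name and regime labels are hypothetical.

```python
import random

def build_training_set(real, synthetic, regime, seed=0):
    """Assemble a training set under one of the three regimes
    compared in the paper. `real` and `synthetic` are lists of
    (text, label) pairs; `regime` selects which pool(s) to use."""
    rng = random.Random(seed)
    if regime == "real-only":
        data = list(real)
    elif regime == "synthetic-only":
        data = list(synthetic)
    elif regime == "hybrid":
        # Hybrid: concatenate 1k real + 1k synthetic, then shuffle
        # so the two sources are interleaved during fine-tuning.
        data = list(real) + list(synthetic)
    else:
        raise ValueError(f"unknown regime: {regime}")
    rng.shuffle(data)
    return data

# Toy stand-ins for the 1,000-example pools used in the paper.
real = [(f"real sentence {i}", i % 2) for i in range(1000)]
synthetic = [(f"synthetic sentence {i}", i % 2) for i in range(1000)]

hybrid = build_training_set(real, synthetic, "hybrid")
```

The resulting list would then be tokenized and passed to a standard fine-tuning loop; only the data composition differs between regimes.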
